Gitlab data loss incident post-mortem #74
First of all I'd like to apologize not just to the Pleroma community, but to all the spammers we give free hosting to on our server. They lost more data than anyone. ☹️
Background
Server Specs
The Pleroma GitLab server is donated to the project. It is a physical machine with a 32-core Xeon Gold 6130 CPU and 32GB of RAM. Storage is a 4TB HDD, with a secondary 4TB HDD used for storing the backup data. Bandwidth is ... pretty much unlimited, and the network itself is extremely well managed, with many layers of monitoring and security, so we wouldn't be helpless in the case of a DDoS or similar. Other people would likely notice an event at the same time we do and begin remediating it, much as they do for other internal and customer-hosted services.
Management
The server was installed with Ubuntu Server and configured by an engineer who works in a different silo from myself. Lain, myself, and a few other people at the company have ssh/sudo access to the server. Internally, the company has a really slick backup tool they deploy to every server, which sends the data to a huge internal CEPH cluster. The backup space we have access to is effectively infinite (far bigger than terabytes...). More on this later...
GitLab Installation
GitLab is installed via the Docker-based Omnibus deployment method. Due to the frequency of security updates, a few years ago I set up Watchtower and let the server essentially run on autopilot, applying the latest GitLab updates automatically.
GitLab includes its own backup tooling and a cron job is configured to run it daily.
The Incident
WTF Was Feld Doing
One item on my TODO list is to occasionally log in to the server, make sure it's healthy, clear the cobwebs, etc etc. Another has been to engineer a way to track when our GitLab is being hammered with spam accounts. This was my primary motivation for looking at the server yesterday.
While checking on the server I discovered a couple of things: the Docker container wasn't rotating logs in its current configuration, so there was a ~40GB log file, and the GitLab install had retained the old copy of the database from before the Postgres upgrade.
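For reference, Docker's default `json-file` log driver never rotates unless told to. A minimal sketch of an `/etc/docker/daemon.json` that caps container logs follows; the size and file-count values here are illustrative assumptions, not what we deployed:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  }
}
```

The daemon needs a restart for this to apply, and it only affects newly created containers; existing containers keep their original logging config until recreated.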
The Horrible Event
If you've been managing Postgres databases for a long time, you'll know that every time you run `pg_upgrade` on a database it keeps your old pre-upgrade database intact. If you're happy with the upgrade, it leaves you a script to clean up the old database: `delete_old_cluster.sh`. It's not complicated; it's just an `rm -rf`, but with an absolute path to the old data you no longer need. It constructs the path from the `pg_upgrade --old-datadir=DATADIR` flag, so you can't possibly fat-finger anything, because, you know, databases are important and deleting things in their vicinity is scary.

So I did the safe and responsible thing to clean up this data and ran the script. I did not read the script, because I already knew exactly how this script is generated and how it is written. But what I didn't know was that GitLab would do something crazy: apparently, after it upgrades the database, it renames the old production database directory, moves the new one into its place, and leaves the `delete_old_cluster.sh` script (an artifact of the previous `pg_upgrade` execution) lying around.

I executed the script, and noticed in a browser tab that GitLab returned a 500 error. And my heart sank.
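For anyone unfamiliar, the generated script really is that small. A sketch of roughly how `pg_upgrade` emits it (the datadir path below is a stand-in for illustration, not our real one):

```shell
# Emulate how pg_upgrade writes delete_old_cluster.sh: a one-line rm -rf
# hard-coded to the --old-datadir path. Paths here are illustrative.
old_datadir="/tmp/pg_old_demo"
script="/tmp/delete_old_cluster.sh"

printf '#!/bin/sh\n\nrm -rf %s\n' "'$old_datadir'" > "$script"
chmod +x "$script"

# The generated body is just: rm -rf '/tmp/pg_old_demo'
cat "$script"
```

The same mechanism that makes this script safe after a normal `pg_upgrade` is what made it destructive here: the baked-in absolute path still existed, but the directory at that path had become the live production data.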
Fear not, we have backups. At worst we lost an issue or two since the last backup. I can restore this pretty quickly, it's not a big deal.
I looked at the backups directory and the timestamps were February 2022. That cool backup tool? Instead of being configured to back up the entire OS like usual, it was only configured to ingest the GitLab backups directory. The backup server also only had the February 2022 data... so I couldn't restore the Postgres data files even from an unclean, less-than-24-hour-old data capture, which would also have been useful.
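This is exactly the failure a dumb freshness check would have caught years earlier. A minimal sketch, where the directory layout, filename pattern, and age threshold are all assumptions:

```shell
# Alert if the newest GitLab backup tarball is older than N days.
# Directory and filename pattern are assumptions for illustration.
check_backup_freshness() {
  dir="$1"
  max_age_days="${2:-2}"
  # find prints any matching file modified within the window; grep -q
  # succeeds (exit 0) only if at least one such file exists.
  find "$dir" -name '*_gitlab_backup.tar' -mtime -"$max_age_days" 2>/dev/null | grep -q .
}

# Demo against a throwaway directory with one fresh tarball.
demo=$(mktemp -d)
touch "$demo/1738800000_2025_02_06_17.8.1_gitlab_backup.tar"
if check_backup_freshness "$demo" 2; then
  echo "OK: recent backup present"
else
  echo "ALERT: backups are stale" >&2
fi
```

Wired into cron with an alert on failure, something like this would have flagged the dead backups back in early 2022 instead of mid-incident.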
Now I'm panicking. Even with fancy tricks to restore files from open filehandles, there was still too much guaranteed to be lost that no process still held open.
The engineer configured to receive the GitLab backup cron emails (which must have been reporting an error, or the jobs simply stopped executing) is no longer working with us. I was not aware of this; I usually only crossed paths with him a few times a year.
All I had was this Postgres 14 database copy from the pre-Postgres 16 upgrade that came with GitLab 18 in April. And that's how we got to where we are today.
Moving Forward
I'm choosing to look at this as a wakeup call because a hardware failure would have meant losing years of data. We'll fix that.
The Plan
It's clear we can't just leave this thing on autopilot patching itself at the GitLab and OS level without risking damage to the Pleroma community, so here's what I've got on the schedule:
Thank You
Thank you to Lain who took this bad news with a positive attitude because we didn't lose code, just some project metadata around issues/MRs.
Thank you to HJ and everyone else who is pitching in to clean up the GitLab and get things back on track.
Exfiltrating our data from GitLab into something with simpler management is unlikely to happen. We're likely stuck with GitLab for the foreseeable future.
I am mirroring Pleroma and Pleroma-FE into my own Gitea and it is possible to migrate MRs, Issues, and Wiki into Gitea / Forgejo. However, you cannot mirror into them; it has to be a one-shot migration.
Due to the inability to lock GitLab into a read-only Maintenance Mode, this would be a little tricky if we ever wanted to do a clean cutover. Simplest would be to wipe out everyone's sessions and passwords, then import. But we'd lose user accounts, as that's not a supported part of the migration...
Documenting the status of the Issue/MR numbers:
As restored from snapshot:
Reality was:
Notes from today:
There's a dying(?) disk that's also causing trouble, and the backup job is stressing it pretty hard. A second disk exists that was never used, so I'm trying to get the server back to the point where I can run the backup job again and have it dump onto the other disk.
Dumping to the other disk worked, SMART looks concerning but I'm not 100% certain it's actually dying. It could very well be a controller issue or something else.
There aren't any scary things in dmesg, and the other HDD also has weird numbers in its SMART data even though that disk was basically idle. I've seen controller/firmware issues or even just bad cabling trigger those read/ECC errors; they're mostly harmless, but they obviously have a performance impact.
OS has been upgraded to Ubuntu Jammy LTS so we have a better security posture.
We had additional disks added, and the SMART errors showed up there too. We had remote hands reseat the hardware, and that has not fixed it either.
Next step is hardware swaps. This is being scheduled.
Re-enabled the automatic Postgres 16 upgrade for GitLab 17.11.7 and ran it to completion. It worked.
Upgrade to GitLab v18.0.0 was successful
Upgrade to GitLab v18.1.0 was successful
Upgrade to GitLab v18.2.0 failed and just caused HTTP 500 errors, but the upgrade to v18.2.7 worked (and ran a lot of migrations).
Upgrade to GitLab v18.3.3 was successful.
I do not see a point in upgrading to v18.4.1 at this time.
I think a one-shot migration might be worth at least preparing for. The annoying bit is that our OTP upgrade scripts are GitLab-centric, and the same goes for the frontend upgrade bits in pleroma-fe; I think we should reduce that coupling anyway, since GitLab could just break its API.
Like, we could drop an index file (JSON and/or Atom feeds giving version + date + download URL) onto the website or a similar file server, and maybe also push the tarballs there.
Continuing to put notes in here about changes until I'm certain we're officially done with this whole incident. We still might have a server migration, due to the storage controller (or whatever it is) causing so many SMART errors.
Upgrade to GitLab v18.3.5 was successful.
The container registry has been migrated to the new database-backed metadata method which will allow online garbage collection and better maintenance. Our registry is very large (~250GB backed up) and we need to slim that down.
The migration was done following these instructions: https://git.pleroma.social/help/administration/packages/container_registry_metadata_database.md
I've enabled cleanup of the registry now. It will run daily and remove all images older than 90 days except the ones that match this regex:
`(?:v.+|main|develop|stable|release.*|elixir.*)`

The configuration page for this is here: https://git.pleroma.social/pleroma/pleroma/-/settings/packages_and_registries/cleanup_image_tags
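Before letting a cleanup policy loose, it's worth dry-running the keep pattern against sample tag names. GitLab anchors the expression itself; the sketch below adds `^...$` and rewrites the PCRE-only `(?:...)` group as a plain capturing group so ordinary `grep -E` can evaluate the same alternatives (the sample tags are made up):

```shell
# Dry-run the keep-tags pattern against sample tag names.
# Capturing group replaces the PCRE (?:...) so grep -E works.
keep='^(v.+|main|develop|stable|release.*|elixir.*)$'
for tag in v2.7.0 main develop stable release-2.9 elixir-1.14 pr-1234 nightly; do
  if printf '%s\n' "$tag" | grep -Eq "$keep"; then
    echo "$tag: kept"
  else
    echo "$tag: deleted after 90 days"
  fi
done
```

Here `pr-1234` and `nightly` fall through to cleanup while the release-style tags are all retained, which matches the intent of the policy.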
Our artifacts storage is out of control too, but artifacts are excluded from backup. I added a CI rule to expire them after 1 week. I've also run a server-wide cleanup of the artifacts by running this Ruby code on the Rails console:
PleromaFE uses `master` for its stable branch.

Not sure if it was the restore or this, but I think that just tossed out our OTP binaries for the stable branch / releases.
Correct, but you don't generate container images in the PleromaFE repo, so it doesn't matter. I originally had the same thought.
The way GitLab manages artifacts is a mess. I think we have to update our gitlab-ci.yml to make sure those specific artifacts are flagged to be kept indefinitely. Investigating so we can correct this permanently.
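A sketch of the kind of gitlab-ci.yml change I mean, with assumed job and path names; `artifacts:expire_in: never` is GitLab's way of opting specific artifacts out of the default expiry:

```yaml
# Hypothetical release job: keep the OTP release tarballs forever,
# while ordinary jobs fall back to the short default expiry.
release-otp:
  script:
    - ./rel/build.sh   # assumed build step, not our actual script
  artifacts:
    paths:
      - release/
    expire_in: never
```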
We can backfill those OTP binaries by re-running the pipelines, but our CI configuration has workflow rules that block the ability to manually trigger the pipeline to do it on those old tags/releases. I've updated the CI to allow it going forward, but when you trigger a pipeline for a specific branch/tag it uses the CI rules as they exist in that branch/tag. Very annoying.
I've found I can trick it into running them by executing the pipeline manually from the Rails console and lying by saying it's triggered by a merge request:
This is what I'll have to do to backfill these.
edit: hmm no, this doesn't run those release jobs because of these rules:
We've kind of boxed ourselves into a corner here. Was our intention to only publish the latest OTP binaries? I really can't remember, and I can't see where we'd link to the OTP for older releases anyway. Perhaps just forcing the pipeline on stable and develop is good enough.
In the process of trying to get a pipeline to succeed on the develop branch, I had to deal with the api-docs deployment, which was failing (spec-deploy). I discovered that the api-docs CI job had a shell script to fetch spec.json from a URL that was 404ing. The URL didn't work because it doesn't download the latest artifact for that pipeline; it's a known bug. So we were never actually fetching the correct artifact to build the api-docs site. Another API endpoint I found returns the most recent successful pipeline run, not the one in progress... what we really needed was the internal job ID to construct the URL to the spec.json that was just built, and I came up with a method to do that via trigger jobs and dotenv storage between jobs.
https://gitlab.com/gitlab-org/gitlab/-/issues/20230
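The dotenv trick looks roughly like this (job names and artifact path are assumptions): the building job records its own `CI_JOB_ID` into a dotenv report, and the downstream job receives it as a variable, letting it construct a URL to that exact job's artifact instead of relying on "latest":

```yaml
# Sketch: pass the builder's job ID downstream via a dotenv report
# so the deploy job can fetch that exact job's artifact.
build-spec:
  script:
    - ./generate-spec.sh   # assumed: produces spec.json
    - echo "SPEC_JOB_ID=${CI_JOB_ID}" >> build.env
  artifacts:
    paths:
      - spec.json
    reports:
      dotenv: build.env

spec-deploy:
  needs: ["build-spec"]
  script:
    # SPEC_JOB_ID was injected from build-spec's dotenv report.
    - curl -fL -o spec.json "${CI_SERVER_URL}/${CI_PROJECT_PATH}/-/jobs/${SPEC_JOB_ID}/artifacts/raw/spec.json"
```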