
GitLab data loss incident post-mortem

First of all I'd like to apologize not just to the Pleroma community, but to all the spammers we give free hosting to on our server. They lost more data than anyone. ☹️

Background

Server Specs

The Pleroma GitLab server is donated to the project. It's a physical machine with a 32-core Xeon Gold 6130 CPU and 32GB of RAM. Storage is a 4TB HDD, with a secondary 4TB HDD used for storing the backup data. Bandwidth is ... pretty much unlimited, and the network itself is extremely well managed with lots of layers of monitoring and security, so we wouldn't be helpless in the case of a DDoS or something. Other people would likely notice an event at the same time we do and begin remediating it, much like they do for other internal and customer-hosted services.

Management

The server was installed with Ubuntu Server and configured by an engineer who works in a different silo from me. Lain, myself, and a few other people at the company have SSH/sudo access to the server. Internally the company has a really slick backup tool they deploy to every server, which sends the data to a huge internal Ceph cluster. The backup space we have access to is ~ infinite (bigger than terabytes...). More on this later...

GitLab Installation

GitLab is installed via the Docker-based Omnibus deployment method. Due to the frequency of security updates, a few years ago I set up Watchtower and let the server essentially run on autopilot, applying the latest GitLab updates automatically.
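
For the curious, the setup is roughly this shape; this is a minimal sketch rather than our exact invocation, with illustrative paths and container names:

    # GitLab CE via the Omnibus Docker image (ports/volumes per the standard docs)
    docker run -d --name gitlab \
      -p 80:80 -p 443:443 -p 22:22 \
      -v /srv/gitlab/config:/etc/gitlab \
      -v /srv/gitlab/logs:/var/log/gitlab \
      -v /srv/gitlab/data:/var/opt/gitlab \
      gitlab/gitlab-ce:latest

    # Watchtower polls for new images and redeploys running containers automatically
    docker run -d --name watchtower \
      -v /var/run/docker.sock:/var/run/docker.sock \
      containrrr/watchtower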

GitLab includes its own backup tooling and a cron job is configured to run it daily.
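
With the Docker-based install, that boils down to a crontab entry along these lines (the schedule and container name here are illustrative):

    # daily GitLab backup via the bundled tooling; CRON=1 suppresses progress output
    0 2 * * * docker exec -t gitlab gitlab-backup create CRON=1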

The Incident

WTF Was Feld Doing

One standing item on my TODO list is to occasionally log in to the server, make sure it's healthy, clear the cobwebs, etc etc. Another item on my TODO list has been to engineer a way to track when our GitLab is being hammered with spam accounts. This was my primary motivation for looking at the server yesterday.

📢🤬 I just want to take this moment to send a big F*CK YOU to GitLab for hobbling the Community Edition. They don't just withhold features, they withhold important tools that would make it possible to sanely manage a server with open registrations. We can't bulk-delete accounts, we don't get access to lock the instance into a read-only Maintenance Mode, and they take away ALL the system-level webhooks that could alert you to suspicious activity. They have put no effort into solving the GitLab spam that plagues the entire internet because they don't care about anything but their hosted/centralized instance.

While checking on the server I discovered a couple of things: the Docker container wasn't rotating logs in its current configuration, so there was a ~40GB log file, and the GitLab install had retained the old copy of the database from the Postgres upgrade.
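
The log problem at least has an easy fix: give Docker's json-file log driver a rotation policy. A minimal sketch, e.g. in /etc/docker/daemon.json (the sizes are arbitrary, and existing containers have to be recreated to pick it up):

    {
      "log-driver": "json-file",
      "log-opts": {
        "max-size": "100m",
        "max-file": "5"
      }
    }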

The Horrible Event

If you've been managing Postgres databases for a long time, you'll know that every time you run pg_upgrade it keeps your old pre-upgrade database intact. If you're happy with the upgrade, it leaves you a script to clean up the old database: delete_old_cluster.sh. It's not complicated, it's just an rm -rf, but with an absolute path to the old data you no longer need. It constructs the path from the pg_upgrade --old-datadir=DATADIR flag so you can't possibly fat-finger anything, because, you know, databases are important and deleting things in their vicinity is scary.
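
The generated script is about as small as a script can be; it looks something like this (the path here is just an example):

    #!/bin/sh
    # the path is whatever --old-datadir pointed at when pg_upgrade ran
    rm -rf '/var/lib/postgresql/13/main'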

So I did the safe and responsible thing to clean up this data and ran the script. I did not read the script, because I already know exactly how this script is generated and how it is written. But what I didn't know was that GitLab would do something crazy: apparently after it upgrades the database, it renames the old production database directory, moves the new one into its place, and leaves the delete_old_cluster.sh script, an artifact of the previous pg_upgrade execution, lying around...

I executed the script, and noticed in a browser tab that GitLab was returning a 500 error. And my heart sank.

Fear not, we have backups. At worst we lost an issue or two since the last backup. I can restore this pretty quickly, it's not a big deal.

I look at the backups directory and the timestamps are February 2022. That cool backup tool? Instead of being configured to back up the entire OS like usual, it was only configured to ingest the GitLab backups directory. The backup server also only had the February 2022 data... so I couldn't even restore the Postgres data files from an unclean but < 24hr-old data capture, which would have been useful as well.

Now I'm panicking. Even with fancy tricks to restore files from open file handles, there's still too much that's guaranteed to be lost because it wasn't still held open by any process.
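
For the unfamiliar: a deleted file isn't actually gone while some process still holds it open, so you can sometimes copy it back out through /proc. A rough sketch (the pid and fd numbers here are illustrative):

    # list deleted-but-still-open files held by the postgres processes
    for pid in $(pgrep postgres); do
      ls -l /proc/"$pid"/fd 2>/dev/null | grep '(deleted)'
    done

    # copy a still-open file back out, e.g. file descriptor 42 of pid 1234
    cp /proc/1234/fd/42 /recovery/recovered_relation_file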

The engineer who was configured to receive the GitLab backup cron emails, which must have either been reporting errors or stopped going out entirely, is no longer working with us. I was not aware of this; I usually only crossed paths with him a few times a year.

🫤💭 I pride myself on not losing data, and the last time it happened in any serious capacity was around 2011, when an rsync I ran missed important files for a client because someone had done evil things with symlinks, sharding data across several filesystems to work around a low-storage situation. And that's why I never run rsync without the -L flag (transform symlink into referent file/dir) anymore when I am doing a "backup" of a path; that trauma is burned into my brain forever. This tip might save your butt someday too, so put that in your toolbox for later.
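
In practice the difference is just this (paths are illustrative):

    # copies the symlinks themselves; data they point to on other filesystems is missed
    rsync -a /srv/data/ /mnt/backup/data/

    # -L (--copy-links) follows symlinks and copies the files/dirs they point to instead
    rsync -aL /srv/data/ /mnt/backup/data/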

All I had was the Postgres 14 database copy from before the Postgres 16 upgrade that came with GitLab 18 in April. And that's how we got to where we are today.

Moving Forward

I'm choosing to look at this as a wake-up call, because a hardware failure would have meant losing years of data. We'll fix that.

The Plan

It's clear we can't just leave this thing on autopilot patching itself at the GitLab and OS level without risking damage to the Pleroma community, so here's what I've got on the schedule:

  • Fix Weblate integration
  • Fix and monitor backups. I ran a manual one and it worked; now to figure out why they weren't running daily.
  • Get the next Pleroma release out
  • Further distribute backups. Perhaps we set up Syncthing and let a few team members hoard copies of the data too.
  • Investigate altering the sequence IDs so new MRs/issues don't overlap with the lost ones, as that would just be confusing. But maybe we don't care.
  • Do an OS upgrade as we should jump to the next Ubuntu LTS
  • Upgrade GitLab to 18 again, which requires a Postgres 16 database upgrade.
  • Possibly change the GitLab config to use an externally/self-managed Postgres server, which would make it reasonable to do replication, WAL archiving for point-in-time recovery (PITR), etc.; see the sketch after this list. It's not simple to do this with GitLab managing Postgres, as they'll overwrite your Postgres configuration files; they instead want to push you toward replicating everything to a second GitLab server, which is overkill for our needs (and probably not even supported for the Community Edition).
  • Set up some monitoring for spam accounts now that I have a tool I think will work well for this
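
On the external Postgres idea: the point-in-time recovery part is mostly stock Postgres configuration once we control the config files. A minimal sketch of the relevant postgresql.conf knobs, assuming a self-managed server and an archive location of our choosing (paths are illustrative):

    # postgresql.conf: continuously archive WAL segments as they fill up
    wal_level = replica
    archive_mode = on
    archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'

    # pair the WAL archive with periodic base backups, e.g.:
    #   pg_basebackup -D /backup/base -Ft -z -P
    # recovery = restore a base backup, then replay WAL up to the chosen point in time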

Thank You

Thank you to Lain, who took this bad news with a positive attitude because we didn't lose code, just some project metadata around issues/MRs.

Thank you to HJ and everyone else who is pitching in to clean up the GitLab and get things back on track.
