Gitlab data loss incident post-mortem #74

Open
opened 2025-09-16 15:26:33 +00:00 by feld · 20 comments
Owner

First of all I'd like to apologize not just to the Pleroma community, but to all the spammers we give free hosting to on our server. They lost more data than anyone. ☹️

Background

Server Specs

The Pleroma GitLab server is donated to the project. It's a physical machine with a 32-core Xeon Gold 6130 CPU and 32GB of RAM. Storage is a 4TB HDD, with a secondary 4TB HDD used for storing the backup data. Bandwidth is ... pretty much unlimited, and the network itself is extremely well managed with lots of layers of monitoring and security, so we wouldn't be helpless in the case of a DDoS or something. Other people would likely notice an event is happening at the same time we do, and they'd begin remediating it much like they do for other internal and customer-hosted services.

Management

The server was installed with Ubuntu Server and configured by an engineer who works in a different silo from mine. Lain, myself, and a few other people at the company have SSH/sudo access to the server. Internally, the company has a really slick backup tool they deploy to every server, which sends the data to a huge internal Ceph cluster. The backup space we have access to is effectively infinite (bigger than terabytes...). More on this later...

GitLab Installation

GitLab is installed via the Docker-based Omnibus deployment method. Due to the frequency of security updates, a few years ago I deployed Watchtower and let the server essentially run on autopilot, applying the latest GitLab updates automatically.

GitLab includes its own backup tooling and a cron job is configured to run it daily.
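For reference, the daily job for an Omnibus-in-Docker install is a one-liner. This is a hypothetical sketch (the container name, schedule, and alert address are all assumptions, not our actual config); `gitlab-backup create` is the documented Omnibus backup command, and `CRON=1` suppresses progress output so mail only fires on real problems:

```shell
# Hypothetical crontab entry; "gitlab" is the assumed container name.
# Run daily at 02:00; on failure, mail an address people actually watch.
0 2 * * * docker exec -t gitlab gitlab-backup create CRON=1 || echo "GitLab backup failed on $(hostname)" | mail -s "gitlab backup FAILED" ops@example.com
```

As this incident shows, the failure mail should go to a team alias rather than one engineer's inbox.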

The Incident

WTF Was Feld Doing

My TODO list includes occasionally logging in to the server to make sure it's healthy, clear the cobwebs, etc etc. Another item on that list has been to engineer a way to track when our GitLab is being hammered with spam accounts. This was my primary motivation for looking at the server yesterday.

📢🤬 I just want to take this moment to send a big F*CK YOU to GitLab for hobbling the Community Edition. They don't just withhold features, they withhold important tools that would make sanely managing a server with open registrations possible. We can't bulk delete accounts, we don't get access to lock the instance into a read-only Maintenance Mode, and they take away ALL the system-level webhooks that could alert you to suspicious activity. They have put no effort into solving the GitLab spam that plagues the entire internet because they don't care about anything but their hosted/centralized instance.

While checking on the server I discovered a couple of things: the Docker container wasn't rotating logs in its current configuration, so there was a ~40GB log file, and the GitLab install had retained the old copy of the database from the Postgres upgrade.

The Horrible Event

If you've been managing Postgres databases for a long time, you'll know that every time you run `pg_upgrade` on a database it keeps your old pre-upgrade database intact. If you're happy with the upgrade, it leaves you a script to clean up the old database: `delete_old_cluster.sh`. It's not complicated; it's just an `rm -rf`, but with an absolute path to the old data you no longer need. It constructs the path from the `pg_upgrade --old-datadir=DATADIR` flag, so you can't possibly fat-finger anything because, you know, databases are important and deleting things in their vicinity is scary.

So I did the safe and responsible thing to clean up this data and ran the script. I did not read the script, because I already know exactly how it is generated and what it contains. But what I didn't know was that GitLab would do something crazy: apparently, after it upgrades the database, it renames the old production database directory, moves the new one into its place, and leaves behind the `delete_old_cluster.sh` script, an artifact of the previous `pg_upgrade` execution...
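To make the trap concrete, here's a sandboxed reconstruction of what such a script boils down to. The paths here are temp directories, not real ones; the real script holds whatever absolute path `--old-datadir` pointed at during the *last* `pg_upgrade` run, not whatever happens to be "old" today:

```shell
set -eu
# Sandbox stand-ins for the old data directory and its cleanup script.
work=$(mktemp -d)
old="$work/pg14-main"
mkdir -p "$old" && touch "$old/PG_VERSION"

# pg_upgrade's generated cleanup script is essentially just this:
# a single rm -rf over a hard-coded absolute path.
cat > "$work/delete_old_cluster.sh" <<EOF
#!/bin/sh
rm -rf '$old'
EOF

sh "$work/delete_old_cluster.sh"          # one command, no confirmation
test ! -d "$old" && echo "old cluster removed"
```

Nothing in the script checks that the path still refers to data you actually want gone, which is exactly why a stale copy lying around after a directory swap is so dangerous.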

I executed the script, then noticed in a browser tab that GitLab was returning a 500 error. And my heart sank.

Fear not, we have backups. At worst we lost an issue or two since the last backup. I can restore this pretty quickly, it's not a big deal.

I look at the backups directory and the timestamps are February 2022. That cool backup tool? Instead of being configured to back up the entire OS like usual, it was only configured to ingest the GitLab backups directory. The backup server also only had the February 2022 data... so I couldn't even restore the raw Postgres data files from an unclean < 24hr-old capture, which would have been useful as well.

Now I'm panicking. Even with fancy tricks to recover files from open filehandles, too much of the data was guaranteed to be lost because it was no longer held open by any process.

The engineer who was configured to receive the GitLab backup cron emails (which must have been reporting an error, or the job simply stopped executing) is no longer working with us. I was not aware of this; I usually only crossed paths with him a few times a year.

🫤💭 I pride myself on not losing data. The last time it happened in any serious capacity was around 2011, when an rsync I ran missed important files for a client because someone had done evil things with symlinks, sharding data across several filesystems to solve a low-storage situation. That's why I never run rsync without the `-L` flag (transform symlink into referent file/dir) anymore when I am doing a "backup" of a path; that trauma is burned into my brain forever. This tip might save your butt someday too, so put that in your toolbox for later.

All I had was the Postgres 14 database copy left over from the Postgres 16 upgrade that came with GitLab 18 in April. And that's how we got to where we are today.

Moving Forward

I'm choosing to look at this as a wake-up call, because a hardware failure would have meant losing years of data. We'll fix that.

The Plan

It's clear we can't just leave this thing on autopilot patching itself at the GitLab and OS level without risking damage to the Pleroma community, so here's what I've got on the schedule:

  • Fix Weblate integration
  • Fix and monitor backups. I ran a manual one and it worked; now to figure out why they weren't running daily.
  • Get the next Pleroma release out
  • Further distribute backups. Perhaps we set up Syncthing and let a few team members hoard copies of the data too.
  • Investigate altering the sequence IDs so new MRs/Issues don't overlap with the lost ones, as that would just be confusing. But maybe we don't care.
  • Do an OS upgrade, as we should jump to the next Ubuntu LTS
  • Upgrade GitLab to 18 again, which requires a Postgres 16 database upgrade.
  • Possibly change the GitLab config to use an externally/self-managed Postgres server, which would make it reasonable to do replication, WAL archiving for PITR recovery, etc. It's not simple to do this with GitLab managing Postgres, as they'll overwrite your Postgres configuration files; they instead want to push you to replicate everything to a second GitLab server, which is overkill for our needs (and probably not even supported for the Community Edition).
  • Set up some monitoring for spam accounts, now that I have a tool I think will work well for this
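On the "fix and monitor backups" item: a minimal freshness check is the kind of thing that would have caught this years earlier. A sketch, with the directory and filename pattern as assumptions (Omnibus drops `*_gitlab_backup.tar` files into its backup path by default):

```shell
set -eu
# Hypothetical freshness check: fail loudly if the newest backup tarball
# in a directory is more than a day old.
check_backups() {
    dir=$1
    newest=$(find "$dir" -name '*.tar' -mtime -1 -print -quit)
    if [ -z "$newest" ]; then
        echo "STALE: no backup newer than 24h in $dir" >&2
        return 1
    fi
    echo "OK: $newest"
}

# Demo against a sandbox directory holding one freshly-written backup:
demo=$(mktemp -d)
touch "$demo/1758212345_2025_09_18_gitlab_backup.tar"
check_backups "$demo"
```

Wired into cron or a monitoring agent, the non-zero exit on staleness is what pages someone, instead of relying on a cron email to a departed engineer.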

Thank You

Thank you to Lain who took this bad news with a positive attitude because we didn't lose code, just some project metadata around issues/MRs.

Thank you to HJ and everyone else who is pitching in to clean up the GitLab and get things back on track.

Author
Owner

Exfiltrating our data from GitLab into something with simpler management is unlikely to happen. We're likely stuck with GitLab for the foreseeable future.

I am mirroring Pleroma and Pleroma-FE into my own Gitea and it is possible to migrate MRs, Issues, and Wiki into Gitea / Forgejo. However, you cannot mirror into them; it has to be a one-shot migration.

Due to the inability to lock the GitLab into a read-only Maintenance Mode, this would be a little tricky if we ever wanted to do a clean cutover. Simplest would be to wipe out everyone's sessions and passwords, then import. But we'd lose user accounts, as that's not a supported part of the migration...

Author
Owner

Documenting the status of the Issue/MR numbers:

As restored from snapshot:

  • BE MR highest is 4353, Issue 3374
  • FE MR highest is 2059, Issue 1373

Reality was:

  • BE MR highest was 4414, Issue 3399
  • FE MR highest was 2251, Issue 1390
Author
Owner

Notes from today:

There's a dying(?) disk that's also causing trouble(?), and the backup job is stressing it pretty hard. A second disk exists that was never used, so I'm trying to get the server to the point where I can run the backup job again and have it dump onto the other disk.

Dumping to the other disk worked. SMART looks concerning, but I'm not 100% certain the disk is actually dying; it could very well be a controller issue or something else.

There aren't any scary things in dmesg, and the other HDD shows similarly weird numbers in its SMART output even though that disk was basically idle. I've seen controller/firmware issues, or even just bad cabling, trigger those read/ECC errors; they're mostly harmless, but they obviously have a performance impact.

OS has been upgraded to Ubuntu Jammy LTS so we have a better security posture.
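One thing the SMART dumps below make obvious: neither drive has ever run a self-test. Running one would help separate "dying disk" from "controller/cabling noise". The usual smartctl incantation, with `/dev/sdX` as a placeholder for the suspect drive:

```shell
# /dev/sdX is a placeholder for the suspect drive.
smartctl -t long /dev/sdX     # start a full-surface self-test in the background
# The drive estimates ~400 minutes; afterwards, read the results with:
smartctl -l selftest /dev/sdX
# Also worth watching: UDMA_CRC_Error_Count, which cabling/link problems
# tend to bump. Both drives report 0 there, which argues against a bad cable.
```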

Author
Owner
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Device Model:     ST4000NM0035-1V4107
Serial Number:    ZC17XH9C
LU WWN Device Id: 5 000c50 0b2928b80
Firmware Version: TN04
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep 18 16:20:16 2025 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  575) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 402) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   070   064   044    Pre-fail  Always       -       9558720
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1322525628
  9 Power_On_Hours          0x0032   032   032   000    Old_age   Always       -       59887
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       7
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   067   040    Old_age   Always       -       30 (Min/Max 21/33)
191 G-Sense_Error_Rate      0x0032   001   001   000    Old_age   Always       -       228933
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1905
193 Load_Cycle_Count        0x0032   094   094   000    Old_age   Always       -       12236
194 Temperature_Celsius     0x0022   030   040   000    Old_age   Always       -       30 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   001   001   000    Old_age   Always       -       9558720
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       58291h+02m+39.222s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1382169102494
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1582502032306

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Device Model:     ST4000NM0035-1V4107
Serial Number:    ZC17X3TQ
LU WWN Device Id: 5 000c50 0b28dab77
Firmware Version: TN04
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep 18 16:20:35 2025 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 404) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   080   064   044    Pre-fail  Always       -       100743573
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       15
  7 Seek_Error_Rate         0x000f   087   060   045    Pre-fail  Always       -       455425940
  9 Power_On_Hours          0x0032   032   032   000    Old_age   Always       -       59887
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       7
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   068   040    Old_age   Always       -       26 (Min/Max 20/30)
191 G-Sense_Error_Rate      0x0032   021   021   000    Old_age   Always       -       158568
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       647
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       215163
194 Temperature_Celsius     0x0022   026   040   000    Old_age   Always       -       26 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   054   001   000    Old_age   Always       -       100743573
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       31496h+31m+34.949s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1314904999641
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1355898468103

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```

```
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Device Model:     ST4000NM0035-1V4107
Serial Number:    ZC17X3TQ
LU WWN Device Id: 5 000c50 0b28dab77
Firmware Version: TN04
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Sep 18 16:20:35 2025 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  584) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 404) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   080   064   044    Pre-fail  Always       -       100743573
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       15
  7 Seek_Error_Rate         0x000f   087   060   045    Pre-fail  Always       -       455425940
  9 Power_On_Hours          0x0032   032   032   000    Old_age   Always       -       59887
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       7
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   068   040    Old_age   Always       -       26 (Min/Max 20/30)
191 G-Sense_Error_Rate      0x0032   021   021   000    Old_age   Always       -       158568
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       647
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       215163
194 Temperature_Celsius     0x0022   026   040   000    Old_age   Always       -       26 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   054   001   000    Old_age   Always       -       100743573
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       31496h+31m+34.949s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1314904999641
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1355898468103

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
```
Author
Owner

We had additional disks added and the SMART errors showed up on those too. We had remote hands reseat the hardware, and that hasn't fixed it either.

Next step is hardware swaps. This is being scheduled.

Author
Owner

Re-enabled the automatic PostgreSQL 16 upgrade for GitLab 17.11.7 and ran it to completion. It worked.

Author
Owner

Upgrade to GitLab v18.0.0 was successful.

Author
Owner

Upgrade to GitLab v18.1.0 was successful.

Author
Owner

Upgrade to GitLab v18.2.0 failed and just left the site serving HTTP 500 errors, but the upgrade to v18.2.7 worked (and ran a lot of migrations).

Author
Owner

Upgrade to GitLab v18.3.3 was successful.

I do not see a point in upgrading to v18.4.1 at this time.

Owner

I think a one-shot migration might be worth at least preparing for. The annoying bit is that our OTP upgrade scripts are GitLab-centric, and the same goes for the frontend upgrade bits in pleroma-fe; I think we should reduce that coupling anyway, since GitLab could just break its API.

For example, we could drop an index file (JSON and/or Atom feeds giving version + date + download URL) onto the website or a similar file server, and maybe also push the tarballs there.
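For illustration, a minimal sketch of what such a JSON index might look like; the field names and URL are placeholders, not a decided format:

```
require "json"

# Hypothetical index entry; keys and the download URL are assumptions.
releases = [
  {
    "version" => "2.9.1",
    "date"    => "2025-09-16",
    "url"     => "https://example.org/releases/pleroma-amd64-2.9.1.zip"
  }
]
index = JSON.pretty_generate({ "releases" => releases })
```

An OTP upgrade script would then only need to fetch and parse this one static file instead of talking to the GitLab API.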

Author
Owner

I'll continue putting notes in here until I'm certain we're officially done with this whole incident. We still might do a server migration, since the storage controller (or whatever it is) is causing so many SMART errors.

Upgrade to GitLab v18.3.5 was successful.

Author
Owner

The container registry has been migrated to the new database-backed metadata method which will allow online garbage collection and better maintenance. Our registry is very large (~250GB backed up) and we need to slim that down.

The migration was done following these instructions: https://git.pleroma.social/help/administration/packages/container_registry_metadata_database.md

I've enabled cleanup of the registry now. It will run daily and remove all images older than 90 days, except tags that match this regex: `(?:v.+|main|develop|stable|release.*|elixir.*)`

The configuration page for this is here: https://git.pleroma.social/pleroma/pleroma/-/settings/packages_and_registries/cleanup_image_tags
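As a quick sanity check on that pattern, here is a plain-Ruby sketch (run locally, not on the server) of which sample tags would survive; the tag list is made up, and the `\A`/`\z` anchors imitate GitLab matching the regex against the whole tag name:

```
# Hypothetical tag names; anchors mimic GitLab's full-tag matching.
keep = /\A(?:v.+|main|develop|stable|release.*|elixir.*)\z/
tags = %w[v2.9.1 main develop stable release-2.9 elixir-1.18 master latest ci-test]
kept = tags.select { |tag| tag.match?(keep) }
# kept == ["v2.9.1", "main", "develop", "stable", "release-2.9", "elixir-1.18"]
```

Note that a tag named `master` would not be kept by this pattern.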

Author
Owner

Our artifacts storage is out of control too, but artifacts are excluded from backup. I added a CI rule to expire them after 1 week. I've also run a server-wide cleanup of old artifacts by running this Ruby code on the Rails console:

```
builds = Ci::Build.all
admin_user = User.find_by(username: 'feld')

builds.where("finished_at < ?", 1.week.ago).each_batch do |batch|
  batch.each do |build|
    print "Ci::Build ID #{build.id}... "

    if build.erasable?
      Ci::BuildEraseService.new(build, admin_user).execute
      puts "Erased"
    else
      puts "Skipped (Nothing to erase or not erasable)"
    end
  end
end
```
Owner

> `(?:v.+|main|develop|stable|release.*|elixir.*)`

PleromaFE uses `master` for its stable branch.

Owner

Not sure if it was the restore or this cleanup, but I think that just tossed out our OTP binaries for the stable branch / releases.

Author
Owner

Correct, but we don't generate container images in the PleromaFE repo, so it doesn't matter. I originally had the same thought.

Author
Owner

The way GitLab manages artifacts is a mess. I think we have to update our `.gitlab-ci.yml` to make sure *those* specific artifacts are flagged to be kept indefinitely. Investigating so we can correct this permanently.
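If the fix lands in `.gitlab-ci.yml`, the relevant knob is `artifacts:expire_in: never`; a sketch only, with an illustrative job name and path rather than our actual configuration:

```
release-otp:
  artifacts:
    paths:
      - release/
    expire_in: never
```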

Author
Owner

We can backfill those OTP binaries by re-running the pipelines, but our CI configuration has workflow rules that block manually triggering the pipeline on those old tags/releases. I've updated the CI to allow it going forward, but when you trigger a pipeline for a specific branch/tag it uses the CI rules as they exist in that branch/tag. Very annoying.

I've found I can trick it into running them by executing the pipeline manually from the Rails console and lying by saying it was triggered by a merge request:

```
project = Project.find_by_full_path('pleroma/pleroma')
user = User.find_by_username('feld')
Ci::CreatePipelineService.new(project, user, ref: 'v2.9.1').execute(:merge_request_event)
```

This is what I'll have to do to backfill these.

edit: hmm, no, this doesn't run those release jobs because of these rules:

```
  only: &release-only
  - stable@pleroma/pleroma
  - develop@pleroma/pleroma
  - /^maint/.*$/@pleroma/pleroma
  - /^release/.*$/@pleroma/pleroma
```

We've kind of boxed ourselves into a corner here. Was our intention to only publish the latest OTP binaries? I really can't remember, and I can't see where we'd link to the OTP for older releases anyway. Perhaps just forcing the pipeline on stable and develop is good enough.
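For the "going forward" part, the usual pattern is a `workflow:rules` entry permitting web-triggered runs; a sketch under the assumption we keep the existing pipeline sources (the repo's real rules may differ):

```
workflow:
  rules:
    - if: '$CI_PIPELINE_SOURCE == "web"'        # allow manual runs from the UI
    - if: '$CI_PIPELINE_SOURCE == "push"'
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
```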

Author
Owner

While trying to get a pipeline to succeed on the develop branch, I had to deal with the api-docs deployment, which was failing (spec-deploy). I discovered that the api-docs CI job had a shell script fetching spec.json from a URL that was 404ing. The URL didn't work because it doesn't download the latest artifact for that pipeline; this is a known bug. So we were never actually fetching the correct artifact to build the api-docs site. Another API endpoint I found returns the most recent *successful* pipeline run, not the one in progress. What we really needed was the internal job ID to construct the URL to the spec.json that was just built, and I came up with a method to do that via trigger jobs and dotenv storage between jobs.

https://gitlab.com/gitlab-org/gitlab/-/issues/20230
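For reference, the dotenv hand-off looks roughly like this. Job names, the build command, and the artifact path are illustrative; only `artifacts:reports:dotenv` and the `/-/jobs/<id>/artifacts/raw/<path>` URL shape are GitLab features:

```
build-spec:
  stage: build
  script:
    - ./generate-spec.sh spec.json            # placeholder for whatever builds spec.json
    - echo "SPEC_JOB_ID=${CI_JOB_ID}" > job.env
  artifacts:
    paths:
      - spec.json
    reports:
      dotenv: job.env

spec-deploy:
  stage: deploy
  needs:
    - build-spec
  script:
    # SPEC_JOB_ID arrives via the dotenv report, so this URL points at the
    # artifact from the job that just ran, not the "latest successful" one.
    - curl -fL -o spec.json "${CI_SERVER_URL}/pleroma/pleroma/-/jobs/${SPEC_JOB_ID}/artifacts/raw/spec.json"
```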
