Retry queue implosion
So lain reported that the retry queue is "imploding" the BEAM on pleroma.soykaf.com.
Ultimately, the problem is:
- too many dead instances are creating retries
- the current implementation has flaws:
  - it potentially creates way too many timers
  - since re-publishes are executed inside the retry queue process, that process's mailbox may overflow with queued messages (from the timers)
We could work on three different fixes:
- Store in postgres the last time we successfully heard from or delivered an activity to a domain. If that timestamp is too old, we discard retries for that domain
- Use a temporary per-domain fuse: if too many publishes to an instance fail, stop trying for a while
- Use an ets/dets table to store retries, keyed by the timestamp at which the retry is due
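To make the last two ideas a bit more concrete, here is a rough sketch. It is written in Python only to keep it short and language-neutral; it is not Pleroma code, and every name in it (`DomainFuse`, `RetryTable`, `max_failures`, `cooldown`) is made up for illustration. The fuse stops publishing to a domain after a number of consecutive failures, and the table keeps retries ordered by due time so that one periodic tick can drain them instead of one timer per retry.

```python
import heapq


class DomainFuse:
    """Hypothetical per-domain fuse: blows after too many consecutive failures."""

    def __init__(self, max_failures=5, cooldown=600.0):
        self.max_failures = max_failures  # assumed threshold, not a Pleroma setting
        self.cooldown = cooldown          # seconds before a blown fuse resets
        self._failures = {}               # domain -> consecutive failure count
        self._blown_at = {}               # domain -> time the fuse blew

    def record_failure(self, domain, now):
        count = self._failures.get(domain, 0) + 1
        self._failures[domain] = count
        if count >= self.max_failures:
            self._blown_at[domain] = now

    def record_success(self, domain):
        self._failures.pop(domain, None)
        self._blown_at.pop(domain, None)

    def allows(self, domain, now):
        blown_at = self._blown_at.get(domain)
        if blown_at is None:
            return True
        if now - blown_at >= self.cooldown:
            # Cooldown elapsed: reset the fuse and let publishes through again.
            self.record_success(domain)
            return True
        return False


class RetryTable:
    """Hypothetical retry store ordered by the time each retry is due."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so entries with equal due times stay ordered

    def schedule(self, retry_at, domain, activity):
        heapq.heappush(self._heap, (retry_at, self._seq, domain, activity))
        self._seq += 1

    def pop_due(self, now):
        """Remove and return every (domain, activity) whose due time has passed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, domain, activity = heapq.heappop(self._heap)
            due.append((domain, activity))
        return due

    def __len__(self):
        return len(self._heap)
```

With something like this, a single periodic tick would call `pop_due(now)` and skip any domain for which `allows(domain, now)` is false. On the BEAM side the ordered store would presumably be an ets `ordered_set` table; as far as I know dets does not offer an ordered type, so the ordering would have to be handled differently there.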
Thoughts?