For anyone who cares, here's an explanation of what was happening last week:
The network on which these lists reside underwent a scheduled change of providers, i.e. got a different connection through to "the Internet". This involves registering changes with the top-level providers, as the backbone routing policies are no longer as dynamic as they used to be. It can take up to a week to get a change of CIDR block assignment implemented.
So. The backbone routing was changed, and we moved the connection. We're committed, as it would take a week to get switched back to our original provider.
Everything worked like a charm. For about 2 hours. Then it died. After a bit of head scratching we rebooted the routers. That fixed it. For about 20 minutes. And so on. After a while, only a hard power cycle would cure the problem.
We replaced everything. We got the guy who wrote the code in the routers on the case. We tried a different vendor's equipment. That worked! It routed packets great ... except to the machine running the mail hub.
All this time (and this is three days later), mail is being queued up at the secondary MX host in Birmingham, as it can't be delivered to us. We receive and redistribute so much mail that the secondary MX machine ran out of disk space and crashed. We fixed that.
Suddenly, the router started letting everyone see the garply.com mail machine! So the secondary started dumping all that queued mail onto our mail hub. And so did every other mail server on the Internet that had mail queued up for us ... say, about 3,000 of them, all within the first 30 minutes or so.
This caused our link to hit saturation, raising transit times up to 3000+ ms (from its usual 30 ms or so). It also ran the list server plumb out of 128MB of real and 370MB of virtual memory. Suddenly all those SMTP and DNS transactions start timing out, and mail starts queueing up again.
Then the link died again. And the cycle repeated itself over the next three days, although this time I managed to stage the inflood of mail by selectively opening up routes at our main router and letting the secondary MX host run its queue before letting anyone else get at our hub.
We think we have finally found the problem (the vendor sent us a new version of the EPROM code), and the link and mail servers have been up for 24 hours without a glitch.
It's been a horrible, horrible week!
-- hugh
PS. Are there any harmonica-playing Spurs supporters? Or vice versa?