Fixing my own stupid

So, a while back I did something dumb.

I used to archive incoming mail with a procmail rule like this:

:0 c:
| /usr/bin/gzip >> $HOME/MailArchive/backup.gz

I would roll the archive over manually near the beginning of each month.

This was a pain.

A couple of times (OK, three) I ended up with partial archives that I wanted to combine.

Instead of doing `zcat oldarchive.gz | gzip >> otherarchive.gz`, I did `cat oldarchive.gz otherarchive.gz > newarchive.gz`. This gave me one file that was really two gzip archives back to back, not one archive with the contents of both gzip files.

Since gzip is moderately smart, I could decompress the first archive in the file, but once it got to the end of the first archive, it would error out. Every piece of mail in the second part of the file appeared to be lost.

Of course, gzip's error message is this:

"gzip: filename: decompression OK, trailing garbage ignored" -- not very helpful.

So I downloaded the gzip source and modified line 1308/1309 of gzip.c so that it would print the approximate offset in the file where it was having problems.

I wasn't sure whether I would see a 'hole' in the file: a run of null bytes where the first archive didn't quite make it to the end of an allocated block. A little time with hexdump got me nowhere. The gzip byte signature just appeared too often in the file for me to narrow down where to look, and it was a general pain. I was about to give up on that tactic and perhaps write a custom program to un-munge my files.

Instead, I downloaded HexEdit. Wow. It's a great tool. Sure enough, there was a big run of null bytes around the offset that I got gzip to report. In fact, right after the nulls was the gzip file signature.
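The failure is easy to reproduce on throwaway files. This is only a sketch, assuming GNU gzip and a `head` that understands `-c`; the function name `make_munged`, the file names, and the 512-byte pad are all made up for illustration. As far as I can tell, it is the run of nulls between the two streams that makes gzip stop, not the concatenation as such, since gzip normally reads back-to-back streams straight through.

```shell
# make_munged DIR  (hypothetical helper, for experimenting only)
# Build DIR/joined.gz: two gzip streams back to back.
# Build DIR/munged.gz: the same two streams with 512 null bytes in between.
make_munged() {
    dir=$1
    printf 'first message\n'  | gzip > "$dir/a.gz"
    printf 'second message\n' | gzip > "$dir/b.gz"
    cat "$dir/a.gz" "$dir/b.gz" > "$dir/joined.gz"
    { cat "$dir/a.gz"; head -c 512 /dev/zero; cat "$dir/b.gz"; } > "$dir/munged.gz"
}
```

After `make_munged /tmp/demo` (with /tmp/demo already created), `gzip -dc /tmp/demo/joined.gz` should print both messages, while `gzip -dc /tmp/demo/munged.gz` should print only the first one and complain about trailing garbage.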

I used HexEdit to remove the first gzip archive (the one I had already been able to extract), and was then able to extract the other archive.
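The same surgery can probably be scripted instead of done by hand in a hex editor. What follows is only a sketch, assuming GNU grep, tail, cut, and gzip; the function name `recover_tail` is made up for illustration. It searches for the gzip signature (bytes 1f 8b, followed by 08 for deflate). Because that sequence also turns up by chance inside compressed data, it tests each candidate offset with `gzip -t` and keeps the first one past the start of the file that checks out cleanly.

```shell
# recover_tail FILE  (hypothetical helper, not part of gzip)
# Print the decompressed data starting at the first valid gzip stream
# found after byte 0 of FILE. Returns 1 if no such stream is found.
recover_tail() {
    file=$1
    sig=$(printf '\037\213\010')   # gzip magic bytes plus the deflate method byte
    # grep -a: treat binary as text, -b: print byte offsets, -o: one line per match
    for off in $(LC_ALL=C grep -abo -F -e "$sig" "$file" | LC_ALL=C cut -d: -f1); do
        [ "$off" -gt 0 ] || continue   # skip the stream at the very start
        # tail -c +N counts from 1, so byte offset 0 corresponds to +1
        if tail -c +"$((off + 1))" "$file" | gzip -t 2>/dev/null; then
            tail -c +"$((off + 1))" "$file" | gzip -dc
            return 0
        fi
    done
    echo "no later gzip stream found in $file" >&2
    return 1
}
```

Something like `recover_tail newarchive.gz > second-half.mbox` would then write out the mail that gzip refused to reach. A candidate is only accepted if `gzip -t` exits cleanly, so a file with a further run of garbage after the second stream would need another pass.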

I've fixed all three files.

Since gzip compresses more effectively when I give it a bigger data set than one mail message, I plan on rolling the monthly archives into a yearly archive.
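The yearly roll-up could look something like the sketch below. It assumes the monthly files sit in one directory and are named like Jan-2005.gz, with English month abbreviations; `roll_year` and the output name YEAR.gz are hypothetical. The point is to decompress and recompress rather than `cat` the files together, so that gzip sees the whole year as a single stream.

```shell
# roll_year DIR YEAR  (hypothetical helper)
# Decompress DIR/Jan-YEAR.gz through DIR/Dec-YEAR.gz in calendar order,
# skipping months that don't exist, and recompress everything as one
# gzip stream in DIR/YEAR.gz. The monthly files are left in place.
roll_year() {
    dir=$1
    year=$2
    for mon in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec; do
        f="$dir/$mon-$year.gz"
        if [ -f "$f" ]; then
            gzip -dc "$f"
        fi
    done | gzip > "$dir/$year.gz"
}
```

The months are listed explicitly because a glob like `*-2005.gz` would sort them alphabetically (Apr, Aug, Dec, ...) rather than in calendar order.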

I've also since changed my procmail rule to something more like:
:0 c:
| /usr/bin/gzip >> $HOME/MailArchive/`/bin/date "+%b-%Y"`.gz

So I don't have to bother with manual rolling. I also wrote a quick Perl script to add the contents of my Sent Mail folder to the appropriate archives, so with that running out of cron, everything is automated.

I'm pretty happy, because this had been bothering me for a while and I hadn't had time to try to fix it (I wasn't even sure it would be practical).

It was, and I am victorious.

This blog is licensed under a Creative Commons License.