PDA

View Full Version : "waiting for flush", "wait for writing to tape" and tape being ready?



Andrew Rakowski
February 16th, 2007, 08:18 AM
[Executive summary: two days of backups from roughtly 300 DLEs stuck on holding disks, no longer writing to virtual tapes for some reason, amflush NOT flushing to tape. Details below.]

Hello folks. Yesterday morning, my Amanda backup died for some reason (as I got blown out of my amandabackup user shells as well, I'm guessing I might have hit a resource limit on the box - odd, as it's handled a load of 44 before without dying).

Anyhow, the backups that started at 00:59 in the morning died somewhere aroung 02:15 or 02:31 (not sure, those were the last timestamps in the /etc/amanda/XXXXset1/log directory). When I got in to the office in the morning and looked for my daily e-mail and didn't find it, I checked with an 'amstatus XXXXset1' and saw that things weren't finished. Doing a 'ps -ef|grep am' showed that no Amanda processes were still running, and the holding disks had data in them.

I tried doing an 'amflush XXXXset1' and got a message that I needed to run amcleanup first. I did that, it seems to have succeeded, and I did another amflush which appeared to work just fine.

Since I was expecting a large set of level-0 backups that night, I decided to rerun the backup during the day. I started it, and it ran until about 21:30 or so, when I got my completion e-mail. I then looked, as I noticed it didn't write anything to the tape (XXXXset1-51 showed 0% used). I figured that it needed another amflush (although I don't know why), and tried a "amflush -f XXXXset1" so it would run in the foreground and I could see how long it took. The amflush found several sets of data to flush, I answered 'all' and 'y' and it completed immediately, e-mailing me a AMFLUSH report that showed all 0's, and said that it wrote to tape XXXXset1-52. I tried again, this time using a "amflush -b XXXXset1" so it would just flush all the data that it found. I got another "all 0's" mail report back, so it seems that amflush is not in fact flushing anything.

After these problems (and noticing that there were NO Amanda processes running in the background), and trying a few more times to run amflush (just picking one of the sets that it found, letting it run in the background, etc.) I just looked in the virtual disk "data" directories and saw that there was no actual data, did "amrmtape" and "amlabel" commands on the four tapes it said it used but did not actually write anything too, and figured that the screwup was running more than one backup on the same date (thinking maybe I'm tickling a bug from the conversion from "one run per day" to using timestamps.)

So, fast forward to this morning. The backups completed, saying they only wrote backups to one tape, but in fact writing NO data to any tapes. The "amstatus XXXXset1" shows all the DLEs like this:



...
walleye:/ 1 32m waiting to flush
walleye:/ 1 58m dump done (2:23:05), wait for writing to tape
walleye:/home 1 0m waiting to flush
walleye:/home 1 0m dump done (2:21:01), wait for writing to tape

SUMMARY part real estimated
size size
partition : 592
estimated : 281 199169m
flush : 290 218752m
failed : 23 4247m ( 2.13%)
wait for dumping: 0 0m ( 0.00%)
dumping to tape : 0 0m ( 0.00%)
dumping : 0 0m 0m ( 0.00%) ( 0.00%)
dumped : 279 92493m 194921m ( 47.45%) ( 46.44%)
wait for writing: 279 92493m 194921m ( 47.45%) ( 46.44%)
wait to flush : 289 218752m 218752m (100.00%) ( 0.00%)
writing to tape : 1 0m 0m (100.00%) ( 0.00%)
failed to tape : 0 0m 0m ( 0.00%) ( 0.00%)
taped : 0 0m 0m ( 0.00%) ( 0.00%)
tape 1 : 0 0m 0m ( 0.00%) XXXXset1-55
15 dumpers idle : no-dumpers
taper idle
network free kps: 2252800
holding space : 214695m ( 69.89%)
chunker0 busy : 0:00:00 ( 0.00%)
chunker1 busy : 0:00:00 ( 0.00%)
-sh-3.00$ amtape XXXXset1 current
amtape: scanning current slot in tape-changer rack:
slot 55: date 20070216005901 label XXXXset1-55
-sh-3.00$
-sh-3.00$ cat changerstatus
# multi-changer state cache: DO NOT EDIT!
curslot 55
curloaded 1
-sh-3.00$


So, it seems to want to write to tape XXXXset1-55, that IS the tape in the current tape drive slot, and yet it's doing nothing. The holding disk space is about full, and I don't want to touch it until I can figure out what is the right thing to do here. ALL the DLEs are both waiting for flushing (from yesterday?) as well as writing to tape (today's backups).

I did update from v2.5.1p2 to 2.5.1p3 (on the 4GB 4 CPU Redhat AW4 backup server), on the off chance that some small bug was fixed that affects this problem. I updated using the Zmanda.com binary RPMs. That hope doesn't seem to have panned out.

So, once again, what is it that I'm doing wrong here? I'm getting a little leery of trusting our systems to Amanda when I can't get past these simple problems. Granted, I am aiming for 400 to 500 DLEs worth of backups, using virtual tapes on 7.2TB worth of RAID-5 storage to start, with more growth to additional storage arrays after we get a feel for how much "churn" there is on these systems.

Best regards (and hopes for yet another magic bullet).

-Andrew

shailen
February 16th, 2007, 10:45 AM
can you post the contents of the following:


/etc/amanda/<backup-set in question>/amdump.1
/etc/amanda/<backup-set in question>/amdump.2
/etc/amanda/<backup-set in question>/amdump.3

Also, corresponding to the time date of the 3 most current amdump files, can you post the 3 most recent log files. from the same path.

/etc/amanda/<backup-set in question>/log.200702......

Thank You
-Shailen

martineau
February 16th, 2007, 11:32 AM
The amdump.1 ar amflush.1 file is needed to debug this problem.
Could you also post the taper.<timestamp>.debug file?

Andrew Rakowski
February 16th, 2007, 01:18 PM
The amdump.1 ar amflush.1 file is needed to debug this problem.
Could you also post the taper.<timestamp>.debug file?

The log files that you and Shailen have requested amount to just over 35MB of files, which gzip's down to about 2.8MB at maximum compression. If you have a place I can upload it, I can do that. Otherwise, if there are particular things to look for, I can look around and/or snip out sections of particular interest. I didn't realize that the amflush process left log files behind, and I'll see if there's anything obvious in there.

I didn't find any files starting with the string "taper" anywhere in the /etc/amanda directory tree.

You can contact me off the forum at:

ambackup.rakowski 'at' spamgourmet.com

to arrange for some place where I can put these files for you to look at.

Thanks for helping out...

-Andrew

ps: Is it the weekend yet? Somehow, it feels like a Monday! :)

shailen
February 16th, 2007, 01:50 PM
i will email you the ftp information :)

-shailen

Andrew Rakowski
February 16th, 2007, 06:53 PM
i will email you the ftp information :)

-shailen

I worked with Shailen and Jean-Louis offline and discovered the problem. After deleting the exactly 32KB sized file in the holding disk directory, I was able to run a flush successfully (filled seven tapes). I suspect this file was created just as the system crashed/amanda processes died, and amflush didn't know what to do with it.

Thanks for all the help folks! 8^)

-Andrew