Dumps of Amanda server's local root partition failing



bethany
January 10th, 2007, 06:50 AM
Hi all,
I am trying to troubleshoot repeated dump failures that occur only when my Amanda server tries to back up its own local root partition. Similar failures also occasionally occur on /var/lib/amanda, but that disk is usually retried and completes later in the run.

middenheap.xxx.edu / lev 1 FAILED [cannot read header: got 0 instead of 32768]
middenheap.xxx.edu / lev 1 FAILED [cannot read header: got 0 instead of 32768]
middenheap.xxx.edu / lev 1 FAILED [too many dumper retry: "[request failed: timeout waiting for ACK]

Yesterday I made the following adjustments to amanda.conf and disklist in an effort to troubleshoot:
* increased "inparallel" value from 12 to 14
* gave middenheap's DLEs the following spindle numbers so that the largest disks won't try to back up simultaneously:

middenheap.xxx.edu / {
    server-encrypt-root-fast
    exclude list ".am_exclude"
} 2 local
middenheap.xxx.edu /etc/amanda amanda-config-backup 2 local
middenheap.xxx.edu /var/lib/amanda amanda-config-backup 3 local
middenheap.xxx.edu /etc include 2 local
middenheap.xxx.edu /var server-encrypt-fast 3 local

The associated backup types:

define dumptype server-encrypt-fast {
    global
    program "GNUTAR"
    comment "dump with fast client compression and server openssl asymmetric encryption"
    compress client fast
    encrypt server
    index
    server_encrypt "/usr/sbin/amcrypt-ossl-asym"
    server_decrypt_option "-d"
    priority medium
}

# high priority for user data
define dumptype server-encrypt-user-fast {
    server-encrypt-fast
    priority high
}

# low priority for root partitions
define dumptype server-encrypt-root-fast {
    server-encrypt-fast
    priority low
}

# amanda config/bootstrap file backups
define dumptype amanda-config-backup { #root-tar
    server-encrypt-fast
    comment "force Level 0 backups for Amanda config files"
    strategy noinc
}

define dumptype include {
    amanda-config-backup
    comment "force Level 0 backups and only backup files specified in the include list"
    priority medium
    include list "/var/lib/amanda/srv_configs"
}

The .am_exclude file for / lists /amandatapes, which is a symbolic link to a separate partition. That entry is probably unnecessary, but I put it there anyway for good measure.

Excerpts from /tmp/amanda/client/ISDaily2.5/sendbackup*.debug:
sendbackup: time 318.720: started backup
sendbackup: time 2665.645: 118: strange(?): sed: couldn't write 72 items to stdout: Broken pipe
sendbackup: time 2677.876: 118: strange(?):
sendbackup: time 2677.876: 118: strange(?): gzip: stdout: Broken pipe
sendbackup: time 2678.156: index tee cannot write [Broken pipe]
sendbackup: time 2678.156: pid 30607 finish time Wed Jan 10 04:53:34 2007

sendbackup: time 0.450: started backup
sendbackup: time 927.562: 118: strange(?): sed: couldn't write 47 items to stdout: Broken pipe
sendbackup: time 928.079: 118: strange(?):
sendbackup: time 928.079: 118: strange(?): gzip: stdout: Broken pipe
sendbackup: time 928.088: index tee cannot write [Broken pipe]
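One way to put numbers on the timing is to subtract the "started backup" timestamp from the first "Broken pipe" timestamp in each excerpt. The sketch below does that in Python 3; the function name `seconds_until_broken_pipe` is hypothetical, and it assumes the `sendbackup: time <seconds>: <message>` line format shown above.

```python
import re

# Matches sendbackup debug lines such as
#   "sendbackup: time 927.562: 118: strange(?): sed: ... Broken pipe"
LINE_RE = re.compile(r"^sendbackup: time (\d+\.\d+): (.*)$")


def seconds_until_broken_pipe(lines):
    """Seconds from 'started backup' to the first 'Broken pipe' line, or None."""
    start = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that are not timestamped sendbackup output
        t, msg = float(m.group(1)), m.group(2)
        if start is None and msg == "started backup":
            start = t
        elif start is not None and "Broken pipe" in msg:
            return t - start
    return None


# The two runs quoted above, reduced to the lines that matter here.
run_1 = [
    "sendbackup: time 318.720: started backup",
    "sendbackup: time 2665.645: 118: strange(?): sed: couldn't write 72 items to stdout: Broken pipe",
]
run_2 = [
    "sendbackup: time 0.450: started backup",
    "sendbackup: time 927.562: 118: strange(?): sed: couldn't write 47 items to stdout: Broken pipe",
]

for name, run in (("run 1", run_1), ("run 2", run_2)):
    print(f"{name}: {seconds_until_broken_pipe(run):.1f} s before the pipe broke")
```

On these two excerpts the gap is roughly 2347 s for the first run and 927 s for the second, so the pipe does not always break at the same point in the dump.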

Why is the amanda client on middenheap losing connectivity with the amanda server process, also on middenheap, after only 900 seconds? There is a line in my iptables rules explicitly allowing all traffic on lo, so that would seem to rule out firewall issues.
-A RH-Firewall-1-INPUT -i lo -j ACCEPT

I've already made the suggested changes to /etc/modprobe.conf:
options ip_conntrack_amanda master_timeout=3600
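Options in /etc/modprobe.conf only take effect when the module is next loaded, so it may be worth confirming the value the running module actually uses. The sketch below assumes the module exposes `master_timeout` in the usual sysfs parameter directory, `/sys/module/ip_conntrack_amanda/parameters/`; that path and the helper names are assumptions, and the file may be absent or readable only by root.

```python
from pathlib import Path

# Assumed sysfs location of the loaded module's parameter; newer kernels name
# the module nf_conntrack_amanda instead of ip_conntrack_amanda.
PARAM_FILE = Path("/sys/module/ip_conntrack_amanda/parameters/master_timeout")


def parse_param(text):
    """Parse a sysfs integer parameter, returning None if it is not an integer."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def read_master_timeout(param_file=PARAM_FILE):
    """Current master_timeout in seconds, or None if the file cannot be read."""
    try:
        return parse_param(param_file.read_text())
    except OSError:  # module not loaded, parameter not exported, or permission denied
        return None


if __name__ == "__main__":
    value = read_master_timeout()
    if value is None:
        print("master_timeout not readable; is the module loaded?")
    else:
        print(f"master_timeout = {value} s (modprobe.conf asks for 3600)")
```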

And here's an error I've never noticed before in ~amandabackup/ISDaily2.5/amdump, "timeout waiting for REP":

driver: result time 1004.310 from chunker13: FAILED 14-00016 "[cannot read header: got 0 instead of 32768]"
driver: state time 1004.310 free kps: 82966 space: 322408352 taper: idle idle-dumpers: 0 qlen tapeq: 0 runq: 31 roomq: 0 wakeup: 0 driver-idle: no-dumpers
driver: interface-state time 1004.310 if default: free 72566 if LOCAL: free 10000 if LE0: free 400
driver: hdisk-state time 1004.310 hdisk 0: free 322408352 dumpers 14
driver: result time 1004.310 from dumper13: TRY-AGAIN 14-00016 "[request failed: timeout waiting for REP]"
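The chunker and dumper results above carry the same job handle (14-00016) and the same timestamp, which suggests they are two views of one dump attempt: the chunker got 0 of the 32768 header bytes it expected because the dumper's request timed out. With 14 dumpers running, grouping the driver's `result` lines by handle makes that pairing easier to spot. A minimal Python 3 sketch, assuming the line format shown here; the regex and the name `results_by_handle` are hypothetical.

```python
import re
from collections import defaultdict

# driver: result time 1004.310 from chunker13: FAILED 14-00016 "[...]"
RESULT_RE = re.compile(
    r'^driver: result time (?P<time>\d+\.\d+) from (?P<proc>\w+): '
    r'(?P<status>[A-Z-]+) (?P<handle>\d+-\d+) "(?P<msg>.*)"$'
)


def results_by_handle(lines):
    """Group driver 'result' lines by job handle as (process, status, message)."""
    groups = defaultdict(list)
    for line in lines:
        m = RESULT_RE.match(line.strip())
        if m:  # state, interface-state and hdisk-state lines do not match
            groups[m["handle"]].append((m["proc"], m["status"], m["msg"]))
    return dict(groups)


excerpt = [
    'driver: result time 1004.310 from chunker13: FAILED 14-00016 "[cannot read header: got 0 instead of 32768]"',
    "driver: hdisk-state time 1004.310 hdisk 0: free 322408352 dumpers 14",
    'driver: result time 1004.310 from dumper13: TRY-AGAIN 14-00016 "[request failed: timeout waiting for REP]"',
]

for handle, results in results_by_handle(excerpt).items():
    print(handle)
    for proc, status, msg in results:
        print(f"  {proc:10} {status:10} {msg}")
```

To run it over a whole log, pass an open file object (for example `open("amdump.1")`) instead of the `excerpt` list.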

Most of the other related errors in amdump.1 look like this one, with "timeout waiting for ACK":

driver: hdisk-state time 3108.065 hdisk 0: free 322012874 dumpers 14
driver: result time 3108.065 from dumper13: TRY-AGAIN 13-00017 "[request failed: timeout waiting for ACK]"
driver: dump failed 13-00017 middenheap.xxx.edu /, too many dumper retry: "[request failed: timeout waiting for ACK]"
rename_tmp_holding: /amandahold/20070110031502/middenheap.xxx.edu._.1: empty file?
driver: adjust_diskspace: time 3108.460: middenheap.xxx.edu:/ /amandahold/20070110031502/middenheap.xxx.edu._.1
driver: adjust_diskspace: time 3108.460: hdisk HD1 done, reserved 1344 used 0 diff -1344 alloc 71148758 dumpers 13
driver: adjust_diskspace: time 3108.460: after: disk middenheap.xxx.edu:/ used 0
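Once a DLE runs out of retries, the driver logs a `dump failed` line, and in this excerpt the holding file left behind is empty, consistent with `used 0` in the `adjust_diskspace` output. A sketch that pulls both kinds of line out of an amdump file, assuming Python 3.8+ (for `:=`) and the formats shown above; the regexes and the name `summarize_failures` are hypothetical.

```python
import re

# driver: dump failed 13-00017 middenheap.xxx.edu /, too many dumper retry: "..."
FAILED_RE = re.compile(r"^driver: dump failed (\S+) (\S+) (.+?), (.*)$")
# rename_tmp_holding: /amandahold/.../middenheap.xxx.edu._.1: empty file?
EMPTY_RE = re.compile(r"^rename_tmp_holding: (\S+): empty file\?$")


def summarize_failures(lines):
    """Return ([(host, disk, reason), ...], [paths of empty holding files])."""
    failed, empty = [], []
    for line in lines:
        line = line.strip()
        if m := FAILED_RE.match(line):
            _handle, host, disk, reason = m.groups()
            failed.append((host, disk, reason))
        elif m := EMPTY_RE.match(line):
            empty.append(m.group(1))
    return failed, empty


sample = [
    'driver: dump failed 13-00017 middenheap.xxx.edu /, too many dumper retry: "[request failed: timeout waiting for ACK]"',
    "rename_tmp_holding: /amandahold/20070110031502/middenheap.xxx.edu._.1: empty file?",
]

failed, empty = summarize_failures(sample)
for host, disk, reason in failed:
    print(f"FAILED {host}:{disk}  {reason}")
for path in empty:
    print(f"EMPTY  {path}")
```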


I'm really stumped! All backups on all other clients (49 DLEs total) are working fine. Any ideas?

ppragin
January 10th, 2007, 02:05 PM
I think this could be caused by the firewall not allowing the client to connect back to the server. Since you are using an FQDN (middenheap.xxx.edu) rather than "localhost", the traffic may not be going through "lo". I would suggest trying "localhost" instead.

Thanks
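To test the theory that the FQDN does not map to the loopback interface, a first step is to check which addresses the name resolves to on the server itself. A minimal Python 3 sketch (the helper names are hypothetical); note that it only checks name resolution, and which interface the packets actually traverse is decided by the kernel's routing.

```python
import ipaddress
import socket


def resolved_addresses(hostname):
    """All distinct addresses getaddrinfo returns for hostname (IPv4 and IPv6)."""
    return sorted({info[4][0] for info in socket.getaddrinfo(hostname, None)})


def resolves_to_loopback(hostname):
    """True if every resolved address is a loopback address (127.0.0.0/8 or ::1)."""
    addrs = resolved_addresses(hostname)
    # Strip any IPv6 zone suffix such as "%eth0" before parsing.
    return all(ipaddress.ip_address(a.split("%")[0]).is_loopback for a in addrs)
```

Calling `resolves_to_loopback` with the server's real FQDN on the server shows whether that name points at 127.0.0.1 or at the host's external address.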

bethany
January 12th, 2007, 05:55 AM

Hi Ppragin - I tried this for last night's dumps, but both / and /var failed with the same errors. From amstatus:

localhost:/ 0 driver: (aborted:"[request failed: timeout waiting for ACK]")(too many dumper retry)
localhost:/var 0 driver: (aborted:"[request failed: timeout waiting for ACK]")(too many dumper retry)

in sendbackup.20070112032821.debug:

sendbackup: time 65.848: started backup
sendbackup: time 2442.495: 118: strange(?): sed: couldn't write 72 items to stdout: Broken pipe
sendbackup: time 2460.388: 118: strange(?):
sendbackup: time 2460.388: 118: strange(?): gzip: stdout: Broken pipe
sendbackup: time 2460.684: index tee cannot write [Broken pipe]
sendbackup: time 2460.684: pid 7933 finish time Fri Jan 12 04:10:14 2007
sendbackup: time 2460.685: 118: strange(?): sendbackup: index tee cannot write [Broken pipe]
sendbackup: time 2460.731: 47: size(|): Total bytes written: 798720 (780KiB, ?/s)
sendbackup: time 2460.731: 118: strange(?): gtar: -: Wrote only 8192 of 10240 bytes
sendbackup: time 2460.731: 118: strange(?): gtar: Error is not recoverable: exiting now
sendbackup: time 2460.732: error [compress returned 1, /bin/tar returned 2]
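The tar figures in this excerpt are consistent with the pipe closing underneath tar: GNU tar writes in records of 20 x 512 = 10240 bytes by default, 798720 bytes is exactly 78 such records (780 KiB), and the next record was cut short at 8192 bytes. The exit statuses line up too, since GNU tar documents status 2 as a fatal error and gzip uses 1 for an error. A small Python 3 check of that arithmetic; the function name is hypothetical, while the block size, default blocking factor and exit-code meanings come from the GNU tar and gzip manuals.

```python
# GNU tar writes archives in records of (blocking factor x 512-byte blocks);
# the default blocking factor is 20, so a record is 10240 bytes.
BLOCK_SIZE = 512
DEFAULT_BLOCKING_FACTOR = 20
RECORD_SIZE = BLOCK_SIZE * DEFAULT_BLOCKING_FACTOR

# Exit-status meanings as documented in the GNU tar and gzip manuals.
TAR_STATUS = {0: "success", 1: "some files differ", 2: "fatal error"}
GZIP_STATUS = {0: "success", 1: "error", 2: "warning"}


def explain_sendbackup_error(total_bytes, tar_rc, gzip_rc):
    """Break down tar's byte count into records and name the exit statuses."""
    full_records, leftover = divmod(total_bytes, RECORD_SIZE)
    return {
        "kib": total_bytes / 1024,
        "full_records": full_records,
        "leftover_bytes": leftover,
        "tar": TAR_STATUS.get(tar_rc, "unknown"),
        "gzip": GZIP_STATUS.get(gzip_rc, "unknown"),
    }


# Values from the debug excerpt: "Total bytes written: 798720",
# "compress returned 1, /bin/tar returned 2".
print(explain_sendbackup_error(798720, tar_rc=2, gzip_rc=1))
```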

~/amandabackup/*/amdump.1:

driver: state time 880.426 free kps: 91408 space: 324362304 taper: idle idle-dumpers: 0 qlen tapeq: 0 runq: 32 roomq: 0 wakeup: 0 driver-idle: no-dumpers
driver: interface-state time 880.426 if default: free 81008 if LOCAL: free 10000 if LE0: free 400
driver: hdisk-state time 880.426 hdisk 0: free 324362304 dumpers 14
driver: result time 880.426 from chunker10: FAILED 12-00018 "[cannot read header: got 0 instead of 32768]"
driver: state time 880.426 free kps: 91408 space: 324362304 taper: idle idle-dumpers: 0 qlen tapeq: 0 runq: 32 roomq: 0 wakeup: 0 driver-idle: no-dumpers
driver: interface-state time 880.426 if default: free 81008 if LOCAL: free 10000 if LE0: free 400
driver: hdisk-state time 880.426 hdisk 0: free 324362304 dumpers 14
driver: result time 880.426 from dumper10: TRY-AGAIN 12-00018 "[request failed: timeout waiting for ACK]"
rename_tmp_holding: /amandahold/20070112031502/localhost._.0: empty file?

I'm going to change it back to the FQDN, since amcheck prefers it that way:

Amanda Backup Client Hosts Check
--------------------------------
WARNING: Usage of fully qualified hostname recommended for Client localhost.

Any other ideas, or troubleshooting techniques I can use besides looking through log files?

ktill
January 12th, 2007, 09:51 AM
All the DLEs are using amcrypt-ossl-asym, which rules out an OpenSSL setup problem.
The only suggestion I have is to add "compress none" to dumptype server-encrypt-root-fast.
Also check the iptables log to see whether any Amanda packets have been rejected.

Hope this helps!

--Kevin Till

bethany
January 17th, 2007, 04:52 AM
Yesterday I changed my server-encrypt-root-fast dumptype to use "compress none", and so far so good - last night's dumps completed without a hitch. Thanks so much!