PDA

View Full Version : Strange failure



NdK
July 4th, 2012, 12:50 AM
Hello all.

I'm using Amanda 3.3.1 (installed from backports) on a Debian Squeeze.
The cron job fails every night, but if I run manually the failing job(s), it works w/o problems.
I'm getting mad at it. Actually, upgrading to 3.3.1 was a "last hope" try.

I use ssh auth, but the jobs that are failing are "local" ones (an NFS-mounted share and 3 Samba shares).
I think the authorized_keys and known_hosts files are OK (or the manual run would fail too...).

My disklist definitions for the failing jobs are:

127.0.0.1 /mnt/home_serverlab/PERSONALE {
daily
exclude list "/etc/amanda/DailySet/exclude_serverlab.list"
} 2 eth_giga
127.0.0.1 /mnt/home_serverlab/STUDENTI {
daily
exclude list "/etc/amanda/DailySet/exclude_serverlab.list"
} 2 eth_giga
127.0.0.1 //urania/www$/ {
daily
# Samba doesn't support exclude lists, just exclude file!
exclude "./Nipoti/thesis_cn.pdf"
} 4 eth_giga
127.0.0.1 //urania/Amministrazione/ daily 4 eth_giga
127.0.0.1 //urania/Tecnici$/ daily 4 eth_giga


The log in /amanda/DailySet/logs/log.20120703210001.0 says:

ERROR planner Some estimate timeout on 127.0.0.1, using server estimate if possible
WARNING planner disk 127.0.0.1://urania/Tecnici$/, estimate of level 0 timed out.
WARNING planner disk 127.0.0.1://urania/Tecnici$/, estimate of level 1 timed out.
WARNING planner disk 127.0.0.1://urania/Amministrazione/, estimate of level 0 timed out.
WARNING planner disk 127.0.0.1://urania/Amministrazione/, estimate of level 1 timed out.
WARNING planner disk 127.0.0.1://urania/www$/, estimate of level 0 timed out.
WARNING planner disk 127.0.0.1://urania/www$/, estimate of level 1 timed out.
WARNING planner disk 127.0.0.1:/mnt/home_r1/STUDENTI/, estimate of level 1 timed out.
FAIL planner 127.0.0.1 //urania/Tecnici$/ 20120703210001 0 "[disk //urania/Tecnici$/, all estimate timed out]"
FAIL planner 127.0.0.1 //urania/Amministrazione/ 20120703210001 0 "[disk //urania/Amministrazione/, all estimate timed out]"
FAIL planner 127.0.0.1 //urania/www$/ 20120703210001 0 "[disk //urania/www$/, all estimate timed out]"
INFO planner Full dump of 127.0.0.1:/mnt/home_serverlab/STUDENTI promoted from 1 day ahead.


And in /var/log/amanda/amandad/amandad.20120703210001.debug I see:

Tue Jul 3 22:03:52 2012: thd-0xef2be0: amandad: sending PREP pkt:
<<<<<
OPTIONS features=ffffffff9efefbffffffffff1f;
/mnt/home_serverlab/PERSONALE 0 SIZE 42973110
/mnt/home_serverlab/PERSONALE 1 SIZE 79050
/mnt/home_serverlab/STUDENTI 0 SIZE 13826100
/mnt/home_serverlab/STUDENTI 1 SIZE 117660
/mnt/home_serverlab/STUDENTI 2 SIZE 114950
/mnt/home_r1/PERSONALE/ 0 SIZE 7783410
/mnt/home_r1/PERSONALE/ 1 SIZE 5730
/mnt/home_r1/STUDENTI/ 0 SIZE 68937560
>>>>>
Tue Jul 3 22:03:52 2012: thd-0xef2be0: amandad: tcpm_send_token: data is still flowing
Tue Jul 3 22:15:03 2012: thd-0xef2be0: amandad: stream_read_callback: data is still flowing
Tue Jul 3 22:15:03 2012: thd-0xef2be0: amandad: security_stream_seterr(0xefe150, EOF)
Tue Jul 3 22:15:03 2012: thd-0xef2be0: amandad: security_seterror(handle=0xefca40, driver=0x7f2abf2e8220 (SSH) error=EOF on read from localhost)
Tue Jul 3 22:15:03 2012: thd-0xef2be0: amandad: receive error: EOF on read from localhost
Tue Jul 3 22:16:35 2012: thd-0xef2be0: amandad: sending PREP pkt:
<<<<<
OPTIONS features=ffffffff9efefbffffffffff1f;
/mnt/home_serverlab/PERSONALE 0 SIZE 42973110
/mnt/home_serverlab/PERSONALE 1 SIZE 79050
/mnt/home_serverlab/STUDENTI 0 SIZE 13826100
/mnt/home_serverlab/STUDENTI 1 SIZE 117660
/mnt/home_serverlab/STUDENTI 2 SIZE 114950
/mnt/home_r1/PERSONALE/ 0 SIZE 7783410
/mnt/home_r1/PERSONALE/ 1 SIZE 5730
/mnt/home_r1/STUDENTI/ 0 SIZE 68937560
/mnt/home_r1/STUDENTI/ 1 SIZE 210390
>>>>>
Tue Jul 3 22:16:35 2012: thd-0xef2be0: amandad: tcpm_send_token: data is still flowing
Tue Jul 3 22:16:35 2012: thd-0xef2be0: amandad: security_stream_seterr(0xefe150, write error to: Bad file descriptor)
Tue Jul 3 22:16:35 2012: thd-0xef2be0: amandad: security_seterror(handle=0xefca40, driver=0x7f2abf2e8220 (SSH) error=write error to: Bad file descriptor)
Tue Jul 3 22:16:36 2012: thd-0xef2be0: amandad: sending PREP pkt:

I'm sure I'm missing something, but I can swear I tried all I could think of...
Any hints?

Tks!

martineau
July 4th, 2012, 04:21 AM
Increase etimeout in amanda.conf

NdK
July 4th, 2012, 05:19 AM
Trying etimeout 1800 ...
Hope it works. I'll know tomorrow. Fingers crossed.

Tks.

martineau
July 4th, 2012, 05:26 AM
Look in the sendsize.*.debug files, what is the time difference between the first and last lines.

NdK
July 4th, 2012, 10:58 PM
First of all: today it worked. No errors. Finally! :) Just a need to run amflush I never saw before.
Thanks!

For sendsize.*.debug file I just noticed it contains:

Tue Jul 3 21:35:30 2012: thd-0xaf3340: sendsize: calculating for amname /mnt/home_r1/STUDENTI/, dirname /mnt/home_r1/STUDENTI/, spindle 3 GNUTAR
Tue Jul 3 21:35:30 2012: thd-0xaf3340: sendsize: getting size via gnutar for /mnt/home_r1/STUDENTI/ level 0
Tue Jul 3 21:35:30 2012: thd-0xaf3340: sendsize: Spawning "/usr/lib/amanda/runtar runtar DailySet /bin/tar --create --file /dev/null --numeric-owner --directory /mnt/home_r1/STUDENTI/ --one-file-system --listed
-incremental /var/lib/amanda/gnutar-lists/127.0.0.1_mnt_home_r1_STUDENTI__0.new --sparse --ignore-failed-read --totals --exclude-from /tmp/amanda/sendsize._mnt_home_r1_STUDENTI_.20120703213530.exc lude ." in pipe
line
Tue Jul 3 21:49:40 2012: thd-0xaf3340: sendsize: /bin/tar: ./user.name/.gnupg/S.gpg-agent: socket ignored
[... cut more 'socket ignored'...]
Tue Jul 3 22:03:52 2012: thd-0xaf3340: sendsize: Total bytes written: 70592061440 (66GiB, 40MiB/s)
Tue Jul 3 22:03:52 2012: thd-0xaf3340: sendsize: .....
Tue Jul 3 22:03:52 2012: thd-0xaf3340: sendsize: estimate time for /mnt/home_r1/STUDENTI/ level 0: 1701.855
Tue Jul 3 22:03:52 2012: thd-0xaf3340: sendsize: estimate size for /mnt/home_r1/STUDENTI/ level 0: 68937560 KB
Tue Jul 3 22:03:52 2012: thd-0xaf3340: sendsize: waiting for runtar "/mnt/home_r1/STUDENTI/" child
Tue Jul 3 22:03:52 2012: thd-0xaf3340: sendsize: after runtar /mnt/home_r1/STUDENTI/ wait
Tue Jul 3 22:03:52 2012: thd-0xaf3340: sendsize: getting size via gnutar for /mnt/home_r1/STUDENTI/ level 1
Tue Jul 3 22:03:52 2012: thd-0xaf3340: sendsize: Spawning "/usr/lib/amanda/runtar runtar DailySet /bin/tar --create --file /dev/null --numeric-owner --directory /mnt/home_r1/STUDENTI/ --one-file-system --listed-incremental /var/lib/amanda/gnutar-lists/127.0.0.1_mnt_home_r1_STUDENTI__1.new --sparse --ignore-failed-read --totals --exclude-from /tmp/amanda/sendsize._mnt_home_r1_STUDENTI_.20120703220352.exc lude ." in pipeline
Tue Jul 3 22:16:35 2012: thd-0xaf3340: sendsize: Total bytes written: 215439360 (206MiB, 276KiB/s)
Tue Jul 3 22:16:35 2012: thd-0xaf3340: sendsize: .....
Tue Jul 3 22:16:35 2012: thd-0xaf3340: sendsize: estimate time for /mnt/home_r1/STUDENTI/ level 1: 763.251
Tue Jul 3 22:16:35 2012: thd-0xaf3340: sendsize: estimate size for /mnt/home_r1/STUDENTI/ level 1: 210390 KB
But no errors/warnings/...
I think that "1701.855" leaves very little margin to my timeout of 1800... Actually, I can see a sendsize log where it took 1830 seconds. Should I increase it more?