Page 1 of 2 12 LastLast
Results 1 to 10 of 18

Thread: localhost backups intermittently failing, localhost NAK: timeout on reply pipe

  1. #1
    Join Date
    Mar 2017
    Posts
    13

    Default localhost backups intermittently failing, localhost NAK: timeout on reply pipe

    Hello,

    We've been running Amanda 3.3.4 and successfully backing directories on localhost for a few years. A few months ago I started getting intermittent failures on four of the server's local mountpoints.

    Amstatus is displaying the following error:
    Code:
    localhost:/srv/amanda/state  0 failed: planner: [localhost NAK: timeout on reply pipe]
    localhost:/srv/reports       0 failed: planner: [localhost NAK: timeout on reply pipe]
    localhost:/tank/backups      0 failed: planner: [localhost NAK: timeout on reply pipe]
    localhost:md1                0 failed: planner: [localhost NAK: timeout on reply pipe]
    Our amanda.conf (hostnames changed to protect the innocent):
    Code:
    org "hatteras"
    includefile "/etc/amanda/default/default.conf"
    logdir "/srv/amanda/state/log/hatteras"
    etimeout 86400
    dtimeout 3600
    amrecover_changer "dell-robot" 
    tpchanger "dell-robot"
    autolabel "$b"
    tapecycle 30
    dumpcycle 14 days
    runtapes 9
    tapetype LTO4
    autoflush yes
    
    flush-threshold-dumped 100
    flush-threshold-scheduled 100
    
    define dumptype gnutar-encrypted {
            "encrypted-gnutar-local"
            auth "ssh"
            ssh_keys "/home/amandabackup/.ssh/id_rsa_amdump"
            amandad_path "/usr/local/libexec/amanda/amandad"
    	estimate calcsize	
            compress server best
            program "GNUTAR"
    }
    
    define dumptype localhost-srv-backups {
            "encrypted-gnutar-local"
            estimate calcsize
    	include "./foo.bar.com"
    }
    
    holdingdisk hd1 {
            directory "/srv/amanda/holding/hatteras"
            chunksize 80 Gb
    }
    Relevant disklist entries:
    Code:
    # Amanda's own state
    localhost md1  encrypted-gnutar-local
    #localhost md2  encrypted-gnutar-local
    localhost /tank/backups encrypted-gnutar-local
    #localhost /srv/backups localhost-srv-backups
    localhost /srv/reports encrypted-gnutar-local
    localhost /srv/amanda/state encrypted-gnutar-local

    As you can see above, localhost backups are currently using the encrypted-gnutar-local backup method. I have tried using the method that worked on an older build of this server, declared in our amanda.conf, localhost-srv-backups, which uses the same backup method but employs the estimate type of calcsize, but amdump errors out on the localhost mountpoints complaining it cannot chdir to md1.

    Below is info on the underlying filesystems:

    Code:
    $ ls -la /srv/
    total 12
    drwxr-xr-x  3 root root 4096 Nov  6  2015 .
    drwxr-xr-x 26 root root 4096 Feb 22 06:54 ..
    lrwxrwxrwx  1 root root   13 Nov  6  2015 amanda -> /tank/amanda/
    lrwxrwxrwx  1 root root   14 Nov  6  2015 backups -> /tank/backups/
    drwxr-xr-x  3 root root 4096 Nov  6  2015 mnt
    lrwxrwxrwx  1 root root   14 Nov  6  2015 reports -> /tank/reports/
    lrwxrwxrwx  1 root root   13 Nov  6  2015 server -> /tank/server/
    
    $ ls -la /tank/
    total 30
    drwxr-xr-x  6 root         root            6 Nov  6  2015 .
    drwxr-xr-x 26 root         root         4096 Feb 22 06:54 ..
    drwxr-xr-x  4 amandabackup amandabackup    4 Dec 16  2015 amanda
    drwxr-xr-x 10 merkin       merkin         10 Feb 23  2016 backups
    drwxr-xr-x  6 amandabackup amandabackup    6 Nov  6  2015 reports
    drwxr-xr-x  9 amandabackup amandabackup   15 Mar  6 10:53 server
    
    $ mount
    /dev/md1 on / type ext4 (rw,errors=remount-ro)
    proc on /proc type proc (rw,noexec,nosuid,nodev)
    sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
    none on /sys/fs/fuse/connections type fusectl (rw)
    none on /sys/kernel/debug type debugfs (rw)
    none on /sys/kernel/security type securityfs (rw)
    udev on /dev type devtmpfs (rw,mode=0755)
    devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
    tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
    none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
    none on /run/shm type tmpfs (rw,nosuid,nodev)
    /dev/md2 on /var type ext4 (rw)
    10.6.1.49:/mnt/vol1/archive-epp on /mnt/archive-epp type nfs (rw,nolock,vers=4,addr=10.6.1.49,clientaddr=10.6.1.45)
    10.6.1.49:/mnt/vol1/epp on /mnt/epp type nfs (rw,nolock,vers=4,addr=10.6.1.49,clientaddr=10.6.1.45)
    tank on /tank type zfs (rw)
    tank/amanda on /tank/amanda type zfs (rw)
    tank/backups on /tank/backups type zfs (rw)
    tank/reports on /tank/reports type zfs (rw)
    tank/server on /tank/server type zfs (rw)
    rpc_pipefs on /run/rpc_pipefs type rpc_pipefs (rw)
    
    $ cat /proc/mdstat 
    Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
    md2 : active raid1 sdb2[1] sda2[0]
          97653688 blocks super 1.2 [2/2] [UU]
          
    md1 : active raid1 sda3[0] sdb3[1]
          359479160 blocks super 1.2 [2/2] [UU]
          
    md0 : active raid1 sda1[0] sdb1[1]
          31248312 blocks super 1.2 [2/2] [UU]
          
    unused devices: <none>

    Requesting assistance troubleshooting this issue, please advise.

  2. #2
    Join Date
    Nov 2005
    Location
    Canada
    Posts
    1,031

    Default

    GNUTAR require a mount point, md1 is not a mount point, use the mount point '/'.

  3. #3
    Join Date
    Mar 2017
    Posts
    13

    Default No dice

    Hi,

    I have adjusted the DLE for the md1 mountpoint on localhost as instructed above:
    Code:
    localhost / encrypted-gnutar-local
    The error remains:
    Code:
    localhost:/                  0 failed: planner: [localhost NAK: timeout on reply pipe]
    localhost:/srv/amanda/state  0 failed: planner: [localhost NAK: timeout on reply pipe]
    localhost:/srv/reports       0 failed: planner: [localhost NAK: timeout on reply pipe]
    localhost:/tank/backups      0 failed: planner: [localhost NAK: timeout on reply pipe]
    please advise.

    J

  4. #4
    Join Date
    Nov 2005
    Location
    Canada
    Posts
    1,031

    Default

    Which auth are you using? is it correctly configured?
    Do 'amcheck -c' succeed, you should not try amdump until amcheck succeed.
    Post all relevant debug files.

  5. #5
    Join Date
    Mar 2017
    Posts
    13

    Default amcheck results, follow-up question

    Results of amcheck:
    Code:
    amcheck hatteras
    Amanda Tape Server Host Check
    -----------------------------
    ERROR: lbl-templ set but no LPR command defined. You should reconfigure amanda
           and make sure it finds a lpr or lp command.
    Holding disk /srv/amanda/holding/hatteras: 6398017536 kB disk space available, using 6398017536 kB
    WARNING: skipping tape test because amdump or amflush seem to be running
    WARNING: if they are not, you must run amcleanup
    NOTE: info dir /srv/amanda/state/curinfo/localhost/_ does not exist
    NOTE: it will be created on the next run.
    NOTE: index dir /srv/amanda/state/index/localhost/_ does not exist
    NOTE: it will be created on the next run.
    Server check took 0.002 seconds
    
    Amanda Backup Client Hosts Check
    --------------------------------
    Client check: 2 hosts checked in 3.324 seconds.  0 problems found.
    
    (brought to you by Amanda 3.3.4)
    Which debug files would you like me to post please?

    J

  6. #6
    Join Date
    Mar 2017
    Posts
    13

    Default no input so here is a wild shot in the dark

    Logs:

    amdump_log: https://paste.ee/d/hSH59

    relevant info:
    Code:
    planner: time 0.080: setting up estimates for localhost:/
    localhost:/ overdue 17233 days for level 0
    setup_estimate: localhost:/: command 0, options: none    last_level -1 next_level0 -17233 level_days 0    getting estimates 0 (-3) -1 (-3) -1 (-3)
    planner: time 0.080: setting up estimates for localhost:/tank/backups
    localhost:/tank/backups overdue 52 days for level 0
    setup_estimate: localhost:/tank/backups: command 0, options: none    last_level 2 next_level0 -52 level_days 2    getting estimates 0 (-3) 2 (-3) 3 (-3)
    planner: time 0.097: setting up estimates for localhost:/srv/reports
    localhost:/srv/reports overdue 47 days for level 0
    setup_estimate: localhost:/srv/reports: command 0, options: none    last_level 1 next_level0 -47 level_days 1    getting estimates 0 (-3) 1 (-3) -1 (-3)
    planner: time 0.122: setting up estimates for localhost:/srv/amanda/state
    localhost:/srv/amanda/state overdue 47 days for level 0
    setup_estimate: localhost:/srv/amanda/state: command 0, options: none    last_level 1 next_level0 -47 level_days 1    getting estimates 0 (-3) 1 (-3) -1 (-3)
    
    ...
    
    GETTING ESTIMATES...
    planner: time 1.306: got partial result for host localhost disk /srv/amanda/state: 0 -> -3K, 1 -> -3K, -1 -> -3K
    planner: time 1.306: got partial result for host localhost disk /srv/reports: 0 -> -3K, 1 -> -3K, -1 -> -3K
    planner: time 1.306: got partial result for host localhost disk /tank/backups: 0 -> -3K, 2 -> -3K, 3 -> -3K
    planner: time 1.307: got partial result for host localhost disk /: 0 -> -3K, -1 -> -3K, -1 -> -3K
    planner: time 10.822: got partial result for host localhost disk /srv/amanda/state: 0 -> -3K, 1 -> -3K, -1 -> -3K
    planner: time 10.822: got partial result for host localhost disk /srv/reports: 0 -> -3K, 1 -> -3K, -1 -> -3K
    planner: time 10.822: got partial result for host localhost disk /tank/backups: 0 -> -3K, 2 -> -3K, 3 -> -3K
    planner: time 10.822: got partial result for host localhost disk /: 0 -> 13425520K, -1 -> -3K, -1 -> -3K
    planner: time 16.726: got partial result for host localhost disk /srv/amanda/state: 0 -> -3K, 1 -> -3K, -1 -> -3K
    planner: time 16.726: got partial result for host localhost disk /srv/reports: 0 -> 28720K, 1 -> -3K, -1 -> -3K
    planner: time 16.726: got partial result for host localhost disk /tank/backups: 0 -> -3K, 2 -> -3K, 3 -> -3K
    planner: time 16.726: got partial result for host localhost disk /: 0 -> 13425520K, -1 -> -3K, -1 -> -3K
    planner: time 16.781: got partial result for host localhost disk /srv/amanda/state: 0 -> -3K, 1 -> -3K, -1 -> -3K
    planner: time 16.781: got partial result for host localhost disk /srv/reports: 0 -> 28720K, 1 -> 1610K, -1 -> -3K
    planner: time 16.781: got partial result for host localhost disk /tank/backups: 0 -> -3K, 2 -> -3K, 3 -> -3K
    planner: time 16.781: got partial result for host localhost disk /: 0 -> 13425520K, -1 -> -3K, -1 -> -3K
    planner: time 42.888: got partial result for host localhost disk /srv/amanda/state: 0 -> 597500K, 1 -> -3K, -1 -> -3K
    planner: time 42.888: got partial result for host localhost disk /srv/reports: 0 -> 28720K, 1 -> 1610K, -1 -> -3K
    planner: time 42.888: got partial result for host localhost disk /tank/backups: 0 -> -3K, 2 -> -3K, 3 -> -3K
    planner: time 42.888: got partial result for host localhost disk /: 0 -> 13425520K, -1 -> -3K, -1 -> -3K
    planner: time 43.636: got partial result for host localhost disk /srv/amanda/state: 0 -> 597500K, 1 -> 315320K, -1 -> -3K
    planner: time 43.636: got partial result for host localhost disk /srv/reports: 0 -> 28720K, 1 -> 1610K, -1 -> -3K
    planner: time 43.636: got partial result for host localhost disk /tank/backups: 0 -> -3K, 2 -> -3K, 3 -> -3K
    planner: time 43.636: got partial result for host localhost disk /: 0 -> 13425520K, -1 -> -3K, -1 -> -3K
    planner: time 18383.327: got partial result for host localhost disk /srv/amanda/state: 0 -> 597500K, 1 -> 315320K, -1 -> -3K
    planner: time 18383.327: got partial result for host localhost disk /srv/reports: 0 -> 28720K, 1 -> 1610K, -1 -> -3K
    planner: time 18383.327: got partial result for host localhost disk /tank/backups: 0 -> 1899165800K, 2 -> -3K, 3 -> -3K
    planner: time 18383.327: got partial result for host localhost disk /: 0 -> 13425520K, -1 -> -3K, -1 -> -3K
    error result for host localhost disk /srv/amanda/state: localhost NAK: timeout on reply pipe
    error result for host localhost disk /srv/reports: localhost NAK: timeout on reply pipe
    error result for host localhost disk /tank/backups: localhost NAK: timeout on reply pipe
    error result for host localhost disk /: localhost NAK: timeout on reply pipe
    planner: time 21601.312: getting estimates took 21601.025 secs
    FAILED QUEUE:
      0: localhost  /srv/amanda/state
      1: localhost  /srv/reports
      2: localhost  /tank/backups
      3: localhost  /   
    
    ...
    
    planner: FAILED localhost /srv/amanda/state 20170308000001 0 "[localhost NAK: timeout on reply pipe]"
    planner: FAILED localhost /srv/reports 20170308000001 0 "[localhost NAK: timeout on reply pipe]"
    planner: FAILED localhost /tank/backups 20170308000001 0 "[localhost NAK: timeout on reply pipe]"
    planner: FAILED localhost / 20170308000001 0 "[localhost NAK: timeout on reply pipe]"
    amanda_log: https://paste.ee/d/XkhGx

    Relevant info:
    Code:
    DISK planner localhost /
    DISK planner localhost /tank/backups
    DISK planner localhost /srv/reports
    DISK planner localhost /srv/amanda/state
    
    ...
    
    INFO planner Adding new disk localhost:/.
    WARNING planner Last full dump of localhost:/tank/backups on tape HAT006L4 overwritten in 4 runs.
    WARNING planner Last level 1 dump of localhost:/tank/backups on tape HAT001L4 overwritten in 4 runs.
    WARNING planner Last full dump of localhost:/srv/reports on tape HAT074L4 overwritten in 5 runs.
    INFO taper taper pid 26600
    WARNING planner Last full dump of localhost:/srv/amanda/state on tape HAT074L4 overwritten in 5 runs.
    INFO taper Slot 11 with label HAT030L4 is not reusable
    INFO taper Slot 12 with label HAT009L4 is not reusable
    INFO taper Slot 13 with label HAT065L4 is usable
    ERROR planner localhost NAK: timeout on reply pipe
    
    ...
    
    FAIL planner localhost /srv/amanda/state 20170308000001 0 "[localhost NAK: timeout on reply pipe]"
    FAIL planner localhost /srv/reports 20170308000001 0 "[localhost NAK: timeout on reply pipe]"
    FAIL planner localhost /tank/backups 20170308000001 0 "[localhost NAK: timeout on reply pipe]"
    FAIL planner localhost / 20170308000001 0 "[localhost NAK: timeout on reply pipe]"

  7. #7
    Join Date
    Nov 2005
    Location
    Canada
    Posts
    1,031

    Default

    SInce you get a timeout, you must check in the amandad and sendsize debug file to find why they didn't reply

  8. #8
    Join Date
    Mar 2017
    Posts
    13

    Default amandad and sendsize logs

    amandad_debug: https://paste.ee/p/jhsCI

    sendsize_log: https://paste.ee/p/BW2JE

    client_sendsize_log (referenced in sendsize log above): https://paste.ee/p/pNxBJ

    I'm seeing the error...

    Code:
    Wed Mar  8 05:06:24 2017: thd-0x615a00: sendsize: errmsg is /usr/libexec/amanda/runtar exited with status 1: see /var/log/amanda/client/hatteras/sendsize.20170308000002.debug
    but the final log references itself.

    Please advise.

  9. #9
    Join Date
    Mar 2017
    Posts
    13

    Default

    bump for inactivity.

  10. #10
    Join Date
    Mar 2017
    Posts
    13

    Default How can I debug the failed runtar command?

    How can I debug the failed runtar command?

    I tried to execute the failing command as amandabackup and received the following error:

    Code:
    /usr/libexec/amanda/runtar runtar hatteras /bin/tar --create --file /dev/null --numeric-owner --directory / --one-file-system --listed-incremental /var/lib/amanda/gnutar-lists/localhost__0.new --sparse --ignore-failed-read --totals .
    runtar: Can only be used to create tar archives
    Please advise.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •