amrecover with S3 tape stops after 2 blocks



droa
March 7th, 2017, 04:05 AM
We are using Amanda 3.4.3 with an "S3" tape (actually implemented via EMC ATMOS, so using STORAGE_API "S3").

Backups are working fine, and we can recover files using amrecover when the tapes are only 1 or 2 blocks long (blocksize set to 50M).

However, if we try to recover files from a tape that has more than 2 blocks, amrecover just hangs and never completes (I've waited for more than an hour...).
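
For context, the changer is defined along these lines (a from-memory sketch, not our exact file: the bucket and prefix match the debug log below, but the endpoint and credentials are placeholders):

# amanda.conf (sketch only)
define changer prod_s3 {
    tpchanger "chg-multi:s3:prod-backups/ProdSet/slot-{01,02,03,04,05,06,07,08,09,10,11,12,13,14,15,16}/tape"
    device-property "STORAGE_API" "S3"
    device-property "S3_HOST" "atmos.example.com"    # placeholder endpoint
    device-property "S3_ACCESS_KEY" "REDACTED"
    device-property "S3_SECRET_KEY" "REDACTED"
    device-property "BLOCK_SIZE" "50 mbytes"
    device-property "NB-THREADS-RECOVERY" "6"
}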

These are the last entries in the amidxtaped debug log:



...
Tue Mar 07 11:49:38.583421260 2017: pid 29133: thd-0x1dc1690: amidxtaped: parse_inventory: load slot 16 with label 'ProdSet-16'
Tue Mar 07 11:49:38.602256436 2017: pid 29133: thd-0x1dc1690: amidxtaped: /usr/local/share/perl5/Amanda/Recovery/Scan.pm:307:info:1200000 slot 16
Tue Mar 07 11:49:38.603489703 2017: pid 29133: thd-0x1dc1690: amidxtaped: S3 driver using bucket 'prod-backups', prefix 'ProdSet/slot-16/tape'
Tue Mar 07 11:49:38.603514074 2017: pid 29133: thd-0x1dc1690: amidxtaped: curl version: libcurl/7.53.1 OpenSSL/1.0.1e zlib/1.2.3 c-ares/1.12.0 libssh2/1.8.0 nghttp2/1.6.0
Tue Mar 07 11:49:38.603523865 2017: pid 29133: thd-0x1dc1690: amidxtaped: curl compiled for OPENSSL
Tue Mar 07 11:49:38.608284630 2017: pid 29133: thd-0x1dc1690: amidxtaped: Create 6 threads
Tue Mar 07 11:49:39.080822231 2017: pid 29133: thd-0x1dc1690: amidxtaped: Building type TAPESTART header of 0-52428800 bytes with name='ProdSet-16' disk='' dumplevel=0 and blocksize=0
Tue Mar 07 11:49:39.082250014 2017: pid 29133: thd-0x1dc1690: amidxtaped: /usr/local/share/perl5/Amanda/Recovery/Scan.pm:459:info:1200001 ProdSet-16
Tue Mar 07 11:49:39.082534856 2017: pid 29133: thd-0x1dc1690: amidxtaped: ignoring spurious Amanda::Recovery::Scan abort call
Tue Mar 07 11:49:39.502133244 2017: pid 29133: thd-0x1dc1690: amidxtaped: Amanda::Recovery::Clerk: successfully located first part for recovery
Tue Mar 07 11:49:39.508682900 2017: pid 29133: thd-0x1dc1690: amidxtaped: Building type FILE header of 128-32768 bytes with name='server.example.com' disk='/opt' dumplevel=0 and blocksize=32768
Tue Mar 07 11:49:39.508801525 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL >> HEADER-SEND-SIZE 551
Tue Mar 07 11:49:39.509333878 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL << HEADER-READY
Tue Mar 07 11:49:41.228676865 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL << HEADER-DONE
Tue Mar 07 11:49:41.228870784 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL >> STATE-SEND
Tue Mar 07 11:49:41.229378230 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL << STATE-READY
Tue Mar 07 11:49:41.230058205 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL << STATE-DONE
Tue Mar 07 11:49:41.230139287 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL >> USE-DAR NO
Tue Mar 07 11:49:41.230473653 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL << USE-DAR NO
Tue Mar 07 11:49:41.230585459 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL << AVAIL-DATAPATH AMANDA
Tue Mar 07 11:49:41.231037363 2017: pid 29133: thd-0x1dc1690: amidxtaped: Starting <Xfer@0x2ed9550 (<XferSourceRecovery@0x2ecd000> -> <XferDestFd@0x2cde800>)>
Tue Mar 07 11:49:41.231084087 2017: pid 29133: thd-0x1dc1690: amidxtaped: Final linkage: <XferSourceRecovery@0x2ecd000> -(PULL_BUFFER)-> <XferElementGlue@0x2e72010> -(WRITEFD)-> <XferDestFd@0x2cde800>
Tue Mar 07 11:49:41.231107431 2017: pid 29133: thd-0x1dc1690: amidxtaped: setup_impl: 3, 2
Tue Mar 07 11:49:41.231235341 2017: pid 29133: thd-0x1dc1690: amidxtaped: xfer_queue_message: MSG: <XMsg@0x2cdf350 type=XMSG_READY elt=<XferSourceRecovery@0x2ecd000> version=0>
Tue Mar 07 11:49:41.231267481 2017: pid 29133: thd-0x2cdf5c0: amidxtaped: pull_and_write
Tue Mar 07 11:49:41.231320900 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL >> USE-DATAPATH AMANDA
Tue Mar 07 11:49:41.232664610 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL << DATAPATH-OK
Tue Mar 07 11:49:41.232778307 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL >> DATA-SEND
Tue Mar 07 11:49:41.233137055 2017: pid 29133: thd-0x1dc1690: amidxtaped: CTL << DATA-READY
Tue Mar 07 11:49:41.233220578 2017: pid 29133: thd-0x1dc1690: amidxtaped: Amanda::Recovery::Clerk: starting recovery
Tue Mar 07 11:49:41.235146117 2017: pid 29133: thd-0x1dc1690: amidxtaped: Amanda::Recovery::Clerk: reading file 55 on 'ProdSet-16'
Tue Mar 07 11:49:56.241680180 2017: pid 29133: thd-0x1dc1690: amidxtaped: /usr/local/share/perl5/Amanda/Restore.pm:1698:info:4900000 86755 kb
Tue Mar 07 11:50:11.256969517 2017: pid 29133: thd-0x1dc1690: amidxtaped: /usr/local/share/perl5/Amanda/Restore.pm:1698:info:4900000 102400 kb
Tue Mar 07 11:50:26.271854228 2017: pid 29133: thd-0x1dc1690: amidxtaped: /usr/local/share/perl5/Amanda/Restore.pm:1698:info:4900000 102400 kb
Tue Mar 07 11:50:41.287620042 2017: pid 29133: thd-0x1dc1690: amidxtaped: /usr/local/share/perl5/Amanda/Restore.pm:1698:info:4900000 102400 kb
Tue Mar 07 11:50:56.295778675 2017: pid 29133: thd-0x1dc1690: amidxtaped: /usr/local/share/perl5/Amanda/Restore.pm:1698:info:4900000 102400 kb
Tue Mar 07 11:51:11.304232529 2017: pid 29133: thd-0x1dc1690: amidxtaped: /usr/local/share/perl5/Amanda/Restore.pm:1698:info:4900000 102400 kb
....


I've done a few tests to try to troubleshoot the issue, but haven't got anywhere:


amfetchdump works fine retrieving the full DLE that we are trying to recover; I can then manually extract our backed-up files OK.
enabling the VERBOSE S3 device parameter shows the first 2 blocks successfully retrieved, but no attempt to retrieve the third block
increasing the NB_RECOVERY_THREAD S3 device parameter increases the number of blocks retrieved to 2 x the number of threads (e.g. 6 threads causes 12 blocks to be retrieved)


Does anyone have any ideas what could be wrong?

Happy to supply more information if that helps.

Kind Regards
Dave

droa
March 7th, 2017, 04:14 AM
Just to add - this behaviour was the same in 3.4.1. I upgraded to 3.4.3 to see if it would help, which it didn't.

martineau
March 8th, 2017, 05:30 AM
Can you attach gdb to the running amidxtaped and get the stack trace of all threads?

$ gdb -p <pid of amidxtaped>
(gdb) thread apply all bt
...
(gdb) quit

droa
March 8th, 2017, 05:44 AM
Thanks for that, here's the output:



(gdb) thread apply all bt

Thread 2 (Thread 0x7f335a4ec700 (LWP 5252)):
#0 0x0000003dd7a0e7cd in write () from /lib64/libpthread.so.0
#1 0x0000003a9d25d226 in safe_write () from /usr/lib64/amanda/libamanda-3.4.3.so
#2 0x0000003a9d25ca8e in full_write () from /usr/lib64/amanda/libamanda-3.4.3.so
#3 0x0000003a9da069d9 in ?? () from /usr/lib64/amanda/libamxfer-3.4.3.so
#4 0x0000003a9da09c36 in ?? () from /usr/lib64/amanda/libamxfer-3.4.3.so
#5 0x0000003dd8e6a374 in ?? () from /lib64/libglib-2.0.so.0
#6 0x0000003dd7a07aa1 in start_thread () from /lib64/libpthread.so.0
#7 0x0000003dd76e8aad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f335e4de700 (LWP 5058)):
#0 0x0000003dd76df283 in poll () from /lib64/libc.so.6
#1 0x0000003dd8e449f9 in ?? () from /lib64/libglib-2.0.so.0
#2 0x0000003dd8e44e4c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#3 0x0000003a9d2393b1 in ?? () from /usr/lib64/amanda/libamanda-3.4.3.so
#4 0x00007f335c9fb42d in _wrap_run_c () from /usr/local/share/perl5/auto/Amanda/MainLoop/libMainLoop.so
#5 0x0000003e3daa6815 in Perl_pp_entersub () from /usr/lib64/perl5/CORE/libperl.so
#6 0x0000003e3daa4b06 in Perl_runops_standard () from /usr/lib64/perl5/CORE/libperl.so
#7 0x0000003e3da4d0d8 in perl_run () from /usr/lib64/perl5/CORE/libperl.so
#8 0x0000000000400e74 in main ()
(gdb) quit


Let me know if you need more information

martineau
March 8th, 2017, 05:53 AM
It is hung in a write.
The write goes either to a software decryption/decompression program (is your data compressed or encrypted?)
or to the amandad process.

droa
March 8th, 2017, 06:03 AM
The data is compressed - but not encrypted

martineau
March 8th, 2017, 06:14 AM
Client or server compressed?
Which compression are you using?
What is the decompression program doing? Get a stack trace of it.

The data flows from amidxtaped -> decompression -> amandad -> amrecover -> application.

Since amidxtaped is hung in a write, it must be one of the other processes that is not reading the data.

droa
March 8th, 2017, 06:26 AM
We are using "client fast" compression.

I can't see the decompression program running (I guess I should be looking for gzip?)

Here's the process list and all the stack traces I can get:



[root@localhost template.d]# ps -fu amandabackup
UID PID PPID C STIME TTY TIME CMD
490 12839 12836 0 14:18 ? 00:00:00 sshd: amandabackup@notty
490 12840 12839 0 14:18 ? 00:00:00 /usr/libexec/amanda/amandad -auth=ssh amindexd amidxtaped
490 12849 12840 0 14:18 ? 00:00:00 /usr/libexec/amanda/amindexd amandad ssh
490 12850 12840 0 14:18 ? 00:00:00 [amandad] <defunct>
490 13682 12840 10 14:22 ? 00:00:04 /usr/bin/perl /usr/libexec/amanda/amidxtaped amandad ssh
490 13683 12840 0 14:22 ? 00:00:00 [amandad] <defunct>
[root@localhost template.d]# pstack 12840
#0 0x0000003dd76e0b5b in writev () from /lib64/libc.so.6
#1 0x0000003a9d253fe3 in full_writev () from /usr/lib64/amanda/libamanda-3.4.3.so
#2 0x0000003a9d24c7a1 in tcpm_send_token () from /usr/lib64/amanda/libamanda-3.4.3.so
#3 0x0000003a9d24c916 in tcpm_stream_write () from /usr/lib64/amanda/libamanda-3.4.3.so
#4 0x0000000000406553 in ?? ()
#5 0x0000003a9d239206 in ?? () from /usr/lib64/amanda/libamanda-3.4.3.so
#6 0x0000003dd8e40642 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#7 0x0000003dd8e44c98 in ?? () from /lib64/libglib-2.0.so.0
#8 0x0000003dd8e44e4c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#9 0x0000003a9d2393b1 in ?? () from /usr/lib64/amanda/libamanda-3.4.3.so
#10 0x0000000000402be4 in main ()
[root@localhost template.d]# pstack 12849
#0 0x0000003dd76db670 in __read_nocancel () from /lib64/libc.so.6
#1 0x0000003dd7671ca8 in _IO_new_file_underflow () from /lib64/libc.so.6
#2 0x0000003dd76737ae in _IO_default_uflow_internal () from /lib64/libc.so.6
#3 0x0000003dd7667e8a in _IO_getline_info_internal () from /lib64/libc.so.6
#4 0x0000003dd7666ce9 in fgets () from /lib64/libc.so.6
#5 0x0000003a9d23a761 in debug_pgets () from /usr/lib64/amanda/libamanda-3.4.3.so
#6 0x00000000004066e6 in main ()
[root@localhost template.d]# pstack 13682
Thread 2 (Thread 0x7f2896b4d700 (LWP 13692)):
#0 0x0000003dd7a0e7cd in write () from /lib64/libpthread.so.0
#1 0x0000003a9d25d226 in safe_write () from /usr/lib64/amanda/libamanda-3.4.3.so
#2 0x0000003a9d25ca8e in full_write () from /usr/lib64/amanda/libamanda-3.4.3.so
#3 0x0000003a9da069d9 in ?? () from /usr/lib64/amanda/libamxfer-3.4.3.so
#4 0x0000003a9da09c36 in ?? () from /usr/lib64/amanda/libamxfer-3.4.3.so
#5 0x0000003dd8e6a374 in ?? () from /lib64/libglib-2.0.so.0
#6 0x0000003dd7a07aa1 in start_thread () from /lib64/libpthread.so.0
#7 0x0000003dd76e8aad in clone () from /lib64/libc.so.6


Should I be concerned by the defunct amandad processes?

martineau
March 8th, 2017, 06:31 AM
The defunct processes are normal; amanda does not wait for all of its child processes.
The gzip should be on the machine where you are running amrecover.
Can you also get the stack traces of amrecover and of that gzip?
Post the amrecover debug file.

droa
March 8th, 2017, 06:44 AM
I am running amrecover on the backup server, but I get the same issue when running it remotely. Note: we are using ssh authentication for amrecover.

Here's the process listing for amrecover:



[root@localhost amandad]# ps -fu root | grep pts/0
root 4862 4836 0 13:41 pts/0 00:00:00 su -
root 4876 4862 0 13:41 pts/0 00:00:00 -bash
root 12833 4876 0 14:18 pts/0 00:00:00 amrecover
root 12834 12833 0 14:18 pts/0 00:00:00 /usr/bin/ssh -x -o BatchMode=yes -o PreferredAuthentications=publickey -l amandabackup localhost /usr/libexec/amanda/amandad -auth=ssh
root 13693 12833 0 14:22 pts/0 00:00:00 /bin/gzip -dc
root 13695 12833 0 14:22 pts/0 00:00:00 tar --ignore-zeros --numeric-owner -xpGvf - .


Stack traces for amrecover and gzip:



[root@localhost amandad]# pstack 13693
#0 0x0000003dd76db6d0 in __write_nocancel () from /lib64/libc.so.6
#1 0x000000000040aade in ?? ()
#2 0x000000000040ab88 in ?? ()
#3 0x00000000004062a2 in ?? ()
#4 0x0000000000406b37 in ?? ()
#5 0x00000000004072d2 in ?? ()
#6 0x000000000040a22d in ?? ()
#7 0x0000000000403fb7 in ?? ()
#8 0x0000000000405b9a in ?? ()
#9 0x0000003dd761ed1d in __libc_start_main () from /lib64/libc.so.6
#10 0x0000000000401729 in ?? ()
#11 0x00007ffe70f55dc8 in ?? ()
#12 0x000000000000001c in ?? ()
#13 0x0000000000000002 in ?? ()
#14 0x00007ffe70f5793a in ?? ()
#15 0x00007ffe70f57944 in ?? ()
#16 0x0000000000000000 in ?? ()


[root@localhost amandad]# pstack 12833
Thread 3 (Thread 0x7f6a2a689700 (LWP 12835)):
#0 0x0000003dd7a0e82d in read () from /lib64/libpthread.so.0
#1 0x0000003dd8e41c9b in ?? () from /lib64/libglib-2.0.so.0
#2 0x0000003dd8e6a374 in ?? () from /lib64/libglib-2.0.so.0
#3 0x0000003dd7a07aa1 in start_thread () from /lib64/libpthread.so.0
#4 0x0000003dd76e8aad in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f6a23df5700 (LWP 13694)):
#0 0x0000003dd7a0e7cd in write () from /lib64/libpthread.so.0
#1 0x0000003a9d25d226 in safe_write () from /usr/lib64/amanda/libamanda-3.4.3.so
#2 0x0000003a9d25ca8e in full_write () from /usr/lib64/amanda/libamanda-3.4.3.so
#3 0x0000000000409e76 in ?? ()
#4 0x0000003dd8e6a374 in ?? () from /lib64/libglib-2.0.so.0
#5 0x0000003dd7a07aa1 in start_thread () from /lib64/libpthread.so.0
#6 0x0000003dd76e8aad in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f6a2a8987c0 (LWP 12833)):
#0 0x0000003dd7a0e7cd in write () from /lib64/libpthread.so.0
#1 0x0000003a9d25d226 in safe_write () from /usr/lib64/amanda/libamanda-3.4.3.so
#2 0x0000003a9d25ca8e in full_write () from /usr/lib64/amanda/libamanda-3.4.3.so
#3 0x000000000040e1db in ?? ()
#4 0x0000003a9d249c26 in ?? () from /usr/lib64/amanda/libamanda-3.4.3.so
#5 0x0000003a9d24b965 in ?? () from /usr/lib64/amanda/libamanda-3.4.3.so
#6 0x0000003a9d239206 in ?? () from /usr/lib64/amanda/libamanda-3.4.3.so
#7 0x0000003dd8e40642 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#8 0x0000003dd8e44c98 in ?? () from /lib64/libglib-2.0.so.0
#9 0x0000003dd8e44e4c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#10 0x0000003a9d2393b1 in ?? () from /usr/lib64/amanda/libamanda-3.4.3.so
#11 0x0000003a9d24c595 in tcpm_stream_read_sync () from /usr/lib64/amanda/libamanda-3.4.3.so
#12 0x00000000004082ca in get_amidxtaped_line ()
#13 0x000000000040e8b5 in writer_intermediary ()
#14 0x000000000040f86e in extract_files ()
#15 0x000000000041421a in yyparse ()
#16 0x0000000000415279 in process_line ()
#17 0x0000000000406eb3 in main ()


And the last lines of the amrecover debug log:



Wed Mar 08 14:22:08.027125634 2017: pid 12833: thd-0x1782350: amrecover: append_to_tapelist(tapelist=(nil), storage='ProdSet', label='ProdSet-7', file=-1, partnum=-1, isafile=0)
Wed Mar 08 14:22:08.027159285 2017: pid 12833: thd-0x1782350: amrecover: append_to_tapelist(tapelist=0x17c8790, storage='ProdSet', label='ProdSet-7', file=124, partnum=-1, isafile=0)
Wed Mar 08 14:22:08.027185789 2017: pid 12833: thd-0x1782350: amrecover: append_to_tapelist(tapelist=(nil), storage='ProdSet', label='ProdSet-7', file=-1, partnum=-1, isafile=0)
Wed Mar 08 14:22:08.027198438 2017: pid 12833: thd-0x1782350: amrecover: append_to_tapelist(tapelist=0x17c8790, storage='ProdSet', label='ProdSet-7', file=124, partnum=-1, isafile=0)
Wed Mar 08 14:22:08.027226141 2017: pid 12833: thd-0x1782350: amrecover: Requesting tape ProdSet-7 from user
Wed Mar 08 14:22:09.426134029 2017: pid 12833: thd-0x1782350: amrecover: User prompt: 'Continue [?/Y/n/s/d]? '; response: 'y'
Wed Mar 08 14:22:09.426217116 2017: pid 12833: thd-0x1782350: amrecover: security_getdriver(name=ssh) returns 0x3a9d48d2a0
Wed Mar 08 14:22:09.426239316 2017: pid 12833: thd-0x1782350: amrecover: security_handleinit(handle=0x17cc650, driver=0x3a9d48d2a0 (SSH))
Wed Mar 08 14:22:09.426263522 2017: pid 12833: thd-0x1782350: amrecover: security_streaminit(stream=0x17ceb30, driver=0x3a9d48d2a0 (SSH))
Wed Mar 08 14:22:09.722214283 2017: pid 12833: thd-0x1782350: amrecover: security_streaminit(stream=0x17d6bd0, driver=0x3a9d48d2a0 (SSH))
Wed Mar 08 14:22:09.722241069 2017: pid 12833: thd-0x1782350: amrecover: amidxtaped_streams[0].fd = 0x17d6bd0
Wed Mar 08 14:22:09.722277781 2017: pid 12833: thd-0x1782350: amrecover: security_streaminit(stream=0x17dec70, driver=0x3a9d48d2a0 (SSH))
Wed Mar 08 14:22:09.722291107 2017: pid 12833: thd-0x1782350: amrecover: amidxtaped_streams[1].fd = 0x17dec70
Wed Mar 08 14:22:09.722337534 2017: pid 12833: thd-0x1782350: amrecover: security_streaminit(stream=0x17e6d10, driver=0x3a9d48d2a0 (SSH))
Wed Mar 08 14:22:09.722353586 2017: pid 12833: thd-0x1782350: amrecover: amidxtaped_streams[2].fd = 0x17e6d10
Wed Mar 08 14:22:09.722362280 2017: pid 12833: thd-0x1782350: amrecover: security_close(handle=0x17cc650, driver=0x3a9d48d2a0 (SSH))
Wed Mar 08 14:22:09.722371312 2017: pid 12833: thd-0x1782350: amrecover: security_stream_close(0x17ceb30)
Wed Mar 08 14:22:09.722399058 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: FEATURES=ffffffff9efefbfffffffffffffff3fffbf70f

Wed Mar 08 14:22:09.722512107 2017: pid 12833: thd-0x1782350: amrecover: ignoring close stream 2
Wed Mar 08 14:22:09.722959836 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: CONFIG=ProdSet

Wed Mar 08 14:22:09.722998558 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: LABEL=ProdSet:ProdSet-7:124

Wed Mar 08 14:22:09.723031232 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: FSF=124

Wed Mar 08 14:22:09.723062502 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: HEADER

Wed Mar 08 14:22:09.723082581 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: DEVICE=changer

Wed Mar 08 14:22:09.723094186 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: HOST=^localhost.example.com$

Wed Mar 08 14:22:09.723105152 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: DISK=^/opt$

Wed Mar 08 14:22:09.723115621 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: DATESTAMP=20170307183001

Wed Mar 08 14:22:09.723126099 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: END

Wed Mar 08 14:22:10.979256953 2017: pid 12833: thd-0x1782350: amrecover: get amidxtaped line: HEADER-SEND-SIZE 551
Wed Mar 08 14:22:10.979303434 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: HEADER-READY

Wed Mar 08 14:22:10.979680303 2017: pid 12833: thd-0x1782350: amrecover: read header 551 => 551 (551)
Wed Mar 08 14:22:14.250123055 2017: pid 12833: thd-0x1782350: amrecover: User prompt: 'Continue [?/Y/n]? '; response: 'y'
Wed Mar 08 14:22:14.250194901 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: HEADER-DONE

Wed Mar 08 14:22:14.250923455 2017: pid 12833: thd-0x1782350: amrecover: get amidxtaped line: STATE-SEND
Wed Mar 08 14:22:14.250955829 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: STATE-READY

Wed Mar 08 14:22:14.251611740 2017: pid 12833: thd-0x1782350: amrecover: security_stream_seterr(0x17e6d10, EOF)
Wed Mar 08 14:22:14.251637071 2017: pid 12833: thd-0x1782350: amrecover: security_stream_close(0x17e6d10)
Wed Mar 08 14:22:14.251662769 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: STATE-DONE

Wed Mar 08 14:22:14.252106856 2017: pid 12833: thd-0x1782350: amrecover: get amidxtaped line: USE-DAR NO
Wed Mar 08 14:22:14.252124664 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: USE-DAR NO

Wed Mar 08 14:22:14.252145165 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: AVAIL-DATAPATH AMANDA

Wed Mar 08 14:22:14.253390178 2017: pid 12833: thd-0x1782350: amrecover: get amidxtaped line: USE-DATAPATH AMANDA
Wed Mar 08 14:22:14.253407657 2017: pid 12833: thd-0x1782350: amrecover: Using AMANDA data-path
Wed Mar 08 14:22:14.253433682 2017: pid 12833: thd-0x1782350: amrecover: image is compressed
Wed Mar 08 14:22:14.253448408 2017: pid 12833: thd-0x1782350: amrecover: Spawning "/bin/gzip /bin/gzip -dc" in pipeline
Wed Mar 08 14:22:14.254344549 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: DATAPATH-OK

Wed Mar 08 14:22:14.254414851 2017: pid 13695: thd-0x1782350: amrecover: Exec'ing /bin/tar with arguments:
Wed Mar 08 14:22:14.254495024 2017: pid 13695: thd-0x1782350: amrecover: tar
Wed Mar 08 14:22:14.254514598 2017: pid 13695: thd-0x1782350: amrecover: --ignore-zeros
Wed Mar 08 14:22:14.254531309 2017: pid 13695: thd-0x1782350: amrecover: --numeric-owner
Wed Mar 08 14:22:14.254547597 2017: pid 13695: thd-0x1782350: amrecover: -xpGvf
Wed Mar 08 14:22:14.254564689 2017: pid 13695: thd-0x1782350: amrecover: -
Wed Mar 08 14:22:14.254581329 2017: pid 13695: thd-0x1782350: amrecover: .
Wed Mar 08 14:22:14.254789789 2017: pid 12833: thd-0x1782350: amrecover: get amidxtaped line: DATA-SEND
Wed Mar 08 14:22:14.254807253 2017: pid 12833: thd-0x1782350: amrecover: send_to_tape_server: DATA-READY

Wed Mar 08 14:22:20.291140247 2017: pid 12833: thd-0x1782350: amrecover: stream_read_callback: data is still flowing



Let me know if you need the full logfile, as I'll have to do some redacting on it.

The thing I don't understand is why this works for a small filesystem recovery, but not for a large(r) one...

martineau
March 8th, 2017, 06:53 AM
Since you run amrecover on the server, can you try with the 'local' auth?
$ amrecover -oauth=local ...

droa
March 8th, 2017, 06:57 AM
OK - just tried that, and there's no difference; it hangs in the same place.

martineau
March 8th, 2017, 08:21 AM
Try with:
device-property "nb-threads-backup" "1"
device-property "nb-threads-recovery" "1"
device_property "VERBOSE" "YES"

And post the complete amidxtaped debug file (add a '.txt' extension to be able to attach it).

droa
March 8th, 2017, 08:41 AM
I've uploaded the logfile as a zip (it wouldn't upload as a plain txt file). I've had to alter some elements of the log (IPs, hostnames, etc.) but it should be structurally unchanged.

Thanks for continuing to look at this.

Dave

martineau
March 8th, 2017, 08:57 AM
You said you can restore it with amfetchdump; can you restore it again and post the amfetchdump debug file?
I would like to compare them.

droa
March 8th, 2017, 09:15 AM
I have attached a similarly redacted amfetchdump log. I'm coming to the end of our day here, so I'll pick this up again tomorrow morning (UK) if you need any more info.

Once again, many thanks for looking into this.

Kind Regards
Dave

martineau
March 8th, 2017, 11:29 AM
Everything looks good.
In the amrecover console, did it show that it read some data?
Do you see anything at the same time in the system error log?

droa
March 8th, 2017, 11:58 AM
amrecover shows the directories which are created on the recovery filesystem, but then it just stops.

I've checked /var/log/messages and nothing is shown. The OS is CentOS 6.8 (64-bit).

I was looking at the stack traces for amfetchdump - and there were three threads running when it was pulling data from S3. In amidxtaped there were only two. Is it possible that something has caused the retrieving thread to die without telling its parent - so the parent is just left waiting for data that is never going to come?

I'm at home now - but can send you the amfetchdump stacks tomorrow if that helps?

martineau
March 9th, 2017, 08:47 AM
There is one more thread in amfetchdump because it does the decompression itself; in the amidxtaped case the decompression is done by amrecover.

Add the following line to amanda.conf:


debug-recovery 9

The new debugging information might help us to understand what is going on.

droa
March 9th, 2017, 10:01 AM
Hi Jean-Louis

Thanks for that - I've enabled it. I've been working on this today, and have now managed to reproduce the problem on a desktop VM using AWS S3 (not EMC ATMOS).

The steps I followed were:

1) Create a simple backup of localhost / (5G), /var (3G), /boot (75M)
2) Back up to S3
3) Back up again to S3 (force an L1 backup if necessary)
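
For reference, the dumptype was nothing special - roughly this (a sketch: the dumptype name is mine, but it uses the same "client fast" compression as the production set):

define dumptype comp-tar {
    program "GNUTAR"
    compress client fast
}

# disklist
localhost /     comp-tar
localhost /var  comp-tar
localhost /boot comp-tar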



[amandabackup@localhost ~]$ amadmin ProdSet info localhost

Current info for localhost /boot:
Stats: dump rates (kps), Full: 16246.5, 12997.2, -1.0
Incremental: -1.0, -1.0, -1.0
compressed size, Full: 93.3%, 93.3%,-100.0%
Incremental: -100.0%,-100.0%,-100.0%
Dumps: lev datestmp tape file origK compK secs
0 20170309 ProdSet-6 3 69680 64986 4

Current info for localhost /var:
Stats: dump rates (kps), Full: 14263.8, -1.0, -1.0
Incremental: 2117.5, 1849.0, 1580.0
compressed size, Full: 51.2%,-100.0%,-100.0%
Incremental: 39.7%, 37.3%, 34.4%
Dumps: lev datestmp tape file origK compK secs
0 20170309 ProdSet-1 1 3176180 1626071 114
1 20170309 ProdSet-6 1 10660 4235 2

Current info for localhost /:
Stats: dump rates (kps), Full: 11181.8, 11512.3, 11685.0
Incremental: 159.7, -1.0, -1.0
compressed size, Full: 53.1%, 53.1%, 53.1%
Incremental: 7.1%,-100.0%,-100.0%
Dumps: lev datestmp tape file origK compK secs
0 20170309 ProdSet-5 3 4397330 2337000 209
1 20170309 ProdSet-6 2 13500 958 6


4) Attempt a recovery of /



[root@localhost data]# amrecover -oauth=local
AMRECOVER Version 3.4.3. Contacting server on localhost ...
220 localhost AMANDA index server (3.4.3) ready.
Setting restore date to today (2017-03-09)
200 Working date set to 2017-03-09.
200 Config set to ProdSet.
501 Host localhost.localdomain is not in your disklist.
Trying host localhost ...
200 Dump host set to localhost.
Use the setdisk command to choose dump disk to recover
amrecover> setdisk /
200 Disk set to /.
amrecover> add .
Added dir /. at date 2017-03-09-16-34-49
Added dir /. at date 2017-03-09-17-12-06
amrecover> extract

Extracting files using tape drive changer on host localhost.
The following tapes are needed: ProdSet-5
ProdSet-6

Extracting files using tape drive changer on host localhost.
Load tape ProdSet-5 now
Continue [?/Y/n/s/d]? y
Restoring files into directory /data
All existing files in /data can be deleted
Continue [?/Y/n]? y

./ kb
./bin/
./boot/
./cgroup/
./data/
./dev/
./etc/
./etc/ConsoleKit/
./etc/ConsoleKit/run-seat.d/
./etc/ConsoleKit/run-session.d/
./etc/ConsoleKit/seats.d/
./etc/NetworkManager/
./etc/NetworkManager/dispatcher.d/
./etc/X11/
./etc/X11/applnk/
./etc/X11/fontpath.d/
./etc/abrt/
./etc/abrt/plugins/
./etc/acpi/
./etc/acpi/actions/
./etc/acpi/events/
./etc/alsa/
./etc/alternatives/
....
./lib/modules/2.6.32-642.3.1.el6.x86_64/kernel/drivers/media/video/cx88/
./lib/modules/2.6.32-642.3.1.el6.x86_64/kernel/drivers/media/video/em28xx/
./lib/modules/2.6.32-642.3.1.el6.x86_64/kernel/drivers/media/video/gspca/


Note: The console just hangs at this point.

The amidxtaped log ends with:



[amandabackup@localhost ProdSet]$ tail -f amidxtaped.20170309174247.debug
Thu Mar 09 17:42:50.234280886 2017: pid 6934: thd-0x1133400: amidxtaped: CTL >> USE-DATAPATH AMANDA
Thu Mar 09 17:42:50.234310691 2017: pid 6934: thd-0x22733d0: amidxtaped: pull_and_write
Thu Mar 09 17:42:50.235815562 2017: pid 6934: thd-0x1133400: amidxtaped: CTL << DATAPATH-OK
Thu Mar 09 17:42:50.235944829 2017: pid 6934: thd-0x1133400: amidxtaped: CTL >> DATA-SEND
Thu Mar 09 17:42:50.236185131 2017: pid 6934: thd-0x1133400: amidxtaped: CTL << DATA-READY
Thu Mar 09 17:42:50.236321503 2017: pid 6934: thd-0x1133400: amidxtaped: Amanda::Recovery::Clerk: starting recovery
Thu Mar 09 17:42:50.238149965 2017: pid 6934: thd-0x1133400: amidxtaped: Amanda::Recovery::Clerk: reading file 3 on 'ProdSet-5'
Thu Mar 09 17:42:50.238194434 2017: pid 6934: thd-0x1133400: amidxtaped: XSR: start_part called
Thu Mar 09 17:42:50.238205540 2017: pid 6934: thd-0x1133400: amidxtaped: XSR: triggering condition variable
Thu Mar 09 17:42:50.238284491 2017: pid 6934: thd-0x22733d0: amidxtaped: XSR: first pull_buffer of new part
Thu Mar 09 17:43:05.250453791 2017: pid 6934: thd-0x1133400: amidxtaped: /usr/local/share/perl5/Amanda/Restore.pm:1698:info:4900000 67768 kb
Thu Mar 09 17:43:20.266313401 2017: pid 6934: thd-0x1133400: amidxtaped: /usr/local/share/perl5/Amanda/Restore.pm:1698:info:4900000 102400 kb
Thu Mar 09 17:43:35.281422591 2017: pid 6934: thd-0x1133400: amidxtaped: /usr/local/share/perl5/Amanda/Restore.pm:1698:info:4900000 102400 kb
...


As this is now on a local VM, I have compiled the RPMs from source and installed the debug RPMs, so I can now get full backtraces:

amrecover:



(gdb) thread apply all bt

Thread 2 (Thread 0x7f1d20820700 (LWP 7168)):
#0 0x00000034f4e0e7cd in write () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007f1d273ffd46 in safe_write (fd=17, buf=0x7f1d2081bd90, count=16384) at safe-read.c:66
#2 0x00007f1d273ff5ae in full_write (fd=17, buf=<value optimized out>, count=16384) at full-write.c:65
#3 0x0000000000409e16 in handle_crc_thread (data=0x625b40) at extract_list.c:3589
#4 0x00000034f626a374 in g_thread_create_proxy (data=0xc905c0) at gthread.c:1955
#5 0x00000034f4e07aa1 in start_thread (arg=0x7f1d20820700) at pthread_create.c:301
#6 0x00000034f4ae8aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7f1d268c27c0 (LWP 7158)):
#0 0x00000034f4e0e7cd in write () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007f1d273ffd46 in safe_write (fd=8, buf=0xcb16e0, count=32768) at safe-read.c:66
#2 0x00007f1d273ff5ae in full_write (fd=8, buf=<value optimized out>, count=32768) at full-write.c:65
#3 0x000000000040e15b in read_amidxtaped_data (cookie=0x625b80, buf=0xcb16e0, size=32768) at extract_list.c:3313
#4 0x00007f1d273ec736 in stream_read_callback (arg=0xca15a0) at security-util.c:2246
#5 0x00007f1d273ee475 in sec_tcp_conn_read_callback (cookie=0xc628f0) at security-util.c:2382
#6 0x00007f1d273dbca6 in event_handle_callback (user_ptr=0xc89600) at event.c:118
#7 0x00000034f6240642 in g_main_dispatch (context=0xc62fb0) at gmain.c:2441
#8 g_main_context_dispatch (context=0xc62fb0) at gmain.c:3014
#9 0x00000034f6244c98 in g_main_context_iterate (context=0xc62fb0, block=1, dispatch=1, self=<value optimized out>) at gmain.c:3092
#10 0x00000034f6244e4c in g_main_context_iteration (context=0xc62fb0, may_block=1) at gmain.c:3155
#11 0x00007f1d273dbe49 in event_loop_wait (wait_eh=0xc8fbe0, nonblock=0, return_when_empty=1) at event.c:427
#12 0x00007f1d273ef0a5 in tcpm_stream_read_sync (s=0xc99500, buf=0x7ffcfdc350a8) at security-util.c:447
#13 0x00000000004082aa in get_amidxtaped_line () at extract_list.c:2943
#14 0x000000000040e835 in writer_intermediary (elist=<value optimized out>) at extract_list.c:2274
#15 0x000000000040f7ee in extract_files () at extract_list.c:2633
#16 0x000000000041419a in yyparse () at uparse.y:258
#17 0x0000000000415189 in process_line (line=<value optimized out>) at uscan.l:199
#18 0x0000000000406e93 in main (argc=<value optimized out>, argv=<value optimized out>) at amrecover.c:644


amidxtaped:



(gdb) thread apply all bt

Thread 2 (Thread 0x7f1af5ab9700 (LWP 7166)):
#0 0x00000034f4e0e7cd in write () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007f1afae01d46 in safe_write (fd=6, buf=0x7f1aecdff010, count=52428800) at safe-read.c:66
#2 0x00007f1afae015ae in full_write (fd=6, buf=<value optimized out>, count=52428800) at full-write.c:65
#3 0x00007f1afb07f9b9 in pull_and_write (self=0x33ed020 [XferElementGlue]) at element-glue.c:411
#4 0x00007f1afb082c16 in worker_thread (data=<value optimized out>) at element-glue.c:1380
#5 0x00000034f626a374 in g_thread_create_proxy (data=0x33ecf00) at gthread.c:1955
#6 0x00000034f4e07aa1 in start_thread (arg=0x7f1af5ab9700) at pthread_create.c:301
#7 0x00000034f4ae8aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7f1afc0de700 (LWP 7163)):
#0 0x00000034f4adf283 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x00000034f62449f9 in g_main_context_poll (context=0x2ed9c30, block=1, dispatch=1, self=<value optimized out>) at gmain.c:3405
#2 g_main_context_iterate (context=0x2ed9c30, block=1, dispatch=1, self=<value optimized out>) at gmain.c:3087
#3 0x00000034f6244e4c in g_main_context_iteration (context=0x2ed9c30, may_block=1) at gmain.c:3155
#4 0x00007f1afaddde49 in event_loop_wait (wait_eh=0x0, nonblock=0, return_when_empty=0) at event.c:427
#5 0x00007f1af966141d in _wrap_run_c (my_perl=<value optimized out>, cv=<value optimized out>) at Amanda/MainLoop.c:2017
#6 0x0000003a926a6815 in Perl_pp_entersub () from /usr/lib64/perl5/CORE/libperl.so
#7 0x0000003a926a4b06 in Perl_runops_standard () from /usr/lib64/perl5/CORE/libperl.so
#8 0x0000003a9264d0d8 in perl_run () from /usr/lib64/perl5/CORE/libperl.so
#9 0x0000000000400e74 in main ()



gzip:



(gdb) thread apply all bt

Thread 1 (process 7167):
#0 0x00000034f4adb6d0 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82
#1 0x000000000040aade in write_buffer (fd=1, buf=0x64b440, cnt=32768) at util.c:190
#2 write_buf (fd=1, buf=0x64b440, cnt=32768) at util.c:233
#3 0x000000000040ab88 in flush_window () at util.c:216
#4 0x00000000004062a2 in inflate_codes (tl=0x1b20020, td=0x1b225c0, bl=9, bd=6) at inflate.c:612
#5 0x0000000000406b37 in inflate_dynamic () at inflate.c:891
#6 0x00000000004072d2 in inflate () at inflate.c:969
#7 0x000000000040a22d in unzip (in=<value optimized out>, out=<value optimized out>) at unzip.c:138
#8 0x0000000000403fb7 in treat_stdin () at gzip.c:807
#9 0x0000000000405b9a in main (argc=2, argv=0x7fff49f13358) at gzip.c:714


tar:



(gdb) thread apply all bt

Thread 1 (Thread 0x7fdda816a7a0 (LWP 7169)):
#0 0x00000034f4adb6d0 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82
#1 0x00000034f4a71a93 in _IO_new_file_write (f=0x34f4d8f040, data=<value optimized out>, n=58) at fileops.c:1268
#2 0x00000034f4a73045 in new_do_write (fp=0x34f4d8f040, data=0x7fdda8173000 "./opt/puppetlabs/puppet/lib/ruby/2.1.0/rubygems/core_ext/\nlate/json_index/js/\n/\ninttypes/\n", to_do=58) at fileops.c:522
#3 _IO_new_do_write (fp=0x34f4d8f040, data=0x7fdda8173000 "./opt/puppetlabs/puppet/lib/ruby/2.1.0/rubygems/core_ext/\nlate/json_index/js/\n/\ninttypes/\n", to_do=58) at fileops.c:495
#4 0x00000034f4a72248 in _IO_new_file_sync (fp=0x34f4d8f040) at fileops.c:897
#5 0x00000034f4a709f3 in fflush_unlocked (fp=<value optimized out>) at iofflush_u.c:40
#6 0x0000000000416398 in simple_print_header (st=<value optimized out>, blk=0xfbe400, block_ordinal=<value optimized out>) at list.c:1314
#7 0x000000000040dd38 in extract_archive () at extract.c:1400
#8 0x00000000004178eb in read_and (do_something=0x40db60 <extract_archive>) at list.c:186
#9 0x000000000041f247 in main (argc=<value optimized out>, argv=<value optimized out>) at tar.c:2692


[root@localhost ProdSet]# strace -s 255 -p 7169
Process 7169 attached
write(1, "./opt/puppetlabs/puppet/lib/ruby/2.1.0/rubygems/core_ext/\n", 58^CProcess 7169 detached
<detached ...>

[root@localhost ProdSet]# ls -l /proc/7169/fd/1
l-wx------. 1 root root 64 Mar 9 18:00 /proc/7169/fd/1 -> pipe:[716533]



It looks to me like tar is stuck writing to a pipe?

The reader for the pipe is proc 7158:

[root@localhost ProdSet]# ls -l /proc/*/fd/* 2>/dev/null | grep 716533
lr-x------. 1 root root 64 Mar 9 18:01 /proc/7158/fd/9 -> pipe:[716533]
l-wx------. 1 root root 64 Mar 9 18:00 /proc/7169/fd/1 -> pipe:[716533]

which is amrecover:



[root@localhost ProdSet]# ps -fp 7158
UID PID PPID C STIME TTY TIME CMD
root 7158 4966 0 17:48 pts/0 00:00:00 amrecover -oauth=local


But the amrecover backtrace above shows two threads that are both blocked writing - nothing appears to be reading from that pipe.
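
To convince myself that this is enough to wedge the whole pipeline, here is a trivial standalone program (a hypothetical demo I wrote for this test - not Amanda code) that reproduces the state pstack shows for tar: once a pipe's buffer (typically 64 KiB on Linux) is full and nobody reads the other end, write() blocks forever:

/* pipe_hang_demo.c - hypothetical illustration, not Amanda source.
 * Build and run: gcc -o pipe_hang_demo pipe_hang_demo.c && ./pipe_hang_demo */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char buf[4096];
    long total = 0;

    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));

    /* Nothing ever reads fds[0]. Once the kernel pipe buffer fills
     * (the last line printed is at ~64 KiB), write() blocks
     * indefinitely - the same state the tar backtrace shows. */
    for (;;) {
        ssize_t n = write(fds[1], buf, sizeof(buf));
        if (n < 0) {
            perror("write");
            return 1;
        }
        total += n;
        fprintf(stderr, "wrote %ld bytes so far\n", total);
    }
}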

Let me know if you need more info - I'll take my laptop back with me tonight, so I can now recreate this at home.

Kind Regards
Dave

martineau
March 9th, 2017, 10:29 AM
I think you found the problem.

Can you do another test to be sure?
In recover-src/extract_list.c, change line 2015 from

g_ptr_array_add(argv_ptr, g_strdup("-xpGvf"));
to

g_ptr_array_add(argv_ptr, g_strdup("-xpGf"));
(remove the 'v' flag)

tar will write nothing, so it should not hang; amrecover will not show the list of files, but the extraction should succeed.

droa
March 9th, 2017, 12:17 PM
Thanks for this - that seems to work around the issue. I assume that's not the proper fix(!)?

Also - whilst I was building today, I had to add a section to packaging/rpm/amanda.spec.src to get it to build on CentOS 6:



%if %(awk '$1 == "CentOS" && $3 ~ /6.*/ { exit 1; }' /etc/redhat-release; echo $?)
%define dist redhat
%define disttag rhel
%define distver 6
%define curl curl
%endif

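(For anyone curious: the awk expression exits 1 when /etc/redhat-release contains a matching "CentOS ... 6.x" line, so the trailing 'echo $?' turns that match into a 1 - i.e. true - for the %if.)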

Could this be included?

Let me know if you come up with a full patch for this - I'd be happy to test it for you.

Dave

martineau
March 9th, 2017, 06:32 PM
Try the attached patch!

Don't forget to re-add the 'v' flag in the tar command:

g_ptr_array_add(argv_ptr, g_strdup("-xpGvf"));

martineau
March 10th, 2017, 06:02 AM
I committed the fix to packaging/rpm/amanda.spec.src

droa
March 10th, 2017, 07:10 AM
Thanks for that. I tried the patch and that seems to fully resolve the issue. Many thanks for turning this round so quickly.

One other thing - I noticed that the rpm uninstall script was broken - there is a fix for it here:

https://github.com/driseley/amanda/commit/92a54bac8800b297aa30b220ae7bad8aaa39bf95

Will this fix, and the amrecover fix, be included in your next release?

Kind Regards
Dave

martineau
March 10th, 2017, 09:23 AM
All fixes are committed and will be in 3.4.4.

Thanks for your help - you did a great job identifying the bug.

droa
March 10th, 2017, 09:31 AM
No problems - thanks for fixing it!

Kind Regards
Dave

abanghasan
March 12th, 2017, 12:33 AM
I'll try to figure out this problem.