PDA

View Full Version : Taper process hanging



AlanG
May 30th, 2017, 04:33 AM
Hi,

Have problems with the backup run hanging and it appears to be the taper process.

Sometimes the run will complete, but more often than not most of the dumps will remain on the holding disk and the taper process will stop updating the log. I have to kill taper to get amdump to complete and send the mail report.

Running amflush on the remaining dumps will sometimes complete, but usually hangs as well.

I've pushed large quantities of data to the tape drive via tar, without issue, so can't see a problem there.

The taper always hangs on this point in the log,

taper: Building type SPLIT_FILE header of 65536-65536 bytes with name='mgt-ams-satdb-01.XXX' disk='/var/log/audit' dumplevel=1 and blocksize=65536

But with a different DLE each time.

Running strace on the taper I see this repeated indefinitely.

poll([{fd=0, events=POLLIN}, {fd=3, events=POLLIN}], 2, 5000) = 0 (Timeout)
lseek(4, 0, SEEK_SET) = 0
lseek(4, 0, SEEK_CUR) = 0
write(4, "0", 1) = 1
ftruncate(4, 1) = 0
poll([{fd=0, events=POLLIN}, {fd=3, events=POLLIN}], 2, 5000) = 0 (Timeout)
lseek(4, 0, SEEK_SET) = 0
lseek(4, 0, SEEK_CUR) = 0
write(4, "0", 1) = 1
ftruncate(4, 1) = 0
poll([{fd=0, events=POLLIN}, {fd=3, events=POLLIN}], 2, 5000) = 0 (Timeout)
lseek(4, 0, SEEK_SET) = 0
lseek(4, 0, SEEK_CUR) = 0
write(4, "0", 1) = 1
ftruncate(4, 1) = 0

I was on 3.4, upgraded to 3.4.4 to see if that resolved it, but still remains the same.

Any assistance appreciated.

Thanks,

Alan

martineau
May 30th, 2017, 05:06 AM
Can you get the stacktrace of all threads,
in gdb, run: thread apply all bt
Also post the taper debug file

AlanG
May 30th, 2017, 08:35 AM
Hi,

Here is the stack trace against taper process. It complained about a number of missing debug pkgs, so not sure if this is useful without them?


Thread 4 (Thread 0x7f3798b7d700 (LWP 11709)):
#0 0x00007f37a1894e2d in poll () from /lib64/libc.so.6
#1 0x00007f37a016204c in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#2 0x00007f37a016216c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#3 0x00007f37a01621a9 in glib_worker_main () from /lib64/libglib-2.0.so.0
#4 0x00007f37a01870f5 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#5 0x00007f37a1b70dc5 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f37a189f76d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f3793fff700 (LWP 11760)):
#0 0x00007f37a1899c29 in syscall () from /lib64/libc.so.6
#1 0x00007f37a01a494f in g_cond_wait () from /lib64/libglib-2.0.so.0
#2 0x00007f379ac47a8a in device_thread_wait_for_block (eof_flag=<synthetic pointer>, self=0x31d7080) at xfer-dest-taper-splitter.c:389
#3 device_thread_write_part (self=<optimized out>) at xfer-dest-taper-splitter.c:565
#4 device_thread (data=<optimized out>) at xfer-dest-taper-splitter.c:745
#5 0x00007f37a01870f5 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#6 0x00007f37a1b70dc5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007f37a189f76d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f37937fe700 (LWP 11761)):
#0 0x00007f37a1899c29 in syscall () from /lib64/libc.so.6
#1 0x00007f37a01a494f in g_cond_wait () from /lib64/libglib-2.0.so.0
#2 0x00007f379aea55fa in holding_thread (data=<optimized out>) at xfer-source-holding.c:157
#3 0x00007f37a01870f5 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#4 0x00007f37a1b70dc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f37a189f76d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f37a3095740 (LWP 11707)):
#0 0x00007f37a1894e2d in poll () from /lib64/libc.so.6
#1 0x00007f37a016204c in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#2 0x00007f37a016216c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#3 0x00007f37a0ce6555 in event_loop_wait (wait_eh=0x0, nonblock=0, return_when_empty=0) at event.c:427
#4 0x00007f37a0ce6958 in event_loop_run () at event.c:336
#5 0x00007f3799f6b8a5 in run_c () at Amanda/MainLoop.c:1545
#6 0x00007f3799f6b900 in _wrap_run_c (my_perl=<optimized out>, cv=<optimized out>) at Amanda/MainLoop.c:1981
#7 0x00007f37a2bb439f in Perl_pp_entersub (my_perl=0xdf7010) at pp_hot.c:2778
#8 0x00007f37a2bacb16 in Perl_runops_standard (my_perl=0xdf7010) at run.c:41
#9 0x00007f37a2b49925 in S_run_body (oldscope=<optimized out>, my_perl=<optimized out>) at perl.c:2402
---Type <return> to continue, or q <return> to quit---
#10 perl_run (my_perl=0xdf7010) at perl.c:2320
#11 0x0000000000400d99 in main (argc=7, argv=0x7ffe9a070938, env=0x7ffe9a070978) at perlmain.c:120

AlanG
May 30th, 2017, 08:41 AM
And here is the debug log.

AlanG
May 30th, 2017, 08:42 AM
Adding stack trace again as it didn;t seem to upload first time



Thread 4 (Thread 0x7f3798b7d700 (LWP 11709)):
#0 0x00007f37a1894e2d in poll () from /lib64/libc.so.6
#1 0x00007f37a016204c in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#2 0x00007f37a016216c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#3 0x00007f37a01621a9 in glib_worker_main () from /lib64/libglib-2.0.so.0
#4 0x00007f37a01870f5 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#5 0x00007f37a1b70dc5 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f37a189f76d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f3793fff700 (LWP 11760)):
#0 0x00007f37a1899c29 in syscall () from /lib64/libc.so.6
#1 0x00007f37a01a494f in g_cond_wait () from /lib64/libglib-2.0.so.0
#2 0x00007f379ac47a8a in device_thread_wait_for_block (eof_flag=<synthetic pointer>, self=0x31d7080) at xfer-dest-taper-splitter.c:389
#3 device_thread_write_part (self=<optimized out>) at xfer-dest-taper-splitter.c:565
#4 device_thread (data=<optimized out>) at xfer-dest-taper-splitter.c:745
#5 0x00007f37a01870f5 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#6 0x00007f37a1b70dc5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007f37a189f76d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f37937fe700 (LWP 11761)):
#0 0x00007f37a1899c29 in syscall () from /lib64/libc.so.6
#1 0x00007f37a01a494f in g_cond_wait () from /lib64/libglib-2.0.so.0
#2 0x00007f379aea55fa in holding_thread (data=<optimized out>) at xfer-source-holding.c:157
#3 0x00007f37a01870f5 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#4 0x00007f37a1b70dc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f37a189f76d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f37a3095740 (LWP 11707)):
#0 0x00007f37a1894e2d in poll () from /lib64/libc.so.6
#1 0x00007f37a016204c in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#2 0x00007f37a016216c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#3 0x00007f37a0ce6555 in event_loop_wait (wait_eh=0x0, nonblock=0, return_when_empty=0) at event.c:427
#4 0x00007f37a0ce6958 in event_loop_run () at event.c:336
#5 0x00007f3799f6b8a5 in run_c () at Amanda/MainLoop.c:1545
#6 0x00007f3799f6b900 in _wrap_run_c (my_perl=<optimized out>, cv=<optimized out>) at Amanda/MainLoop.c:1981
#7 0x00007f37a2bb439f in Perl_pp_entersub (my_perl=0xdf7010) at pp_hot.c:2778
#8 0x00007f37a2bacb16 in Perl_runops_standard (my_perl=0xdf7010) at run.c:41
#9 0x00007f37a2b49925 in S_run_body (oldscope=<optimized out>, my_perl=<optimized out>) at perl.c:2402
---Type <return> to continue, or q <return> to quit---
#10 perl_run (my_perl=0xdf7010) at perl.c:2320
#11 0x0000000000400d99 in main (argc=7, argv=0x7ffe9a070938, env=0x7ffe9a070978) at perlmain.c:120

martineau
May 30th, 2017, 10:06 AM
Some post are moderated, they must be approved before you can see them.
I need you increase the debugging level to understand what's going on. Can you add the following line in amanda.conf


debug-taper 9


retry an amflush and post the resulting taper debug file.
I would also like to get the value of the mem_ring structure from the thread that is hang in the holding_thread function.
In gdb:

thread 2
up
up
print *self->mem_ring

AlanG
May 31st, 2017, 03:38 AM
It ran for about two hours this time before hanging. Unfortunately the zipped log will not upload, just gives a generic file upload error. So I'm sharing it from Dropbox.

https://www.dropbox.com/s/8cwhjz5vusg1ok7/taper-debug-9.zip?dl=0

The trace



$1 = {write_offset = 1307440, written = 1307440, eof_flag = 0, padding1 = '\000' <repeats 235 times>, read_offset = 0, readx = 0, padding2 = '\000' <repeats 239 times>,
buffer = 0x7f62cc151e00 "Salted__\254\227\022f\377\066\373C\367t\021\032\26 3\071Д\021\220\070A*\227\255\355\343\214\v\240$\30 7\346F\211M\f<\246\346\333\350\037|\242U>+\270\303\373\206\262\366\320\300.\371\325\357Ѝ_\ 200#7;\006\331V\316)\177\241\205\272\313\035\201\2 31\315\004\234\302;\367\367\037\230\312\021\322}2! ^ȯ\031--;", ring_size = 1310720, add_cond = 0x7f62c80312a0, free_cond = 0x7f62c80312b0,
mutex = 0x7f62c8031290, consumer_block_size = 65536, producer_block_size = 32768, consumer_ring_size = 1310720, producer_ring_size = 1048576, data_avail = 62256}

martineau
May 31st, 2017, 09:50 AM
Thanks, the information was useful to find the bug.

Can you try the attached patch? It require you recompile amanda?

AlanG
May 31st, 2017, 01:30 PM
I've applied the patch - I'll let you know in a couple of days if it seems to have done the trick. Thanks for the quick turnaround!

AlanG
June 5th, 2017, 01:30 AM
Just to confirm the issue is now fixed by the patch. Thanks again for your quick support.

martineau
June 5th, 2017, 07:59 AM
Thanks for testing the patch.
I committed the fix.

tinycc
June 29th, 2017, 12:53 AM
I also faced same Issue Now its Fixed by the Patch Thank you for sharing such a Nice and Informative idea.