View Full Version : Seeing some amandad processing stuck in FUTEX_WAIT...

Andrew Rakowski
March 21st, 2007, 03:23 PM
After realizing that if the Amanda server processes die for some reason (out of resources, etc.) there are occasional tar processes left "spinning" in tight CPU-sucking loops on the clients, I decided to look for any Amanda processes more than 24 hours old. I found several "amandad" processes lying around, but not "spinning". They were waiting on a FUTEX_WAIT call, like this one:

mgr@bigbox 6# psg aman
25263 16920 7848 0 Mar20 ? 00:00:00 amandad -auth=bsdtcp amdump
25263 27396 7848 0 01:25 ? 00:00:00 amandad -auth=bsdtcp amdump
mgr@bigbox 7#
mgr@bigbox 7# rpm -qa | grep -i amanda
mgr@bigbox 8#
mgr@bigbox 8# strace -f -p 27396
Process 27396 attached - interrupt to quit
[ Process PID=27396 runs in 32 bit mode. ]
futex(0x972840, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 27396 detached
mgr@bigbox 9#
mgr@bigbox 9# cat /etc/redhat-release
Red Hat Enterprise Linux WS release 4 (Nahant Update 4)
mgr@bigbox 10#
mgr@bigbox 10# uname -a
Linux bigbox 2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:13:42 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
mgr@bigbox 11#

While this particular system is a 64-bit Linux box, I also found it on 32-bit systems. I didn't see any errors, and backups had run successfully.

I *think* these are all systems that have lots of filesystem space, even though much of it is NOT being backed up. For instance, system "bigbox" has about 2 TB of disk mounted, although my disklist only backs up / and /home (about 30 GB in use out of about 40 GB total). I wonder if the initial "tar" that does the estimating might be trawling through all the filesystems without honoring the excludes (for instance, the 1.1 TB /scratch filesystem).
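For what it's worth, here's a quick way to sanity-check whether a gtar exclude pattern actually takes effect (the paths are made up for illustration, and this is not Amanda's exact estimate invocation, just plain GNU tar):

```shell
# Build a throwaway tree, archive it with an exclude, and confirm the
# excluded directory never makes it into the archive listing.
tmp=$(mktemp -d)
mkdir -p "$tmp/home" "$tmp/scratch"
echo small > "$tmp/home/file"
echo huge  > "$tmp/scratch/file"

# Archive the tree, excluding ./scratch the way a disklist exclude would.
tar -cf "$tmp.tar" -C "$tmp" --exclude='./scratch' .

tar -tf "$tmp.tar"     # lists ./home/file but not ./scratch/file
rm -rf "$tmp" "$tmp.tar"
```

If /scratch still shows up in a listing like that, the estimate pass really is walking it.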

Anyhow, I thought I'd make you folks aware of this issue - just in case someone is forgetting to stop a waiter after killing a child process somewhere. Oh, and the amandad processes are all children of the current xinetd process.



Andrew Rakowski
April 2nd, 2007, 03:53 PM
I was wondering if anyone has looked any further into this problem with "amandad" processes hanging around forever. I ended up writing a lame script that looks for any processes owned by the amandabackup user that are more than 24 hours old, so I know if I need to head over to that system and kill the processes.
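In case anyone wants to do the same, the check itself is simple enough. Here's a rough sketch of the idea (it assumes a procps ps new enough to support the etimes column, which reports elapsed time in plain seconds; older versions only have etime and need parsing, and the user name and threshold are just placeholders):

```shell
# Print processes belonging to the backup user that have been running
# longer than maxage seconds. The filtering is split into a function so
# it can be fed canned input for testing.
filter_stale() {
    maxage=$1; user=$2
    while read pid puser secs comm; do
        [ "$puser" = "$user" ] || continue
        if [ "$secs" -gt "$maxage" ]; then
            echo "stale: pid=$pid comm=$comm age=${secs}s"
        fi
    done
}

# Real invocation: pid, user, elapsed seconds, command name, no headers.
ps -eo pid=,user=,etimes=,comm= | filter_stale 86400 amandabackup
```

Anything it prints is a candidate for a closer look with strace.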

As of this morning, I found that 41 of my 188 Amanda-running systems (464 DLEs total) had at least one process left behind by Amanda. All appear to be Linux systems. Of those, 10 also had "spinning" gtar processes, where some child shell process of gtar was lying around "defunct", and an strace shows gtar itself spinning endlessly on "SIGPIPE (Broken pipe)...EPIPE".

Since these leftover processes have been accumulating since late February, I'm not sure how often this happens. I'm killing off all the leftover processes manually today, and I'll see if I can discern any pattern (very large DLEs, numerous files, etc.).
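In case it helps anyone doing the same cleanup, this is roughly what I'm doing, wrapped in a function with a dry-run mode so you can see what would be killed before committing (pkill ships with procps; only run the real mode when you're sure no legitimate backup is in flight):

```shell
# Terminate, then force-kill, everything owned by a given user.
cleanup_user_procs() {
    user=$1; mode=$2
    if [ "$mode" = "dry" ]; then
        # Just show what we would run.
        echo "pkill -TERM -u $user"
        echo "pkill -KILL -u $user"
    else
        pkill -TERM -u "$user"
        sleep 5                      # give processes a chance to exit cleanly
        pkill -KILL -u "$user"       # then force-kill any stragglers
    fi
}

cleanup_user_procs amandabackup dry
```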

Meanwhile, the developers might want to take a quick look at the code that kills child processes, to see whether any signals that should be handled are being ignored, for instance. More news if I discover anything.
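On the off chance it's useful, the general pattern for not leaving children behind looks something like this at the shell level (a toy sketch only, with made-up PIDs and sleeps standing in for real workers; I have no idea whether amandad's actual code has this problem):

```shell
# Toy demonstration: a parent shell registers an EXIT trap that kills
# its recorded children, so nothing is left waiting if the parent dies
# or exits early.
sh -s <<'EOF'
trap 'kill $pids 2>/dev/null' EXIT TERM INT
sleep 12345 & pids=$!               # stand-in for a real worker child
echo "$pids" > /tmp/demo_child.pid  # record the child PID for inspection
exit 0
EOF
# When the inner shell exits, the trap fires and the sleep is killed.
```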