[FreeBSD 7.1-STABLE] amtape operations hang



glowkrantz
January 17th, 2009, 01:48 PM
When I run some amtape operations, like taper, I get a hang in Amanda/MainLoop/libMainLoop.so. It seems to be related to glib and pthreads.


Command line: amtape lule2 taper
Call: /usr/bin/perl5 /usr/local/libexec/amanda/chg-glue lule2

I have tested with both Perl 5.8.8 and 5.8.9.

Backtrace is attached.

This can be worked around by building a threaded perl, but that may not be possible for all users.

/glz

dustin
January 17th, 2009, 02:26 PM
Hmm... From what I can see in the backtrace, this is not a bug where perl is called in a thread (that would be bad!), but it perhaps results from linking code that expects to be single-threaded (perl) with code that uses threads (glib). Some versions of glib have a child_watch implementation with race conditions that could trigger this kind of hang. What version of glib are you using?

I know that FreeBSD's threading is quite a bit different from other systems, so any other suggestions you can offer as to why building a threaded perl is an effective workaround would be helpful.

glowkrantz
January 18th, 2009, 04:26 AM
Hi Dustin,

Here is the list of ports used by Amanda:
# pkg_info -rx amanda
Information for amanda-devel-2.6.1b2.20081222:

Depends on:
Dependency: mtx-1.3.11
Dependency: python25-2.5.2_3
Dependency: perl-threaded-5.8.9
Dependency: pkg-config-0.23_1
Dependency: pcre-7.8
Dependency: libiconv-1.11_1
Dependency: gettext-0.17_1
Dependency: glib-2.18.4
Dependency: gamin-0.1.10
Dependency: gio-fam-backend-2.18.4
Dependency: lzo2-2.03_2
Dependency: lzop-1.02.r1
Dependency: lzmautils-4.32.7
Dependency: gtar-1.21


Attached are the lists of linked libraries for threaded and non-threaded perl. The only difference I see is that in the threaded version, /lib/libthr.so.3 is linked into the main perl executable.

It only seems to happen on the server side: the only failing command I have encountered so far is amtape, and my clients seem to work with non-threaded perl. The clients back up using zfs send, GNU tar, and FreeBSD dump.

By the way, what is the norm on the Linux systems where you build Amanda? Looking around, some distributions seem to default to the threaded version.

/glz

dustin
January 18th, 2009, 08:48 AM
glib-2.18 is fairly new, so my race-condition hypothesis is less likely.

Various Linux distributions ship threaded or non-threaded perls, but the thread library is invariably libpthread (rather than libthr or, worse, a distinct C library like libc_r), which does not require all code in the process to be compiled thread-aware. That is, it's fine to use threads from C in a non-threaded perl process, as long as you don't call any perl functions from a thread.
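A minimal sketch of that rule in plain C. It does not use the real Perl API: interp_record_result is a hypothetical stand-in for any interpreter call that is not thread-safe in a non-threaded build. The helper thread only does C work and hands its result back through a pipe, and only the thread that owns the "interpreter" ever calls into it.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical stand-in for a non-thread-safe interpreter call. Only the
 * thread recorded in interp_owner may call it. */
static pthread_t interp_owner;
static long interp_total;

static void interp_record_result(long value)
{
    assert(pthread_equal(pthread_self(), interp_owner) &&
           "interpreter called from a helper thread");
    interp_total += value;
}

struct job {
    long n;  /* input for the pure-C computation */
    int wfd; /* write end of the pipe back to the owning thread */
};

/* Helper thread: pure C work only, no interpreter calls. */
static void *worker(void *arg)
{
    struct job *j = arg;
    long sum = 0;
    for (long i = 1; i <= j->n; i++)
        sum += i;
    /* A short write shows up as a short read in run_job. */
    (void)write(j->wfd, &sum, sizeof sum);
    return NULL;
}

/* Runs one job on a helper thread, then feeds the result to the
 * interpreter from the calling thread. Returns 0 on success, -1 on error. */
static int run_job(long n)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    interp_owner = pthread_self();
    struct job j = { n, fds[1] };
    pthread_t tid;
    if (pthread_create(&tid, NULL, worker, &j) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    /* Joining first keeps the sketch short; a real main loop would
     * watch fds[0] and keep servicing other events meanwhile. */
    pthread_join(tid, NULL);
    close(fds[1]);

    long result;
    ssize_t got = read(fds[0], &result, sizeof result);
    close(fds[0]);
    if (got != (ssize_t)sizeof result)
        return -1;

    interp_record_result(result); /* owning thread only */
    return 0;
}
```

For example, run_job(10) computes 1 + ... + 10 on the helper thread and records 55 from the main thread. Moving the interp_record_result call into worker would trip the assertion, which is the moral equivalent of calling perl from a thread.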

Digging into the backtrace a little bit, it looks like pthread_create ends up calling calloc, which, in trying to lock its arena, calls some mutex functions, which in turn call calloc. I see this in thr_mutex.c:

/* This function is used internally by malloc. */
int
_pthread_mutex_init_calloc_cb(pthread_mutex_t *mutex,
void *(calloc_cb)(size_t, size_t))
{

which suggests that there is some funny business that should be going on to make libc and libthr play nicely together. I can't quite trace the logic from _pthread_mutex_trylock through _destroy and _mutexattr_init, so it's not entirely clear what's going wrong.
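The kind of cooperation that a calloc_cb parameter appears to exist for can be sketched in isolation. This is a hypothetical toy, not libc's or libthr's code: toy_calloc, lazy_lock and bootstrap_calloc are invented names. The lock guarding the toy allocator gets its memory from a separate bootstrap allocator, so setting up the lock never calls back into toy_calloc. Passing toy_calloc itself as the callback would recreate a calloc -> mutex -> calloc cycle like the one in the backtrace.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* calloc-shaped callback, like calloc_cb in the libthr excerpt. */
typedef void *(*calloc_fn)(size_t, size_t);

/* Hypothetical lock whose storage is allocated when it is initialized. */
typedef struct {
    pthread_mutex_t *impl;
} lazy_lock;

/* Bootstrap allocator: bump-allocates zeroed memory from a static buffer
 * and never calls the main allocator. Memory is never freed. */
static _Alignas(max_align_t) unsigned char bootstrap_buf[1024];
static size_t bootstrap_used;

static void *bootstrap_calloc(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL; /* n * size would overflow */
    size_t bytes = n * size;
    size_t align = _Alignof(max_align_t);
    size_t start = (bootstrap_used + align - 1) / align * align;
    if (start > sizeof bootstrap_buf || bytes > sizeof bootstrap_buf - start)
        return NULL;
    bootstrap_used = start + bytes;
    return bootstrap_buf + start; /* static storage starts out zeroed */
}

static int lazy_lock_init(lazy_lock *l, calloc_fn cb)
{
    l->impl = cb(1, sizeof *l->impl);
    if (l->impl == NULL)
        return -1;
    return pthread_mutex_init(l->impl, NULL) == 0 ? 0 : -1;
}

static lazy_lock arena_lock;
static pthread_once_t arena_once = PTHREAD_ONCE_INIT;
static int arena_init_rc = -1;
static _Thread_local int inside_toy_calloc; /* per-thread re-entrancy check */

static void arena_init(void)
{
    /* The crucial choice: the lock's memory comes from bootstrap_calloc.
     * Passing toy_calloc here would re-enter toy_calloc before its lock
     * exists, and the assertion below would fire. */
    arena_init_rc = lazy_lock_init(&arena_lock, bootstrap_calloc);
}

static void *toy_calloc(size_t n, size_t size)
{
    assert(!inside_toy_calloc && "toy_calloc re-entered itself");
    inside_toy_calloc = 1;
    void *p = NULL;
    pthread_once(&arena_once, arena_init);
    if (arena_init_rc == 0) {
        pthread_mutex_lock(arena_lock.impl);
        p = calloc(n, size); /* stand-in for carving memory from an arena */
        pthread_mutex_unlock(arena_lock.impl);
    }
    inside_toy_calloc = 0;
    return p;
}
```

The re-entrancy assertion only detects the cycle. What actually prevents it is structural: initializing the lock never depends on the allocator that the lock protects.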

At a higher level, this looks like one of two things:
1. A bug in libc/libthr.
2. Amanda, or your build of Amanda, is doing something that is known not to work on FreeBSD, and which triggers this bug-like behavior.

I have no idea how to distinguish those two things. If it's #2, what are we doing wrong, and how can we fix it?

glowkrantz
January 18th, 2009, 09:18 AM
I found this one:
http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/stdlib/malloc.c?rev=1.183;content-type=text%2Fx-cvsweb-markup

I will try to backport it to 7.1 and also try to get hold of a second tape station so I can test on an 8-CURRENT system.

/glz

dustin
January 22nd, 2009, 12:08 PM
Any updates?

glowkrantz
January 23rd, 2009, 02:41 PM
I backported the patch and it didn't help, so I have created a jail and built non-stripped debug versions of Amanda and its dependencies, and I will do some debugging this weekend.

As real life(tm) is interfering, it may be next week before I know more.

/glz

glowkrantz
February 3rd, 2009, 09:22 PM
The official FreeBSD ports maintainer, Jun Kuriyama, got it working with the attached patch.

It seems that on FreeBSD the race protection must be enabled all the time.

/glz

PS. For general consumption, maybe change the 1 to __FreeBSD__ if this truly is not a problem on other platforms.

PPS. Now tested on FreeBSD 6.4 and 7.1.
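The PS presumably refers to an unconditional `#if 1` guard in the attached patch, which is not reproduced in this thread. A minimal sketch of the suggested conditional, with a hypothetical macro and function name: `#if defined(__FreeBSD__)` is true only where the compiler predefines __FreeBSD__ (FreeBSD's system compiler defines it to the major release number, so plain `#if __FreeBSD__` would also work).

```c
#include <stdbool.h>

/* Hypothetical names; the actual Amanda patch is not shown here.
 * Always-on race protection on FreeBSD, previous behavior elsewhere. */
#if defined(__FreeBSD__)
#define ALWAYS_PROTECT_AGAINST_RACE 1
#else
#define ALWAYS_PROTECT_AGAINST_RACE 0
#endif

/* Returns whether the race protection should be switched on. On FreeBSD
 * it is unconditional; elsewhere it follows default_setting. */
static bool want_race_protection(bool default_setting)
{
    return ALWAYS_PROTECT_AGAINST_RACE || default_setting;
}
```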

dustin
February 4th, 2009, 08:22 AM
Awesome! We certainly haven't seen this kind of failure on other systems, so perhaps newer versions of glib make some assumption about child signals or other racy conditions that doesn't hold on FreeBSD. I'm happy to apply a __FreeBSD__ conditional and call it a day.