
View Full Version : Bug in when_overwrite() usage in planner.c



ckeasling
December 1st, 2016, 10:09 AM
I recently upgraded my server from 3.3.2 to 3.4.1. Although backups seem to be proceeding without incident, the mail report has scores of messages per the following pattern:

planner: Last full dump of client:disk on tape DAILYNN overwritten in 1 run

My disklist has roughly 110 entries and I received 112 such messages after the first run of 3.4.1 and 87 on the second. Dumps are to vdisk. I'm running RHEL 6.

I tracked this down to planner.c (which recently had changes to the relevant routines, per the ChangeLog). It looks like when_overwrite() is being called to determine whether a dump is due to be overwritten within the next 5 days. Its return value is used verbatim, and since it is often 1 this is duly logged as a warning. However, when_overwrite() will also return 1 to report certain warning or error conditions. These appear to stem from lookup failures against the tape database, but I haven't done enough digging to really understand what's going on.
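
To make the ambiguity concrete, here's a rough, self-contained toy model of the pattern as I read it (simplified, hypothetical C, not the actual planner.c source; the names and the runs calculation are stand-ins):

/* Toy model: when_overwrite() uses the same return value for
 * "overwritten in N runs" and for a failed tape-database lookup,
 * so the caller cannot tell the two apart. */
#include <stdio.h>

struct tape { int position; };

/* Stand-in for the tape-database lookup; returns NULL on failure. */
static struct tape *lookup_tapelabel(const char *label)
{
    (void)label;
    return NULL;               /* simulate a failed lookup */
}

static int when_overwrite(const char *label, int tapecycle, int runtapes)
{
    struct tape *tp = lookup_tapelabel(label);

    if (tp == NULL)
        return 1;              /* error collapsed into "1 run" */

    return (tapecycle - tp->position) / runtapes;   /* genuine answer */
}

int main(void)
{
    int runs = when_overwrite("DAILY07", 30, 1);

    /* The caller takes the value at face value and warns: */
    if (runs <= 5)
        printf("Last full dump of client:disk on tape DAILY07 "
               "overwritten in %d run%s\n", runs, runs == 1 ? "" : "s");
    return 0;
}

If the error path returned something distinct (say -1), or reported the failure itself, the caller could tell a genuine "overwritten in 1 run" apart from a failed lookup.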

It seems to me that when_overwrite() ought to do a better job of reporting the errors it encounters, or else error conditions ought to be checked in some other fashion. I'm happy to provide further information, but perhaps this is enough to identify an obvious problem. Given the reduced number of messages tonight, I'm also inclined to suspect that once every DLE has had a level 0 performed by 3.4.1, whatever database issue there might be will resolve itself. But it's an obnoxious and unnerving bug...

Also, it's really not clear which support venues are active these days. Forgive me (and please advise me) if I should really be posting this elsewhere.

Thanks for any help.

Corey

martineau
December 1st, 2016, 10:39 AM
Can you try the untested attached patch?

You can test it by running only the planner:
/path/.../to/.../planner CONFIG --log-filename /tmp/planner.log

Then look at /tmp/planner.log; there should be no warnings.

ckeasling
December 1st, 2016, 11:25 AM
The build is currently fighting me. Why the hell does amanda need glib? And --disable-glibtest doesn't work. RHEL 6 includes only glib 1.2, and the configure test indicates it wants 2.2. Any idea how to get around this idiotic requirement? I certainly don't need any graphical components.

martineau
December 1st, 2016, 11:28 AM
Please report the exact error you get.

I think you need to install glib2 and glib2-devel.

ckeasling
December 1st, 2016, 11:35 AM
Oh, right. glib2. I'm still irritated at how Gnome has infiltrated everything. Why does Amanda require the Gnome libraries? Is there some graphical component I'm unaware of?

The build is now proceeding.

martineau
December 1st, 2016, 11:50 AM
glib2 is not only for graphics; it has a lot of useful things.
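
For example, the non-GUI parts of GLib provide portable data structures and string helpers that a plain C program can use; a small standalone illustration (not Amanda code):

/* Minimal illustration of non-graphical GLib facilities. */
#include <glib.h>
#include <stdio.h>

int main(void)
{
    /* Hash table keyed by string, with automatic key cleanup. */
    GHashTable *sizes = g_hash_table_new_full(g_str_hash, g_str_equal,
                                              g_free, NULL);
    g_hash_table_insert(sizes, g_strdup("client:/home"),
                        GSIZE_TO_POINTER(1048576));

    /* Safe printf-style string allocation. */
    gchar *msg = g_strdup_printf("%u DLEs tracked",
                                 g_hash_table_size(sizes));
    puts(msg);

    g_free(msg);
    g_hash_table_destroy(sizes);
    return 0;
}

It builds against the glib2-devel headers with something like: gcc demo.c $(pkg-config --cflags --libs glib-2.0)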

ckeasling
December 1st, 2016, 12:15 PM
Doesn't look good.

-bash-4.1$ /usr/local/amanda/3.4.1p1/libexec/amanda/planner DAILY --log-filename /tmp/planner.log.3.4.1p1
planner: pid 2851 executable /usr/local/amanda/3.4.1p1/libexec/amanda/planner version 3.4.1
planner: build: VERSION="Amanda-3.4.1"
planner: BUILT_DATE="Thu Dec 1 12:43:54 MST 2016" BUILT_MACH=""
planner: BUILT_REV="6945" BUILT_BRANCH="tags" CC="gcc"
planner: paths: bindir="/usr/local/amanda/3.4.1p1/bin"
planner: sbindir="/usr/local/amanda/3.4.1p1/sbin"
planner: libexecdir="/usr/local/amanda/3.4.1p1/libexec"
planner: amlibexecdir="/usr/local/amanda/3.4.1p1/libexec/amanda"
planner: mandir="/usr/local/amanda/3.4.1p1/share/man"
planner: AMANDA_TMPDIR="/tmp/amanda" AMANDA_DBGDIR="/tmp/amanda"
planner: CONFIG_DIR="/usr/local/amanda/3.4.1p1/etc/amanda"
planner: DEV_PREFIX="/dev/" DUMP="/sbin/dump"
planner: RESTORE="/sbin/restore" VDUMP=UNDEF VRESTORE=UNDEF
planner: XFSDUMP="/sbin/xfsdump" XFSRESTORE="/sbin/xfsrestore"
planner: VXDUMP=UNDEF VXRESTORE=UNDEF
planner: SAMBA_CLIENT="/usr/bin/smbclient" GNUTAR="/bin/tar"
planner: COMPRESS_PATH="/bin/gzip" UNCOMPRESS_PATH="/bin/gzip"
planner: LPRCMD=UNDEF MAILER=UNDEF
planner: listed_incr_dir="/usr/local/amanda/3.4.1p1/var/amanda/gnutar-lists"
planner: defs: DEFAULT_SERVER="jilamanda.colorado.edu"
planner: DEFAULT_CONFIG="DailySet1"
planner: DEFAULT_TAPE_SERVER="amanda.XXX"
planner: DEFAULT_TAPE_DEVICE="" NEED_STRSTR AMFLOCK_POSIX
planner: AMFLOCK_FLOCK AMFLOCK_LOCKF AMFLOCK_LNLOCK SETPGRP_VOID
planner: AMANDA_DEBUG_DAYS=4 BSD_SECURITY USE_AMANDAHOSTS
planner: CLIENT_LOGIN="amandabackup" CHECK_USERID HAVE_GZIP
planner: COMPRESS_SUFFIX=".gz" COMPRESS_FAST_OPT="--fast"
planner: COMPRESS_BEST_OPT="--best" UNCOMPRESS_OPT="-dc"
planner: CONFIGURE_ARGS=" '--prefix=/usr/local/amanda/3.4.1p1' '--without-rest-server' '--with-group=disk' '--with-user=amandabackup'"
READING CONF INFO...
DATE 20161201131505
planner: timestamp 20161201131505
planner: tape_length is set from maxdumpsize (768000000 KB)
planner: time 0.125: startup took 0.125 secs

SENDING FLUSHES...
Cleaning up holding disk '/holdA'
Cleaning up holding disk '/holdB'
ENDFLUSH
ENDFLUSH

SETTING UP FOR ESTIMATES...
planner: time 0.156: setting up estimates for ancilla-2:/dev/sda1
Segmentation fault

martineau
December 1st, 2016, 01:59 PM
I do not understand why this patch fails with a segmentation fault.

Maybe the problem is with the build; can you try building without the patch, to see if it succeeds?

ckeasling
December 1st, 2016, 02:43 PM
A build without the patch using identical configure flags seems to work. The planner is currently running and has passed the point at which it was breaking.

Incidentally, I couldn't determine why I wasn't getting a core dump. I can generate a core dump with an appropriate C program, so they're not disabled in any system- or user-wide way. But nothing from Amanda. Any ideas? I could provide a stack backtrace if I could get a core file.

ckeasling
December 8th, 2016, 10:44 AM
I encountered two additional bugs and have rolled back to 3.3.9.

The first is a HEADER_DONE not set error on a single dataset of a ZFS pool. We're dumping in the neighborhood of 20 small datasets (one per user) on a server running zfsonlinux, and each run resulted in precisely one of these failing with the HEADER_DONE not set error. It was not the same dataset each time. I've never encountered this error before, and digging into the code it wasn't immediately apparent why this was happening. The error did not occur with last night's 3.3.9 run.

The other error, which was a (literal) showstopper, was a taper hang. The run two nights ago never finished. At the time of investigation there were 8 idle dumper threads and 3 taper threads, with the driver reporting that the taper was writing. Two of the tapers were in a futex wait channel and the third was in a poll_s... wchan. I ultimately sent sigterm to this last taper and amanda immediately wrapped things up with a taper protocol error. The logs didn't report anything unusual prior to the hang and there's no evidence of any i/o problems with the hardware or network. And the problem did not occur with 3.3.9.

So it seems 3.4 is not ready for primetime, at least not for us.

trelyfuly
December 9th, 2016, 01:10 AM

So I think you should use 3.3.2, to avoid trouble.

martineau
December 9th, 2016, 07:16 AM
Which application are you using to do the backup? amzfs-sendrecv?
Are you using amanda-3.4 on both server and client?

Can you post the dumper.<datestamp>.debug file from the server, and the amandad, sendbackup and amzfs-sendrecv debug files from the client?
The taper hang can be related to the first issue. Are you using a holding disk, or dumping directly to tape?