December 1st, 2016, 09:09 AM
Bug in when_overwrite() usage in planner.c
I recently upgraded my server from 3.3.2 to 3.4.1. Although backups seem to be proceeding without incident, the mail report contains scores of messages matching the following pattern:
planner: Last full dump of client:disk on tape DAILYNN overwritten in 1 run
My disklist has roughly 110 entries and I received 112 such messages after the first run of 3.4.1 and 87 on the second. Dumps are to vdisk. I'm running RHEL 6.
I tracked this down to planner.c (which recently had changes to the relevant routines, per the ChangeLog). It looks like when_overwrite() is called to determine whether a dump will be overwritten within the next 5 days. Its return value is used verbatim, and since it is often 1 this is duly logged as a warning. However, when_overwrite() also returns 1 to report certain warning or error conditions. These appear to stem from lookup failures against the tape database, but I haven't dug deeply enough to really understand what's going on.
It seems to me that when_overwrite() ought to do a better job of reporting the errors it encounters, or else error conditions ought to be checked in some other fashion. I'm happy to provide further information, but perhaps this is enough to identify an obvious problem. Given the reduced number of messages tonight, I'm also inclined to suspect that once every DLE has had a level 0 performed by 3.4.1, whatever database issue there might be will resolve itself. But it's an obnoxious and unnerving bug...
Also, it's really not clear which support venues are active these days. Forgive me (and please advise me) if I should really be posting this elsewhere.
Thanks for any help.
December 1st, 2016, 09:39 AM
Can you try the untested attached patch?
You can test it by running only the planner:
/path/.../to/.../planner CONFIG --log-filename /tmp/planner.log
Then look at /tmp/planner.log; there should be no warnings.
December 1st, 2016, 10:25 AM
The build is currently fighting me. Why the hell does Amanda need glib? And --disable-glibtest doesn't work. RHEL 6 includes only glib 1.2, and the configure test indicates it wants 2.2. Any idea how to get around this idiotic requirement? I certainly don't need any graphical components.
December 1st, 2016, 10:28 AM
Please report the exact error you get.
I think you need to install glib2 and glib2-devel
December 1st, 2016, 10:35 AM
Oh, right. glib2. I'm still irritated at how Gnome has infiltrated everything. Why does Amanda require the Gnome libraries? Is there some graphical component I'm unaware of?
The build is now proceeding.
December 1st, 2016, 10:50 AM
glib2 is not only for graphics; it provides a lot of useful utilities.
December 1st, 2016, 11:15 AM
Doesn't look good.
-bash-4.1$ /usr/local/amanda/3.4.1p1/libexec/amanda/planner DAILY --log-filename /tmp/planner.log.3.4.1p1
planner: pid 2851 executable /usr/local/amanda/3.4.1p1/libexec/amanda/planner version 3.4.1
planner: build: VERSION="Amanda-3.4.1"
planner: BUILT_DATE="Thu Dec 1 12:43:54 MST 2016" BUILT_MACH=""
planner: BUILT_REV="6945" BUILT_BRANCH="tags" CC="gcc"
planner: paths: bindir="/usr/local/amanda/3.4.1p1/bin"
planner: AMANDA_TMPDIR="/tmp/amanda" AMANDA_DBGDIR="/tmp/amanda"
planner: DEV_PREFIX="/dev/" DUMP="/sbin/dump"
planner: RESTORE="/sbin/restore" VDUMP=UNDEF VRESTORE=UNDEF
planner: XFSDUMP="/sbin/xfsdump" XFSRESTORE="/sbin/xfsrestore"
planner: VXDUMP=UNDEF VXRESTORE=UNDEF
planner: SAMBA_CLIENT="/usr/bin/smbclient" GNUTAR="/bin/tar"
planner: COMPRESS_PATH="/bin/gzip" UNCOMPRESS_PATH="/bin/gzip"
planner: LPRCMD=UNDEF MAILER=UNDEF
planner: defs: DEFAULT_SERVER="jilamanda.colorado.edu"
planner: DEFAULT_TAPE_DEVICE="" NEED_STRSTR AMFLOCK_POSIX
planner: AMFLOCK_FLOCK AMFLOCK_LOCKF AMFLOCK_LNLOCK SETPGRP_VOID
planner: AMANDA_DEBUG_DAYS=4 BSD_SECURITY USE_AMANDAHOSTS
planner: CLIENT_LOGIN="amandabackup" CHECK_USERID HAVE_GZIP
planner: COMPRESS_SUFFIX=".gz" COMPRESS_FAST_OPT="--fast"
planner: COMPRESS_BEST_OPT="--best" UNCOMPRESS_OPT="-dc"
planner: CONFIGURE_ARGS=" '--prefix=/usr/local/amanda/3.4.1p1' '--without-rest-server' '--with-group=disk' '--with-user=amandabackup'"
READING CONF INFO...
planner: timestamp 20161201131505
planner: tape_length is set from maxdumpsize (768000000 KB)
planner: time 0.125: startup took 0.125 secs
Cleaning up holding disk '/holdA'
Cleaning up holding disk '/holdB'
SETTING UP FOR ESTIMATES...
planner: time 0.156: setting up estimates for ancilla-2:/dev/sda1
December 1st, 2016, 12:59 PM
I do not understand why this patch fails with a segmentation fault.
Maybe the problem is with the build; can you try building without the patch to see if it succeeds?
December 1st, 2016, 01:43 PM
A build without the patch using identical configure flags seems to work. The planner is currently running and has passed the point at which it was breaking.
Incidentally, I couldn't determine why I wasn't getting a core dump. I can generate a core dump with an appropriate C program, so they're not disabled in any system-wide or user-wide way. But nothing from Amanda. Any ideas? I could provide a stack backtrace if I could get a core file.
December 8th, 2016, 09:44 AM
I encountered two additional bugs and have rolled back to 3.3.9.
The first is a "HEADER_DONE not set" error on a single dataset of a ZFS pool. We're dumping in the neighborhood of 20 small datasets (one per user) on a server running zfsonlinux, and each run resulted in precisely one of them failing with the "HEADER_DONE not set" error. It was not the same dataset each time. I've never encountered this error before, and digging into the code it wasn't immediately apparent why it was happening. The error did not occur with last night's 3.3.9 run.
The other error, which was a (literal) showstopper, was a taper hang. The run two nights ago never finished. At the time of investigation there were 8 idle dumper threads and 3 taper threads, with the driver reporting that the taper was writing. Two of the tapers were in a futex wait channel and the third was in a poll_s... wchan. I ultimately sent SIGTERM to this last taper, and Amanda immediately wrapped things up with a taper protocol error. The logs didn't report anything unusual prior to the hang, and there's no evidence of any I/O problems with the hardware or network. And the problem did not occur with 3.3.9.
So it seems 3.4 is not ready for primetime, at least not for us.