PDA

View Full Version : tape spanning issue with large DLE



Mike McConnell
November 3rd, 2008, 10:21 AM
Sorry for the long post in advance.

I'm having problems with tape spanning of very large DLEs.

Background Info:
client machine: 3.8G Extreme HT x86_64, Centos 5.2 (64bit), 4G Ram, 2.5T SATA II Disk space with 1.5T being backed up
backup server: Dell SC1420, Xeon 2.8G, Centos 5.2 (64bit), 80G disk, 1G Ram
tape: 17 LTO3 cartridge robot with one HP Ultrium tape drive

using amanda 2.6.0p1 from source.

I have amanda running on another setup doing PCs and small servers and it's working fine but I couldn't get the big servers to work so I've removed them from the disklist file of that system. One big difference there is that I have a 500g holding disk so everything goes there before being dumped to tape. These backups are so big – and growing – that I cannot practically keep enough holding space available so I'll run them direct to tape. I've setup another backup server to work with and have been trying to get it to backup up just one of the big servers.

The client machine has 1.5T of data that I need to back up. Amanda runs and fills one tape, request a new tape, writes a small amount of data (more on this later) and then immediately asks for another new tape, writes a small amount etc until it exhausts “runtapes” number of tapes.

I cannot get it to span to the next tape properly and I'm hoping someone can point out my error.

Initially I had followed the recommendation and set the tape_splitsize to 40G which is ~10% of a single tapes capacity (~400G) and got the result mentioned above. After some more digging I saw several posts that suggested limitations related to mmap in 32bit systems and recommended setting tape_splitsize to 4G or even 2G so I reran the backup with 4G even though I'm on 64 bit machines, same result, and then again with 2G, and again same result. This is a bit painful since each run take about 5 and half hours to fill the first tape and then begins cycling through the remaining tapes.

While running I noticed that it never uses any of the split_disk space either, even when it at the end of the first tape and asking for a new tape. Also I'm only attempting to backup one DLE which is the 1.5T server and I'm only using only one dumptype which I set dumpcycle = 0 to force a full backup. The only thing I haven't tried yet is breaking the DLE into smaller pieces but since amanda is already breaking it into tape_splitsize chunks it wasn't obvious that would help.

My amanda.conf file:
inparallel 4
maxdumps 8
tapecycle 40 tapes
device_output_buffer_size 10m

runtapes 5

tapetype OVRL_LTO3_256
ctimeout 30
dtimeout 1800
etimeout 300

flush-threshold-dumped 100
flush-threshold-scheduled 100
taperflush 0

holdingdisk hd1 {
comment "main holding disk"
directory "/space/amandahold"
use 35 gb
chunksize 1 gb # size of chunk to allocate
}

define tapetype OVRL_LTO3_256 {
comment "just produced by tapetype prog (hardware compression off)"
length 373760 mbytes
filemark 0 kbytes
speed 65670 kps
blocksize 256
}


define dumptype global {
index yes
program "GNUTAR"
tape_splitsize 2Gb
split_diskbuffer "/space/splits"
}

define dumptype no-compression {
global
comment "used for big server samba shares"
holdingdisk never
compress none
priority medium
dumpcycle 0
}

Because of the tape_splitsize was set to 2G for this run, the first 183 parts were written to the first tape. Things go bad with part 184.

picking up from the last (183) part written to tape here's the amdump log file:

driver: result time 18768.867 from taper: PARTDONE 00-00001 MCNET-001 183 2097152 "[sec 101.938663 kb 2097152 kps 20572.684968]"

driver: state time 18833.393 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle

driver: interface-state time 18833.393 if default: free 798976

driver: hdisk-state time 18833.393 hdisk 0: free 36700160 dumpers 0

driver: result time 18833.526 from taper: REQUEST-NEW-TAPE 00-00001

driver: send-cmd time 18833.526 to taper: NEW-TAPE

taper: using label `MCNET-002' date `20081102231511'

driver: state time 18931.808 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle

driver: interface-state time 18931.808 if default: free 798976

driver: hdisk-state time 18931.808 hdisk 0: free 36700160 dumpers 0

driver: result time 18931.808 from taper: NEW-TAPE 00-00001 MCNET-002

Kernel gave unexpected write() result of "Invalid argument" on device /dev/nst0.

driver: state time 18976.103 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle

driver: interface-state time 18976.103 if default: free 798976

driver: hdisk-state time 18976.103 hdisk 0: free 36700160 dumpers 0

driver: result time 18976.103 from taper: REQUEST-NEW-TAPE 00-00001

driver: send-cmd time 18976.103 to taper: NEW-TAPE

taper: using label `MCNET-003' date `20081102231511'

driver: state time 19075.856 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle

driver: interface-state time 19075.856 if default: free 798976

driver: hdisk-state time 19075.856 hdisk 0: free 36700160 dumpers 0

driver: result time 19075.856 from taper: NEW-TAPE 00-00001 MCNET-003

Kernel gave unexpected write() result of "Invalid argument" on device /dev/nst0.

From the log file:
PART taper MCNET-001 183 MCS1 /export 20081102231511 183/-1 0 [sec 101.938663 kb 2097152 kps 20572.684968]

PARTPARTIAL taper MCNET-001 184 MCS1 /export 20081102231511 184/-1 0 [sec 53.069631 kb 1117184 kps 21051.286375] ""

INFO taper Will request retry of failed split part.

START taper datestamp 20081102231511 label MCNET-002 tape 2

PARTPARTIAL taper MCNET-002 1 MCS1 /export 20081102231511 184/-1 0 [sec 22.265419 kb 1117440 kps 50187.243276] ""

INFO taper Will request retry of failed split part.

START taper datestamp 20081102231511 label MCNET-003 tape 3

From the taper debug log I snipped from the last sucessful write which was part 183:
1225715280.185300: taper: putresult: 20 PARTDONE

1225715283.017962: taper: Building type 5 (SPLIT_FILE) header of size 262144 using:

1225715283.018017: taper: Contents of *(dumpfile_t *)0x1cae03f0:

it then builds what looks like a good header but fails and goes for a new tape as indicated:
1225715344.718008: taper: putresult: 26 REQUEST-NEW-TAPE

1225715344.851214: taper: getcmd: NEW-TAPE

1225715346.640598: taper: changer_query: changer return was 17 1 1

1225715346.640642: taper: changer_query: searchable = 1

1225715346.640655: taper: changer_find: looking for NULL changer is searchable = 1

"/dev/nst0" uses deprecated device naming convention;
using "tape:/dev/nst0" instead.

1225715427.450930: taper: device_read_label; mode = 0

1225715431.451484: taper: changer_label: MCNET-002 for slot 2

1225715431.960607: taper: changer_query: changer return was 17 1 1

1225715431.960658: taper: changer_query: searchable = 1

1225715431.960671: taper: changer_label: calling changer -label MCNET-002

1225715432.503888: taper: device_read_label; mode = 0

1225715436.351678: taper: device_start mode = 2

1225715439.955615: taper: Building type 1 (TAPESTART) header of size 262144 using:

1225715439.955659: taper: Contents of *(dumpfile_t *)0x1cae1000:

I noticed that it switched from type 5 to type 1 headers which seems reasonable but it just writes type 1 headers to each of the next 4 tapes.

The result is:
1225715344.718008: taper: putresult: 26 REQUEST-NEW-TAPE

1225715344.851214: taper: getcmd: NEW-TAPE

and it keeps cycling to new tapes until it runs out of tapes and then quits.

The Kernal write error seems to be the smoking gun to me but I cannot figure out how to correct it.

If I missed any important data please let me know and I can post as needed.

Thanks for any help.

Mike

Mike McConnell
November 6th, 2008, 09:03 AM
So in desperation I tried breaking the DLE into smaller pieces and as I suspected, it didn't help. From amstatus:

MCS1:/export/1-B 0 238g wait for dumping
MCS1:/export/C-D 0 301g finished (23:41:05)
MCS1:/export/E-H 0 274g wait for dumping
MCS1:/export/I-L 0 206g wait for dumping
MCS1:/export/M-P 0 280g dump to tape failed: dumper: [data write: Broken pipe]
MCS1:/export/Q-Z 0 242g wait for dumping

and from the email summary

FAILURE DUMP SUMMARY:
MCS1 /export/M-P lev 0 FAILED [data write: Broken pipe]
MCS1 /export/E-H lev 0 FAILED [can't dump no-hold disk in degraded mode]
MCS1 /export/Q-Z lev 0 FAILED [can't dump no-hold disk in degraded mode]
MCS1 /export/1-B lev 0 FAILED [can't dump no-hold disk in degraded mode]
MCS1 /export/I-L lev 0 FAILED [can't dump no-hold disk in degraded mode]
MCS1 /export/M-P lev 0 FAILED [can't dump no-hold disk in degraded mode]


STATISTICS:
Total Full Incr.
-------- -------- --------
Estimate Time (hrs:min) 0:00
Run Time (hrs:min) 5:29
Dump Time (hrs:min) 4:18 4:18 0:00
Output Size (meg) 308710.4 308710.4 0.0
Original Size (meg) 308710.4 308710.4 0.0
Avg Compressed Size (%) -- -- --
Filesystems Dumped 1 1 0
Avg Dump Rate (k/s) 20442.4 20442.4 --

Tape Time (hrs:min) 5:09 5:09 0:00
Tape Size (meg) 376181.9 376181.9 0.0
Tape Used (%) 100.6 100.6 0.0
Filesystems Taped 2 2 0

Chunks Taped 188 188 0
Avg Tp Write Rate (k/s) 20756.8 20756.8 --

USAGE BY TAPE:
Label Time Size % Nb Nc
MCNET-001 5:10 367G 100.6 1 184
MCNET-002 0:01 2G 0.5 0 1
MCNET-003 0:01 2G 0.5 0 1
MCNET-004 0:01 2G 0.5 0 1
MCNET-005 0:01 2G 0.5 1 1


FAILED DUMP DETAILS:

/-- MCS1 /export/M-P lev 0 FAILED [data write: Broken pipe]
sendbackup: start [MCS1:/export/M-P level 0]
sendbackup: info BACKUP=/bin/tar
sendbackup: info RECOVER_CMD=/bin/tar -xpGf - ...
sendbackup: info end
\--------


NOTES:
planner: Adding new disk MCS1:/export/1-B.
planner: Adding new disk MCS1:/export/C-D.
planner: Adding new disk MCS1:/export/E-H.
planner: Adding new disk MCS1:/export/I-L.
planner: Adding new disk MCS1:/export/M-P.
planner: Adding new disk MCS1:/export/Q-Z.
taper: Will request retry of failed split part.
taper: Will request retry of failed split part.
taper: Will request retry of failed split part.
taper: Will request retry of failed split part.
taper: Will request retry of failed split part.
driver: Out of tapes; going into degraded mode.


DUMP SUMMARY:
DUMPER STATS TAPER STATS
HOSTNAME DISK L ORIG-GB OUT-GB COMP% MMM:SS KB/s MMM:SS KB/s
-------------------------- ------------------------------------- -------------
MCS1 /export/1-B 0 FAILED --------------------------------------------
MCS1 /export/C-D 0 301 301 -- 257:44 20442.4 252:36 20857.4
MCS1 /export/E-H 0 FAILED --------------------------------------------
MCS1 /export/I-L 0 FAILED --------------------------------------------
MCS1 /export/M-P 0 N/A 66 -- N/A N/A 56:42 20308.5 TAPE-PARTIAL
MCS1 /export/Q-Z 0 FAILED --------------------------------------------

(brought to you by Amanda version 2.6.0p2)

pyeatman
November 6th, 2008, 12:11 PM
Yes, the Kernel write errors are what caught my eye too and yet similarly am not sure what to do with this info.

Do you get the same kernel write error message on the split-up configuration when /export/M-P fails?

Mike McConnell
November 6th, 2008, 01:09 PM
Yes I see the same error.

It made a smooth transition from the first part, picked up and continued on tape MCNET-001 and continued with the next block sequence number from where it left off until it filled the tape, which occurs after block 183 when using 2K blocks. Here's the amdump part of interest picking up from the last good block (183).

driver: result time 18909.454 from taper: PARTDONE 00-00002 MCNET-001 183 2097152 "[sec 104.615566 kb 2097152 kps 20046.271126]"
driver: state time 19021.944 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle
driver: interface-state time 19021.944 if default: free 798976
driver: hdisk-state time 19021.944 hdisk 0: free 36700160 dumpers 0
driver: result time 19022.148 from taper: REQUEST-NEW-TAPE 00-00002
driver: send-cmd time 19022.148 to taper: NEW-TAPE
taper: using label `MCNET-002' date `20081105192309'
driver: state time 19120.440 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle
driver: interface-state time 19120.440 if default: free 798976
driver: hdisk-state time 19120.440 hdisk 0: free 36700160 dumpers 0
driver: result time 19120.440 from taper: NEW-TAPE 00-00002 MCNET-002
Kernel gave unexpected write() result of "Invalid argument" on device /dev/nst0.
driver: state time 19199.963 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle
driver: interface-state time 19199.963 if default: free 798976
driver: hdisk-state time 19199.963 hdisk 0: free 36700160 dumpers 0
driver: result time 19199.963 from taper: REQUEST-NEW-TAPE 00-00002
driver: send-cmd time 19199.963 to taper: NEW-TAPE
taper: using label `MCNET-003' date `20081105192309'

I haven't looked into yet but is there a debug level I can set to get more information?

Mike

pyeatman
November 6th, 2008, 03:10 PM
I believe there is only one debug level.

So was it at the same point of just having started on a second tape when it failed on M-P, ie. the same tape result as when the DLE is not split up? Is the amount it writes on the second tape before failing correlated to tape_splitsize?

Mike McConnell
November 6th, 2008, 04:25 PM
There is some relationship but I'm not sure what to make of it. When I used 20G split size It wrote 8G chunks, when I used 4G is wrote 5G chunks. Since I've only run both single DLE and the 6 part DLE with 2G I only have that data. 2G tape_splits with one large DLE resulted in 1G chunks but 2G tape_splits with a 6 part DLE produced the 3G chunks.

I was careful that the only thing changed from 1 DLE vs 6 part DLE was the disklist. No changes to amanda.conf or the client.

Mike McConnell
November 6th, 2008, 10:31 PM
Update:

I printed and studied the logs and saw something that I had seen before but shrugged off since the other config for small dumps was working.

I had previously fought a battle to get the tape drive setup for 256K blocks but when I looked in the taper.xxx.debug logs I would always see this (and still do even on this run):

1226039632.235909: taper: Building type 1 (TAPESTART) header of size 262144 using:
1226039632.235934: taper: Contents of *(dumpfile_t *)0xa0aec60:
1226039632.235946: taper: type = 1 (TAPESTART)
1226039632.235963: taper: datestamp = '20081106223330'
1226039632.235974: taper: dumplevel = 0
1226039632.235985: taper: compressed = 0
1226039632.235995: taper: encrypted = 0
1226039632.236005: taper: comp_suffix = ''
1226039632.236015: taper: encrypt_suffix = ''
1226039632.236025: taper: name = 'MCNET-001'
1226039632.236035: taper: disk = ''
1226039632.236045: taper: program = ''
1226039632.236055: taper: dumper = ''
1226039632.236065: taper: srvcompprog = ''
1226039632.236075: taper: clntcompprog = ''
1226039632.236085: taper: srv_encrypt = ''
1226039632.236095: taper: clnt_encrypt = ''
1226039632.236104: taper: recover_cmd = ''
1226039632.236114: taper: uncompress_cmd = ''
1226039632.236124: taper: encrypt_cmd = ''
1226039632.236134: taper: decrypt_cmd = ''
1226039632.236144: taper: srv_decrypt_opt = ''
1226039632.236153: taper: clnt_decrypt_opt = ''
1226039632.236163: taper: cont_filename = ''
1226039632.236173: taper: is_partial = 0
1226039632.236183: taper: partnum = 0
1226039632.236193: taper: totalparts = 0
1226039632.236203: taper: blocksize = 32768

Notice that the header size is clearly 262144 (256K) but the blocksize in the label being built is 32768. I just assumed this was the amanda label block size and not the tape block size. Anyway it started me chasing down the path.

I specify blocksize 256 in the tapetype def in my amanda.conf as noted above.

I was checking every entry in the amanda.conf against the man page and noted that the device_property "BLOCK_SIZE" "262144" statement had to agree withe the ./configure option --with-maxtapeblocksize setting. I had the entry in my conf file but had it commented out since it was in the taptype def and something I read somewhere said I didn't need it if it was specified in the tapetype.

I ran amadmin xx version to see what was compiled in but it failed to list anything about this parameter. In checking my notes I saw that I had a typo in the ./configure string I was using. I had typed --with-maxblocksize=256 leaving out the word tape.

Since I was already at the point of desperation I was more than willing to remove, redownload and recompile. Done

So now it's running but I won't know anything for sure for another 5 hours or so. One disappointment is that re-running amadmin xx version still doesn't report anything about the maxtapeblocksize setting. I checked everything in the ./configure string character by character so I know it's in there correctly.

But there are two thoughts to pass along. One is that the ./configure/make process didn't report any errors when fed a bad parameter, assuming that it wasn't smart enough to accept either variation. I know some checking is done but I can't tell how strict. Might be worth looking at. The other thought is that amadmin (or some equivalent) needs an option to dump all of its configuration options because I also notice other options that I know are working are not listed (--with-tcp(udp)portrange for example). This is a valuable debug tool an might even help the amanda team by giving them a tool to fall back on when weird things start to happen.

I'll report back tomorrow am and let you know if I made any progress.

Mike

dustin
November 7th, 2008, 07:31 AM
There was a bug in displaying the blocksize in the headers that has been fixed since 2.6.0p2 was released.

Autoconf-generated configure scripts do not, as a rule, give errors for incorrect arguments. You could argue either side of the point, but the autoconf authors have designed configure this way so that it can pass off unrecognized options to sub-scripts.

It's actually very difficult to get the configure options into a usable form -- there is a lot of shell script that gets added to the beginning.

Finally, the --with-maxtapeblocksize option has been removed, also since the release of 2.6.0p2, and block sizes have, in general, been made a lot simpler.

Mike McConnell
November 7th, 2008, 08:29 AM
That all makes sense. Thanks.

The dump failed with the same Kernel error message :confused:

Mike

dustin
November 7th, 2008, 08:43 AM
Can you post your taper debug logs? The kernel error you're seeing makes no sense to me, either -- EINVAL is usually reserved for invalid inputs to a syscall, but in this case the inputs seem completely valid. My guess is that your kernel is returning this error to indicate that the tape drive isn't yet read to write -- perhaps it is still loading. Usually chg-zd-mtx can catch this by running amdevcheck to read the tape label, but perhaps your OS or tape drive is just weirder than most.

You may want to look at the initial_poll_delay parameter to chg-zd-mtx -- that may give it enough time to settle down and get ready to write your data.

pyeatman
November 7th, 2008, 11:16 AM
I assume you are configuring Amanda for tape blocksize 256k because you know this is what your drive uses? It makes sense that being in the tapetype definition is enough.

Mike McConnell
November 7th, 2008, 04:15 PM
Thanks Dustin. I'll do some more investigation.

Attached in the taper debug log with some of the middle snipped out to reduce its size. One reoccurring entry at the failure point strikes me as something I can dig into tonight.

taper: changer_find: looking for NULL changer is searchable = 1
"/dev/nst0" uses deprecated device naming convention;

I thought I had completely eliminated the "/dev/nst0" entries in favor of "tape:/dev/nst0". Just a nit but might lead me to something. The "looking for NULL ..." is something I'll look into also but don't know quite how to interpret it at the moment.

I'll post anything I find over the weekend.

Cheers,

Mike

Mike McConnell
November 7th, 2008, 04:30 PM
Paul,

I looked at the manufacturers web page and they suggested 256 and other posts I read in the forum suggested improved performance. As I mentioned, 32K buffers actually was a bit better performance but I could clearly hear much more noise from the drive so I opted for the 256. It appears either work and if all else fails I would certainly consider trying 32k buffers. Seems like an easy thing to do now since I apparently don't have to re-compile amanda. Remove one entry in modprobe.conf, change the tapetype, reboot the computer and of course relabel the tapes.

I'll look into Dustin's suggestions and if necessary try 32k buffers too although I might have to compete for bandwidth this weekend.

I appreciate the help and suggestions.

Mike

Mike McConnell
November 12th, 2008, 09:58 AM
Got some very interesting results that makes me believe we're definitely in the hunt but more tuning is required.

I did some investigation into the suggestions Dustin made and found what looked to be a smoking gun. I had the initial_poll_delay set to 20 seconds and the max_drive_wait set to 60 seconds. To validate these numbers I timed the robot to see how much time it actually takes to change a tape. Much to my horror I discovered that it takes about 60 seconds just to change tapes plus another 30 to 35 before the lcd panel on the front reads ready (threading and getting to load point). If amanda was waiting the 60 plus 20 it's entirely possible the tape wasn't ready since I clocked it 5 times in a row to be ~90+ seconds. This could account for the system error of the tape not ready.

So I started a run with only those two parameters changed: initial_poll_delay from 20 to 60 and max_drive_wait from 60 to 120.

The first observation is that the time to write a tape went up dramatically. Using the original parameters it took 5.3 hrs to write out one full tape. I looked in the amdump log and calculated the time from the first block until I saw the Request New Tape entry. Using the same method with these new parameters writing one tape took 11.1 hrs.

However, it did successfully transition from the first tape to the second where it proceeded to write the remaining portion of the second DLE entry. When it filled the second tape it crashed with the same Kernel error message for each of the next three tapes left in runtapes.

So digging into the logs again looking carefully at the transition points where tapes were being changed. From one to two here's the entry I focused in on:

driver: result time 39629.482 from taper: PARTDONE 00-00002 MCNET-001 183 2097152 "[sec 213.594621 kb 2097152 kps 9818.374593]"
driver: state time 39848.753 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle
driver: interface-state time 39848.753 if default: free 798976
driver: hdisk-state time 39848.753 hdisk 0: free 36700160 dumpers 0
driver: result time 39848.870 from taper: REQUEST-NEW-TAPE 00-00002
driver: send-cmd time 39848.870 to taper: NEW-TAPE
taper: using label `MCNET-002' date `20081111162944'
driver: state time 39988.694 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle
driver: interface-state time 39988.694 if default: free 798976
driver: hdisk-state time 39988.694 hdisk 0: free 36700160 dumpers 0
driver: result time 39988.694 from taper: NEW-TAPE 00-00002 MCNET-002
driver: state time 40035.641 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle
driver: interface-state time 40035.641 if default: free 798976
driver: hdisk-state time 40035.641 hdisk 0: free 36700160 dumpers 0
driver: result time 40035.642 from taper: PARTDONE 00-00002 MCNET-002 1 1933829 "[sec 44.103476 kb 1933829 kps 43847.554740]"
driver: state time 40248.174 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle
driver: interface-state time 40248.174 if default: free 798976
driver: hdisk-state time 40248.174 hdisk 0: free 36700160 dumpers 0
driver: result time 40248.309 from taper: PARTDONE 00-00002 MCNET-002 2 2097152 "[sec 211.333224 kb 2097152 kps 9923.437311]"

and the from the second to the third tape:

driver: result time 63047.842 from taper: PARTDONE 00-00002 MCNET-002 108 2097152 "[sec 212.410113 kb 2097152 kps 9873.126898]"
driver: state time 63097.127 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle
driver: interface-state time 63097.127 if default: free 798976
driver: hdisk-state time 63097.127 hdisk 0: free 36700160 dumpers 0
driver: result time 63097.127 from dumper0: DONE 00-00002 293894550 293894550 30321 "[sec 30320.848 kb 293894550 kps 9692.8 orig-kb 293894550]"
driver: finished-cmd time 63097.127 dumper0 dumped MCS1:/export/M-P
Kernel gave unexpected write() result of "Invalid argument" on device /dev/nst0.
driver: state time 63179.363 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle
driver: interface-state time 63179.363 if default: free 798976
driver: hdisk-state time 63179.363 hdisk 0: free 36700160 dumpers 0
driver: result time 63179.363 from taper: REQUEST-NEW-TAPE 00-00002
driver: send-cmd time 63179.363 to taper: NEW-TAPE
taper: using label `MCNET-003' date `20081111162944'
driver: state time 63319.975 free kps: 798976 space: 36700160 taper: writing idle-dumpers: 3 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle: not-idle
driver: interface-state time 63319.975 if default: free 798976
driver: hdisk-state time 63319.975 hdisk 0: free 36700160 dumpers 0
driver: result time 63319.975 from taper: NEW-TAPE 00-00002 MCNET-003
Kernel gave unexpected write() result of "Invalid argument" on device /dev/nst0.
driver: state time 63343.367 free kps: 798976 space: 36700160 taper: writing idle

It appears to me that the amount of time is about the same, ~200 seconds. The only thing I can figure is that I'm on the ragged edge possibly.

From this I'm thinking max_drive_wait needs to get pushed to 180 seconds and initial_poll_delay set back to 20 or even less to speed up the process.

Suggestions and thoughts always welcome.

Thanks to all,

Mike

pyeatman
November 12th, 2008, 11:41 AM
Sounds like a great guess to me. I'm interested to see the [positive] results.

Mike McConnell
November 14th, 2008, 06:04 PM
Success!

The complete 1.6T run spanning 5 tapes finished normally. I haven't had a chance to check the results and try a restore on a couple of files yet but I'd be surprised if there was a problem.

Notes to take away:

The error posted in the amdump log was:

Kernel gave unexpected write() result of "Invalid argument" on device /dev/nst0

This was caused because the max_tape_wait value in the changer.conf file was incorrectly set to a value that was insufficient to allow the robot to unload a tape, store it, fetch the next tape, load it and allow it to get to load point thus resulting in a ready condition not being observed prior to expiration of the max_tape_wait. Neither amlabel or amtape ever complained despite my frequent use of the "slot n" arguments that would cause the changing of tapes prior to labeling or reset.

It is worthy of note that the error posted in the client log file was a broken pipe error. I believe this was caused because the client lost contact with the server when the server stopped the process after exhausting all available tapes. Obviously a symptom but not the root cause.

One thing I still don't understand is the additional time to write a tape. Here's some data from three successive runs. The only thing changed were these setting in changer.conf. Note that the runs on 1105 and 1111 both failed but the last run succeeded.

Run on date 20081112 poll=15, max tape = 180 (just finished successfully)
Run on date 20081111 poll=60, max tape = 120 (failed)
Run on date 20081105 poll=20, max tape = 60 (failed)

PART taper MCNET-005 9 MCS1 /export/ 20081112155621 9/-1 0 [sec 218.504344 kb 2097152 kps 9597.758844]
PART taper MCNET-001 9 MCS1 /export/ 20081111162944 9/-1 0 [sec 223.447785 kb 2097152 kps 9385.423087]
PART taper MCNET-001 9 MCS1 /export/ 20081105192309 9/-1 0 [sec 105.692900 kb 2097152 kps 19841.938295]

note that when max_tape_delay was set to 60 the time to write a 2G block (which was set with the tape_splitsize setting in amanda.conf) was 105s and the speed was 19842kps. From then on the time to write the same size block took ~220s at ~9500kps. I'll play with things a bit and if I make any discoveries I'll post accordingly. This makes the time to write a single LTO3 tape ~8.5 hours compared to 5.3hr. This is a significant increase and doesn't seem to make sense so I'll dig some more to see what I can figure out but hey, it worked so I'm not unhappy.

Thanks to Paul and Dustin for extremely helpful comments and suggestions. Obviously couldn't have gotten here without your help.

I noticed a few nits but will post that in the suggestion box.

Regards and appreciation to all.

Mike

pyeatman
November 17th, 2008, 03:15 PM
That's great! I don't know if Dustin or Jean-Louis have any insight into the change in speed.