Hi Everyone,

I have a HUGE data management problem that I would like to solve with Amanda, if possible. I'll describe the issue, my current setup, and what I am looking to do.

I am a system engineer for a Next Generation DNA Sequencing (NGS) group. We run a number of NGS systems that generate up to 3 TB of data per run. Some of the datasets we generate cannot be deleted (publication data, critical development milestones, rare data, etc.), and we cannot keep them on spinning disk because we have tight facilities restrictions (AC/power/rack space). To deal with our never-ending data growth, we need an on-demand offline (non-spinning) storage tier to act as a relief valve for our existing spinning storage.

Our current hacked-together solution is a bunch of hot-swap SATA drive bays connected via eSATA to a tower server. It works, but it is very time- and labor-intensive.
Here is an example of the drive bays we use.


First, we sequester the dataset into an "archiving" directory. Then we compress the data. Next, we format the drives as ext3 and uniquely label them. The final step is to move each dataset (files and directories) via rsync to one or more drives. We run the rsync on a dataset until either all of the data is transferred, or the drive fills up and we need to continue the rsync on the next available drive. After the rsyncs are complete, we have some scripts that take an inventory (at the top level of each drive) and dump it into text files. Another script queries those text files to find specific datasets if we need to restore them.
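To make the "continue on the next drive" and inventory steps concrete, here is a rough Python sketch of the logic. It is not our actual scripts; the function names, the drive labels and the idea of planning by file size up front are all assumptions for illustration only.

```python
from typing import Dict, List, Tuple


def plan_drives(files: List[Tuple[str, int]], drive_capacity: int) -> List[List[str]]:
    """Assign files, in order, to drives; move to a new drive when the
    current one cannot hold the next file. Sizes and capacity share one unit."""
    drives: List[List[str]] = [[]]
    free = drive_capacity
    for path, size in files:
        if size > drive_capacity:
            raise ValueError(f"{path} is larger than a single drive")
        if size > free:
            # Current drive is full: continue on the next available drive.
            drives.append([])
            free = drive_capacity
        drives[-1].append(path)
        free -= size
    return drives


def build_inventory(drives: List[List[str]], labels: List[str]) -> Dict[str, List[str]]:
    """Map each top-level dataset directory to the drive labels holding part of it."""
    inventory: Dict[str, List[str]] = {}
    for label, paths in zip(labels, drives):
        for path in paths:
            dataset = path.split("/", 1)[0]  # top-level directory name
            held_on = inventory.setdefault(dataset, [])
            if label not in held_on:
                held_on.append(label)
    return inventory


if __name__ == "__main__":
    # Hypothetical dataset: (relative path, size in GB), with 100 GB drives.
    files = [("run1/a.tar.gz", 60), ("run1/b.tar.gz", 50), ("run2/c.tar.gz", 30)]
    plan = plan_drives(files, drive_capacity=100)
    print(build_inventory(plan, labels=["NGS-0001", "NGS-0002"]))
```

Looking up a dataset for a restore is then just a dictionary lookup on the inventory, which is roughly what our text-file query script does today.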

It is a very labor-intensive solution, but it was braindead simple to set up and very low cost.

Now, I would like to move this from bare hard drives to tape for the following reasons, in no particular order:
  1. Automate the inventory of the data on the tapes
  2. Can ship tapes to an offsite data warehouse (like Iron Mountain)
  3. Can automate the switching of archive volumes when one volume has filled but there is still more data to archive from a dataset
  4. Ideally, the backup client would monitor an "Archive" directory, looking for any new entries, and automatically start the transfer to tape
  5. Automate the compression of the data onto tape
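To illustrate item 4, here is a minimal polling sketch in Python of what I mean by monitoring an "Archive" directory. It is only an assumption about how such a watcher could look: start_transfer is a hypothetical callback standing in for whatever actually writes to tape, and nothing here is an Amanda feature or API.

```python
import os
import time
from typing import Callable, List, Optional, Set


def find_new_entries(archive_dir: str, seen: Set[str]) -> List[str]:
    """Return entries in archive_dir that are not yet in `seen` (sorted),
    and record them in `seen` so they are only reported once."""
    new = sorted(set(os.listdir(archive_dir)) - seen)
    seen.update(new)
    return new


def watch(archive_dir: str,
          start_transfer: Callable[[str], None],
          interval: float = 60.0,
          max_polls: Optional[int] = None) -> None:
    """Poll archive_dir and call start_transfer(path) once per new entry.

    start_transfer is a hypothetical hook (e.g. it could queue a tape job).
    max_polls=None polls forever; a number stops after that many polls.
    Note: a real watcher would also need to wait until a dataset has been
    completely moved into the directory before transferring it.
    """
    seen: Set[str] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for name in find_new_entries(archive_dir, seen):
            start_transfer(os.path.join(archive_dir, name))
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval)


if __name__ == "__main__":
    # Example: print each new dataset found in ./Archive, checking 3 times.
    os.makedirs("Archive", exist_ok=True)
    watch("Archive", start_transfer=print, interval=1.0, max_polls=3)
```

Polling is crude, but it works the same way on local and network-mounted file systems, which matters here since the "Archive" directories live on several storage servers.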

Now, I may just not be able to get my head around this, but none of the tape software solutions out there seem to address this problem the way I am looking at it. We are not trying to do a typical scheduled backup of the data. This is an on-demand archiving problem. We don't want to image the entire file system, just one directory or a few directories at a time.

So, my question is: can Amanda be made to work in an on-demand configuration? Can Amanda be configured to monitor an "Archive" directory on various storage servers and automatically start pulling directories off once they have been placed into it? I know Amanda can handle the tape changing/label tracking/compression part of this problem.

If anyone has any pointers to documents describing a solution like the one I've described, I would be most grateful for the heads up. Like I said, most of the docs on the Amanda wiki seem to be focused on "Scheduled Daily/Weekly/Monthly - Full/Incremental backup of entire file systems" and not on an on-demand file/directory-level archiving setup.

Thank you for any assistance and please feel free to ask any questions for clarification.

- Mike