
Thread: De-Duplication Support?

  1. #1

    De-Duplication Support?

    Hello everyone,

    Does anybody know if there are plans to integrate de-duplication support into Amanda? This feature would be absolutely great; just think about how many files you back up every day from XX clients that are 100% identical (/usr/bin, /usr/sbin etc. on Linux systems, for example).

    Greetings

    Juergen

  2. #2
    Join Date
    Sep 2007
    Location
    Massachusetts
    Posts
    58


    This would be extremely complicated. It's been on Bacula's wish list for years and has never been implemented. Amanda's architecture, using native backup tools on the clients, would require a totally new infrastructure to communicate file information from clients to the server and back to other clients, as well as some mechanism for the clients to use that information. An alternative approach would require a huge amount of processing and analysis on the server to remove duplicates and then to insert appropriate copies back in to reconstruct the image before a recovery, and that would have to take into account each of the various native tool formats. Unless some developer comes in with a stroke of genius, I think it is safe to assume it won't ever happen. The backup programs that have this feature use proprietary formats and write their own backup mechanisms for every supported platform.

  3. #3
    Join Date
    Oct 2005
    Posts
    7


    We (at Zmanda) have been thinking for a while about how to add this feature to Amanda environments. So far, there are a couple of ways of doing this:

    1. Use a storage device which does block-level de-dup as the backup device for Amanda, e.g. Data Domain. We tested a Data Domain device with Amanda a few months ago, with pretty good results. This is completely transparent to Amanda.

    2. Integrate BackupPC with Amanda. BackupPC is an open source tool which does file-level de-dup, but doesn't have media management. Paddy was doing some initial research on this; I will ask him to add his comments to this thread when he makes some progress.

    If anyone has any other ideas/experiences on how to incorporate de-dup functionality in an Amanda-based backup environment, please do post them here.

    thanks

  4. #4
    Join Date
    Oct 2007
    Location
    Gold Coast, Australia
    Posts
    9

    Dirvish

    Dirvish can do this "de-duplication" by replacing duplicate files with hard links. From the Debian package description: "common files are shared between the different backup generations". Not as powerful as Amanda though :-)

    [url]http://dirvish.org/[/url]
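
    For anyone curious, the link trick works roughly like this. Below is a minimal Python sketch of the idea (not Dirvish's actual code; the function name and paths are invented for illustration): if a file is unchanged since the previous backup generation, it is hard-linked rather than copied, so the two generations share a single copy on disk.

    [code]
# Minimal sketch of link-based de-duplication between backup generations
# (illustration only, not Dirvish itself).
import filecmp
import os
import shutil

def backup_file(src, prev_gen, new_gen, relpath):
    """Copy src into new_gen, hard-linking to prev_gen when unchanged."""
    old_copy = os.path.join(prev_gen, relpath)
    new_copy = os.path.join(new_gen, relpath)
    os.makedirs(os.path.dirname(new_copy), exist_ok=True)
    if os.path.exists(old_copy) and filecmp.cmp(src, old_copy, shallow=False):
        os.link(old_copy, new_copy)   # same inode, so no extra disk space used
    else:
        shutil.copy2(src, new_copy)   # new or changed file: store a real copy
    [/code]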

  5. #5

    de-duplication

    IMHO, de-duplication & CDP are essential functions for any "enterprise" recovery tool (closed source or open). With CDP now covered (at least for MySQL), it appears that de-duplication would be the next logical step. As enterprises grow and/or decentralize operations while continuing consolidation efforts, centralized disk/tape backup that covers everything from mobile end-users to server farms and remote offices is shaping the landscape of high-end recovery solutions.

    After checking out BackupPC and having limited experience with Data Domain's product line, it seems to me that both approaches should be explored. Integrating existing open source code for software-based de-duplication, while providing integration guidance for hardware-based de-duplication solutions, would deliver a range of cost-effective and scalable options.

  6. #6


    Hi,

    Please don't mistake file-level dedup for block-level dedup. File-level is somewhat OK, but far from a good solution for minimizing data, and often the "duplicate" files have small differences anyway (that's largely why duplicate files exist in a filesystem), so file-level dedup on its own is not good enough.

    Block-level dedup is the answer to all of our problems with duplicate data (blocks), whether it belongs to a duplicate file or not; it really doesn't matter which file it is. For example, say you have 3 files each containing 8192 'A's in a filesystem with a cluster size of 1024 B. Block-level dedup looks at the blocks already saved on the backup server: the first 1024 B block of 'A's is stored as-is, and the remaining 23 blocks (23 x 1024 B = 23,552 B across the three files) simply become references to that first block. This is a really simple explanation of block dedup.
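
    Just to make the idea concrete, here is a toy Python sketch (illustration only, not Amanda's implementation; the names are made up) of what a fixed-size block store does with that example:

    [code]
# Toy fixed-size block-level de-duplication: each 1024-byte block is stored
# once, keyed by its hash, and every repeat becomes a reference to that copy.
import hashlib

BLOCK_SIZE = 1024

def dedup_store(data, store):
    """Split data into blocks, keep only unique blocks, return reference keys."""
    refs = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        key = hashlib.sha256(block).hexdigest()
        store.setdefault(key, block)   # stored only the first time it is seen
        refs.append(key)               # every occurrence is just a reference
    return refs

store = {}
for _ in range(3):                     # the three files of 8192 'A's
    dedup_store(b"A" * 8192, store)
print(len(store))                      # 1 unique block kept instead of 24
    [/code]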

    This way, the more data you back up, the more space you save. There have been cases where you save up to 70% of the space; that is, if you had 10 TB of disk storage for backup, you would save 7 TB of disks. That's a lot of storage and money.

    Hopefully Amanda will begin looking at block-level backups with dedup technology; when that happens, the big companies like IBM (TSM) and Symantec (NetBackup) will have a serious competitor in Amanda.
    Last edited by petergrande; June 15th, 2008 at 03:55 PM.

  7. #7
    Join Date
    Oct 2005
    Posts
    58


    Block-level deduplication is best handled by appliances such as Data Domain (DD).
    Data Domain is a technology partner of Zmanda, and together we have qualified both Amanda Community and Amanda Enterprise with DD appliances. In our testing we achieved a 1:22 compression ratio after 19 backup runs to vtapes created on a DD appliance over NFS.
    ---------
    Dmitri Joukovski

  8. #8

    De-dup best done with appliances

    Not exactly! This is a very common misconception. The problem is that de-duplication at the appliance level does not reduce network traffic, which matters particularly in large enterprises where a huge number of large virtual servers must be backed up.

    File-level de-duplication, such as CommVault's SIS (Single Instance Store), helps for backup at the file level. However, block-level dedupe is required for DBMS applications.

    For a few servers this is no problem, but when you get into hundreds or thousands of servers, de-dupe at the host is essential!

  9. #9

    Thoughts

    Dear Zmanda,

    You have a client on each server. I would have thought that you would want to save bandwidth, time and disk space as part of dedup.

    As 60-70% of most servers is common libraries etc., and assuming patching is maintained consistently, I think you would want each client to run an MD5 over all files on the server and then submit the checksums to the backup server for comparison. This would achieve two things:

    1) You could dedup common files on the server as part of the client implementation
    2) You could dedup across servers as part of the backup server implementation

    The backup server would then request only a single copy of each file it needs from the servers; the rest would be de-duped and never transmitted across the wire. This would save bandwidth, time (since so few files would actually be sent) and large amounts of disk space. The overhead would be the MD5 checksum processing time on the clients.
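
    A rough sketch of that exchange (purely hypothetical Python; nothing like this exists in Amanda today, and the function names are made up):

    [code]
# Hypothetical client/server file-level dedup handshake: each client sends an
# MD5 catalog of its files, the server answers with the digests it has not
# stored yet, and only those files are actually transmitted over the wire.
import hashlib
import os

def md5_catalog(root):
    """Client side: map md5(file contents) -> one representative path under root."""
    catalog = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                catalog[hashlib.md5(f.read()).hexdigest()] = path
    return catalog

def files_to_transmit(catalog, server_index):
    """Server side: request only files whose digest is not already stored."""
    return [path for digest, path in catalog.items() if digest not in server_index]
    [/code]

    In practice you would hash large files in chunks rather than reading them whole, but the principle is the same.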

    Anyway, those are my thoughts.

    Yours sincerely

    David Ruwoldt

  10. #10
    Join Date
    Aug 2008
    Posts
    184


    We are actively working on integrating BackupPC-based source-level deduplication in the [URL="http://www.zmanda.com/zba.html"]Zmanda Backup Appliance[/URL]. Please look for an announcement soon!
