Eric Cheng's online journal

File storage and backup for photographers

:: Sunday, January 6th, 2013 @ 2:53:39 am


I spend way too much time thinking about data storage and backup. I’ve been a professional photographer for nearly 10 years, and have accumulated over 10 terabytes of pictures, video, and project data. I have finally implemented a storage and backup scheme that I’m happy with. It took a long time to set up, but I have direct access to all of my media now, and have comfort in knowing that it is securely backed up.

Can I back up to the cloud?

A lot of normal people (non-photographers) are starting to store their pictures exclusively on the cloud, and while there are some great cloud storage services out there that cater to photographers, none of them are really suitable for storing or backing up multiple terabytes of data. Also, uploading to the cloud is slow. A fairly-fast DSL or Cable connection will probably allow you to sustain upload speeds of 200KB/s (I’m being generous). At that speed, uploading 1TB takes about 2 months. Uploading 10TB would take nearly 2 years, and a mainstream ISP will likely throttle you before allowing you to use that much data. So cloud backup is out.
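For anyone who wants to check the math, here is a quick Python sketch of the estimate above. It assumes decimal units (1TB = 10^12 bytes, 1KB = 1,000 bytes) and a perfectly sustained upload rate, which is optimistic; the function name is just for illustration.

```python
# Rough upload-time estimate for the figures quoted above.
# Assumptions: decimal units and a constant, uninterrupted upload rate.

SECONDS_PER_DAY = 24 * 60 * 60


def upload_days(size_tb: float, rate_kb_per_s: float) -> float:
    """Return the number of days needed to upload size_tb terabytes."""
    size_bytes = size_tb * 10**12
    rate_bytes_per_s = rate_kb_per_s * 1000
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    for size_tb in (1, 10):
        days = upload_days(size_tb, 200)
        print(f"{size_tb} TB at 200 KB/s: {days:.0f} days ({days / 365:.1f} years)")
```

At 200KB/s this works out to roughly 58 days for 1TB and about 579 days (1.6 years) for 10TB, before any throttling or interruptions.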

Simplicity

Over the years, I’ve had crazy backup schemes involving multiple computers, multiple software products and services, and custom scripts, all requiring coordinated (but automated) execution to keep my data safe. All of these schemes required that I create flow charts to track how data moved during backups; without that documentation, I might have forgotten how things work, over time.

I’m sick of these crazy schemes, and have finally settled on something much simpler.

Kinds of data

As someone who collects pictures and video files, I think of my data as living in two different categories. I’ve had to simplify my thinking a lot in order to categorize data this way, but that’s OK. It’s got to be rough to be simple.

  1. Working data: all system files, applications, documents, email, and project data (including temporary or intermediate files and final output files)
  2. Raw data: pictures, video, audio and other media generated by cameras or capture devices

The main difference between the two is that Raw data doesn’t really change after it is captured. After a single photography trip, I might have 300GB of pictures and video, which I consider to be Raw data. I may then create 50GB of additional project data over time (e.g., slideshows, produced videos, edited pictures saved as TIFs). I consider all of this to be Working data. Even if it doesn’t change for a long time, I may decide, at any time, to re-open and tweak a project, which means it needs to be backed up again. It also means that I might accidentally screw up a project, so saving multiple versions of Working data is desirable.

Backup requirements

  • Working data should be continuously, incrementally backed up in a versioned manner so I can roll back to a prior state for any given file

  • Raw data should be backed up in a versioned manner as well, but doesn’t need continuous backup. I can kick this off manually, but need to have the discipline to do so regularly.

All data also needs to be stored offsite, so I don’t lose everything if there is a fire or flood.

So here’s how I’ve implemented backup:

My main machine is a mid-2010 Mac Pro. Inside, I have:

[Image: Mac Pro drive configuration]

I am currently using 10TB of the 16TB available space, which gives me 6TB of growing room. 6TB should last a long time… unless I suddenly get a RED camera and start shooting RAW video. :) About 9TB of this data is picture / video data (Raw data), and 1TB is Working data.

I connect a Sans Digital TowerRAID TR5UT+B, which is a 5-bay, USB 3.0/eSATA box that features hardware RAID. The box has 5 x 3TB Seagate Barracuda drives in it, configured as a concatenated array [1] (15TB volume). Accessed over a single eSATA port (port-multiplied), this setup sustains around 90 MB/s, but when using something like rsync, I see transfer speeds between 30 and 70 MB/s. You can also configure the box to use RAID 0 [2] or RAID 5 [3], if you so desire.
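To compare the three modes on capacity alone, here is a small Python sketch. The mode names and the function are made up for illustration; the arithmetic is the standard one (concatenation and RAID 0 use every drive for data, RAID 5 gives up one drive's worth of space to parity).

```python
# Usable capacity for the array modes mentioned above.
# "concat", "raid0" and "raid5" are labels chosen for this sketch,
# not settings taken from the TowerRAID documentation.

def usable_tb(mode: str, drives: int, drive_tb: float) -> float:
    """Return usable capacity in TB for a set of identical drives."""
    if mode in ("concat", "raid0"):
        return drives * drive_tb          # every drive holds data
    if mode == "raid5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * drive_tb    # one drive's worth goes to parity
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("concat", "raid0", "raid5"):
        print(f"{mode}: {usable_tb(mode, 5, 3):.0f} TB usable")
```

With 5 x 3TB drives, that is 15TB for a concatenated array or a RAID 0 stripe, and 12TB for RAID 5.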

For the 1TB of Working data, I use Crashplan for incremental backups to two locations:

  1. a Mac Mini, which has a 3TB drive attached to it via USB 3.0 (backup set includes entire boot drive, as well as Working data)
  2. Crashplan Cloud (Working data only; no system or applications, nor Raw data). The initial backup seed is still in progress: the Crashplan app tells me it will take 5 months to upload 1TB, so I will likely mail in a drive to seed the backup (a service they offer).

I back up the 9TB of Raw data to the TowerRAID using a custom rsync script that supports incremental snapshots (a modified copy of Mike Rubel’s script). It took about 40 hours to do the initial backup (9.3TB over 40 hours is an average of 64.5MB/s), but successive backups take less than an hour. I keep 4 daily snapshots, 3 weekly snapshots, and 3 monthly snapshots. I may add a semi-yearly or yearly snapshot as well. For those of you who are more technical, the script uses hard links for files that have not changed, which means that I can effectively copy those files to a snapshot without using any additional drive space. Only files that have changed are actually copied to the backup during each incremental backup process.
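The hard-link trick is easier to see in a toy example. The Python sketch below is not the rsync script described above; it is a minimal illustration of the same idea, with hypothetical names (make_snapshot, daily.0, daily.1) and a full byte-for-byte comparison where rsync would normally compare size and modification time. Hard links only work when both snapshots live on the same volume.

```python
import filecmp
import os
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def make_snapshot(source: Path, snapshot: Path, previous: Optional[Path] = None) -> None:
    """Copy every file under source into snapshot.

    Files that are byte-identical to the copy in `previous` are hard-linked
    instead of copied, so they take up no extra drive space.
    """
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue  # directories are created on demand below
        rel = src_file.relative_to(source)
        dest = snapshot / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        old = previous / rel if previous is not None else None
        if old is not None and old.is_file() and filecmp.cmp(src_file, old, shallow=False):
            os.link(old, dest)            # unchanged: share the existing data
        else:
            shutil.copy2(src_file, dest)  # new or changed: store a fresh copy


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "raw"
        source.mkdir()
        (source / "unchanged.cr2").write_bytes(b"same bytes")
        (source / "edited.cr2").write_bytes(b"version 1")
        make_snapshot(source, root / "daily.1")

        (source / "edited.cr2").write_bytes(b"version 2")
        make_snapshot(source, root / "daily.0", previous=root / "daily.1")

        shared = os.path.samefile(root / "daily.0" / "unchanged.cr2",
                                  root / "daily.1" / "unchanged.cr2")
        print("unchanged file shares storage:", shared)
        print("old version kept:", (root / "daily.1" / "edited.cr2").read_bytes())
```

Each snapshot directory looks like a complete copy of the source, but only the changed file occupies new space. rsync offers comparable behavior through its --link-dest option.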

I actually back up my entire computer, including both Working and Raw data, to the TowerRAID. Why not? I have the space to do so, and it doesn’t take that much more time.

Why snapshots?

Why use a crazy snapshot script to version files instead of just cloning a drive using SuperDuper!? Recently, two of my photographer friends discovered that they had some corrupted pictures. Both their master and backups were corrupted because once the master copy was corrupted, future backups were also corrupted. Luckily, both of them had very old backups that they used to restore good versions of the files. With versioned backups, the backup will notice that the file is different (potentially, corrupted) and make a new version. It keeps the old version so you can always go back.
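Here is a toy Python sketch of the difference between a clone and a versioned backup. It is only an illustration under simple assumptions: the function names are hypothetical, and a SHA-256 checksum stands in for whatever change detection a real backup tool uses.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def clone(src: Path, backup_dir: Path) -> Path:
    """Clone-style backup: the newest copy overwrites the previous one."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, backup_dir / src.name))


def backup_versioned(src: Path, store: Path) -> Path:
    """Versioned backup: changed content is stored next to older versions."""
    dest = store / src.name / sha256_of(src)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
    return dest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        photo = root / "photo.cr2"

        photo.write_bytes(b"good image data")
        clone(photo, root / "clone")
        good = backup_versioned(photo, root / "versions")

        photo.write_bytes(b"corrupted!")  # the master copy goes bad
        clone(photo, root / "clone")
        backup_versioned(photo, root / "versions")

        print("clone holds:", (root / "clone" / "photo.cr2").read_bytes())
        print("versions kept:", len(list((root / "versions" / "photo.cr2").iterdir())))
        print("good copy intact:", good.read_bytes() == b"good image data")
```

The clone ends up with only the corrupted file, while the versioned store keeps both copies, so the good one can still be restored.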

Other notes:

  1. For much of my active data, I work out of Dropbox, which is a fantastic cloud sync service. All data in Dropbox is instantly backed up, versioned, and accessible to any device. It works very well, and nearly everyone I know uses the service.
  2. I use SuperDuper! to maintain a bootable clone of my machine’s boot disk. If the drive fails, I want to be able to boot up and be productive immediately. I do this every once in a while, but am not too rigorous about doing it frequently. If you’re a Windows person, try Acronis True Image, instead.
  3. I actually have two of the TowerRAID boxes, each with 5 x 3TB drives installed. One is configured as a concatenated array (as described above), and the other, as a RAID 0 stripe. One is stored offsite, and the other lives at home. I back up regularly to the box at home, and periodically swap it out with the one that is stored offsite.

There is a full list of all of the hardware referred to in this article over at my refer.ly page. Full disclosure: I get referral fees for many of the items on that page. Feel free to click through from there if you’d like to, but don’t feel obligated to do so.

Backups in the field are another topic, which I’ll write about at a later date.

What do you use to back up your data? I’m very interested in how other photographers, or people with large data sets, keep their data secure.


  1. In theory, a concatenated array, which the box supports via switches, results in the loss of only a single drive’s worth of data if a drive fails. In practice, I’ve never had to deal with a failure in this kind of array, so I’m just guessing. 

  2. A RAID 0 stripe is an option as well; I see 130 MB/s from a RAID 0 stripe over a single eSATA port, and real-world rsync speeds of 80 MB/s. This is much faster than using a concatenated array, but you lose the entire set if a drive fails instead of losing only a single drive’s worth of data. 

  3. You should think of RAID 5 as a way to not lose your data if 1 drive fails, but I wouldn’t assume that you can rebuild the RAID successfully if you have a lot of data. If a drive fails, copy the data off as soon as possible and start over. Considering that RAID 5 performance degrades a lot once a drive fails (by up to 80%, according to stuff I’ve read on the internet), this may take a long, long time. In my opinion, it’s best to assume the entire volume is toast when a single drive fails, so multiple backups are necessary. I much prefer newer, proprietary RAID implementations like Synology Hybrid RAID, which are dynamically expandable and allow for 2-drive redundancy. 

| San Francisco | Jan 6, 2013 02:53:39