Managing Files

Jun 19

Do you keep all your RAW files? What about JPGs if you shoot R+J? Are SSDs and other flash-based drives really worth it? What about a NAS (network attached storage)? Is cloud storage worthwhile?

Let’s sort through it! (Bad dad-joke pun intended.)

When one begins their journey into the realm of photography, the focus is often on cameras and lenses, tripods and equipment bags. Digital storage workflow is seldom at the top of the list, yet it’s crucial to consider early on if one wants to avoid a loss of data or an administrative overhead nightmare.

Folder Structure

My approach to file and folder structure is to create a folder for each year, then within those folders I have “session” folders which start with YYYY-MM-DD followed by a description. For example, I might have a folder labeled “2020-05-25 Graduation Pics, Downtown”. Using a YYYY-MM-DD format automatically sorts the folders chronologically, and the label helps me recall what the theme of the images is.
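To make the naming scheme concrete, here is a minimal Python sketch of how a session folder path could be built. The function name and the “Pictures” root are assumptions for illustration only; the year folder and the “YYYY-MM-DD description” pattern are the parts taken from the approach above.

```python
from datetime import date
from pathlib import PurePosixPath


def session_folder(root: str, shot_on: date, description: str) -> PurePosixPath:
    """Build <root>/<YYYY>/<YYYY-MM-DD description> for one session."""
    name = f"{shot_on.isoformat()} {description.strip()}"
    return PurePosixPath(root) / f"{shot_on.year:04d}" / name


# "Pictures" is a hypothetical root folder.
print(session_folder("Pictures", date(2020, 5, 25), "Graduation Pics, Downtown"))

# ISO dates sort chronologically even when sorted as plain text.
names = ["2020-11-02 Fall Colors", "2020-05-25 Graduation Pics, Downtown", "2020-01-09 Snow Day"]
print(sorted(names))
```

PurePosixPath is used here only so the printed result looks the same on every system; a real script would more likely use pathlib.Path and create the folder with mkdir(parents=True, exist_ok=True).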

It gets a bit more complicated from here.

Let’s consider a current session, and separate the images into two categories: “processing” and “final”. Images which are currently being processed are those which haven’t been sorted through yet, or those currently queued up for post-processing work. If I took, say, 250 images during a session, all 250 will have been imported into Capture One (or Lightroom) to begin my post processing workflow.

Side Quest: My Post Processing Workflow

Actually, this was enough information to warrant its own post; feel free to hop over and read it if you’d like.

A Word About Macs

A significant portion of this post is going to get into the weeds a bit about various drive technologies. Some of it is applicable to Apple hardware, but keep in mind that most modern Macs do not allow for the internal drive to be replaced. The storage is typically soldered to the logic board and cannot be swapped out. That leaves external storage and cloud storage as the only options for expanding storage.

Some of the other topics are still relevant — NAS, RAID configuration of external drives, etc — just keep in mind that you’ll need to know your drive needs ahead of time if you’re purchasing a Mac.

Drive Technologies

Let’s quickly dissect various drive technologies, and their pros & cons.

Traditional, spinning/magnetic hard drives (HDD’s)
- Connection type is typically SATA III (on any modern system), which is restricted to 600 MB/s max throughput (although the drives are usually slower than the maximum port speed). A physical “needle” (the read/write head) reads from and writes to a spinning magnetic platter (or stack of platters). Generally comes in 3.5-inch sizes, but 2.5-inch laptop-sized drives are available too.
- Pros: least expensive solution; large storage capacity; read speeds are typically just fine; great for NAS solutions.
- Cons: more prone to fail (moving parts); noise and heat; storage is “linear” so reading or writing information becomes bottlenecked if it’s asked to multitask; data becomes fragmented, which will slow the drive down over time if defragmentation tasks aren’t run to reorganize the data; generally the slowest option, particularly when writing.
Solid State Drives (SSDs)
- Connection type is typically SATA, same restriction as above; however, SSDs will actually be able to take advantage of the full bandwidth available to the port. They’re based on flash storage, so there are no moving parts. Typically come in 2.5-inch size, although M.2 SSDs are available (more on that in a bit).
- Pros: Reasonably affordable; non-linear storage so multitasking won’t bottleneck; no fragmentation penalty; no moving parts, so no noise and very little heat; less prone to failure, but definitely not immune.
- Cons: Price per GB is still far higher than HDDs; still bottlenecked by SATA bandwidth limits; 2.5-inch drives still have to be mounted somewhere in the system, then connected & powered via cables; the price gap to NVMe options is no longer large enough to warrant consideration for primary drives on modern systems.
NVMe
- Flash-based storage, similar to SSDs; however, NVMe drives connect via an M.2 port on a motherboard. It’s important (albeit confusing) to note that an M.2 port can exist on a board but actually be SATA only, which is not compatible with NVMe. While SATA has its own bus controller to communicate with the CPU and RAM, NVMe uses the PCI-E bus (similar to a video card) to communicate with the CPU and RAM, which is far faster. Speeds can range from 2,000 MB/s to nearly 4,000 MB/s, roughly three to six times SATA’s 600 MB/s ceiling.
- Pros: Clearly, the speed; similar to SSD’s, flash is non-linear and can handle multitasking; no cables for connection or power; typically the size of a stick of gum or a single stick of laptop RAM, simplifying a system’s potential layout.
- Cons: The most expensive single-drive option, although prices have decreased to be only moderately higher than SSD drives, and the performance increase is well worth the price delta; most motherboards only have 1 M.2 slot (some do have 2), so adding additional NVMe storage can be difficult without potentially requiring a complete reformat (or a complicated cloning process).
External Drives
- Typically these are encased HDDs, so they’re spinning disks. The most popular connectivity type is USB 3.0, although eSATA was popular for a hot second; on the Apple side, Thunderbolt ports are available (but more costly). Some external drives do utilize SSDs, but I’ll be assuming HDDs here as they are most common.
- Pros: Portability; USB 3.0’s bandwidth is on par with SATA; increases storage without having to access system internals; can be placed conveniently for removal or replacement if needed; cost per GB of storage is generally similar to internal HDD’s.
- Cons: Virtually all the same as HDD’s (noise, heat, moving parts, linear storage); are often utilizing drives with lower yield rates, meaning that the drives are less reliable because they’re not expected to be used as extensively as an internal drive; requires an external power source and a USB cable connected to the system, which can be inconvenient.
Network Attached Storage (NAS)
- A NAS is typically a dedicated storage system that leverages multiple drives. It plugs into your network and is accessible as network storage. Some may include Wi-Fi capabilities, but a wireless connection is usually slower than a wired one. Most can be configured to increase redundancy at the expense of performance, or to increase performance at the expense of redundancy. Most use multiple HDDs, although some support a mixture of drive types. Most NAS devices are prebuilt and sold as such, but NAS devices can also be custom built with open-source operating systems.
- Pros: Accessible by more than one user at a time; great for families/roommates/small groups of coworkers who need shared storage or to share files amongst the group; redundancy options reduce the likelihood that drive failure causes permanent data loss; performance options allow for HDD’s to be leveraged for their large storage size while mitigating the bottleneck limitations of linear read/write operations; can be physically placed anywhere so long as network connectivity is accessible.
- Cons: Prebuilt NAS devices can be costly, as they are essentially computers dedicated to providing network storage; performance-inclined configurations mitigate drive read/write speed bottlenecks (meaning more users can access the system at once) but speed is restricted by the network connection to the device; drives themselves have similar cons as HDD’s (noise, heat, moving parts); must be placed external and use an external power source; takes up about as much space as a small PC would; requires configuration and occasional maintenance.
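To put the throughput figures above into perspective, here is a small back-of-the-envelope sketch in Python. The 600, 2,000 and 3,500 MB/s figures come from this post; the 150 MB/s HDD figure and the 40 MB per RAW file are assumptions chosen purely for illustration, and real-world copies will be slower than these best-case numbers.

```python
def copy_seconds(size_gb: float, throughput_mb_s: float) -> float:
    """Best-case seconds to move size_gb at a sustained throughput (1 GB = 1000 MB)."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb * 1000 / throughput_mb_s


# Hypothetical session: 250 RAW files at an assumed 40 MB each = 10 GB.
SESSION_GB = 250 * 40 / 1000

for label, speed in [
    ("HDD (assumed 150 MB/s)", 150),
    ("SATA ceiling (600 MB/s)", 600),
    ("NVMe low end (2,000 MB/s)", 2000),
    ("NVMe high end (3,500 MB/s)", 3500),
]:
    print(f"{label}: {copy_seconds(SESSION_GB, speed):.1f} s")
```

The point is less the exact numbers than the ratios: the same session that takes about a minute on a spinning disk is a matter of seconds on flash storage.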

To RAID or Not To RAID

RAID stands for “redundant array of inexpensive disks” and is probably more accessible than most realize. As the name implies, disks configured for RAID are referred to as “an array”. A multi-drive NAS will almost always be utilizing some sort of RAID configuration. RAID works best when all drives in an array are the same size: each drive is generally treated as if it were the size of the smallest one (i.e. you can’t RAID a 2TB and a 4TB drive and access all the space of both drives).

I’m not going to go into every possible RAID configuration, but here are a few basics:

RAID-0 (zero), aka Striping: two or more disks are configured to be treated as one device; data is split into chunks (“stripes”) that are spread across all of the drives in the array. The amount of available storage is directly multiplied by the number of disks. The more drives in an array, the greater the performance, since each drive only has to read or write its own share of the data while the other drives handle theirs at the same time. It’s by far the highest performing configuration, and can reach or surpass SATA SSD throughput with HDD technology. The major drawback is that if one drive fails, ALL of the data becomes inaccessible on the ENTIRE array and is essentially lost. RAID-0 is best used for temporary or scratch disk usage, or when performance is absolutely critical & the data is securely backed up.
RAID-1, aka Mirroring: Two disks are configured to be treated as one device; however unlike R0 / striping, data written to the array is duplicated and written to both drives separately. The amount of available storage is equal to one drive, or half the total storage consumed. Performance is on par with, or slightly below, the speed of the drive itself. The major advantage of R1 is the redundancy: if one drive fails, the second drive has a copy of the data and is still accessible. A failed drive can be replaced and the data replicated. Another nice feature is that unlike a traditional backup process, the data is duplicated as it’s written to the device. R1 is best used when performance is not critical but redundancy is, and when storage space needs do not exceed that of a single drive.
RAID-5, aka Parity: R5 requires a minimum of 3 disks. Data is written across all disks, similar to striping; however, one drive’s worth of capacity, spread across all of the drives, is reserved for “parity” information: a computed value from which a missing piece of data can be worked out. As an example: if a R5 has 3 disks, and each disk is 3TB in size, then each disk gives up 1TB of space to hold parity for the data written to the other two drives. This leaves 6TB in available storage space. Any one drive can fail, and the parity information keeps the data accessible until the drive is replaced; the replacement drive can then have its contents rebuilt from the data and parity on the other two drives. Performance is greater than R1 since multiple drives are used, and storage availability is higher; however, the added read/write processes reduce overall performance so that it’s not as fast as a R0 configuration. RAID-5 was, and often still is, the “standard” configuration when the term “raid” is used casually.
Other options, such as RAID-6 (like R5 but with 2 drives’ worth of parity) or RAID-10 (drives are mirrored in pairs, and data is then striped across those mirrored pairs) are available. R10 is becoming more popular where large quantities of inexpensive drives can be used; it’s probably the highest performance option while also providing redundancy, but it’s not terribly practical for the average home or small business user.
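The capacity trade-offs and the parity idea can both be sketched in a few lines of Python. This is a simplified illustration, not how any particular controller or operating system implements RAID: the function names are made up for this example, every drive is treated as the size of the smallest one, and the parity demo covers a single stripe using XOR (real RAID-5 rotates parity across all of the drives).

```python
from functools import reduce
from operator import xor

# Minimum number of drives per level, as described above.
MIN_DRIVES = {"0": 2, "1": 2, "5": 3, "6": 4, "10": 4}


def usable_tb(level: str, drive_tb: list[float]) -> float:
    """Usable capacity, treating each drive as the size of the smallest one."""
    if level not in MIN_DRIVES:
        raise ValueError(f"unsupported RAID level: {level}")
    n = len(drive_tb)
    if n < MIN_DRIVES[level]:
        raise ValueError(f"RAID-{level} needs at least {MIN_DRIVES[level]} drives")
    if level == "10" and n % 2:
        raise ValueError("RAID-10 needs an even number of drives")
    data_drives = {"0": n, "1": 1, "5": n - 1, "6": n - 2, "10": n // 2}[level]
    return data_drives * min(drive_tb)


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


print(usable_tb("5", [3, 3, 3]))  # three 3TB drives in RAID-5
print(usable_tb("0", [2, 4]))     # mismatched drives in RAID-0

# One stripe on a three-disk RAID-5: two data blocks plus their parity.
disk_a, disk_b = b"RAW-0001", b"RAW-0002"
parity = xor_blocks(disk_a, disk_b)

# Pretend disk_b failed: rebuild its block from the surviving data and parity.
rebuilt_b = xor_blocks(disk_a, parity)
print(rebuilt_b == disk_b)
```

The second capacity call shows why mismatched drives are wasteful: pairing a 2TB and a 4TB drive yields 4TB of usable space in this model rather than 6TB.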

So what did I mean when I said “RAID is more accessible than some realize”? Traditionally, RAID required a dedicated hardware controller (some motherboards do include hardware controllers integrated); however, software RAID has been increasingly available on just about every operating system. When software RAID first became widely available, compute capabilities were much, much lower than they are today, so it was somewhat dismissed as anything more than experimental. Today, however, the overhead necessary to manage a software RAID configuration on a modern computer is next to nothing. Additionally, unlike a hardware RAID controller — which could corrupt an array if the hardware experienced any sort of failure or glitch — software RAID options are generally much more stable.

So why or when would you want to consider a software RAID solution?

Some modern external drives are sold as comprehensive backup solutions by leveraging software RAID capabilities. For example, I’ve seen external USB drives that are mirrored when both are connected, but can also be individually accessed.

A software RAID solution would be a great option for a user with additional drives of similar size available to them, perhaps salvaged from past systems. Configuring a software R0 on a local PC can be a great solution for photo or video editing. It’s not too uncommon for this to be set up with smaller drives for “scratch” or temporary drives: dedicated storage for software to toss data into, thrash it around, and then produce a result.

A R0 can also be used for installing applications or even storing data, so long as the applications can be easily reinstalled or the data is backed up and easily restored in the event of a drive failure. SSDs can be striped to increase the size of a “single” volume and to achieve I/O (input/output) performance greater than the SATA bottleneck. A local mirror could be set up to ensure that data only has to be output to one “device” one time, yet is written to two distinct drives.

The one use-case I’d never use software RAID for is your system / OS drive. Operating system support for booting from a software array is limited and varies by platform; a hardware controller card would make it possible to do so, but it’s strongly discouraged.

Other Storage Technologies

There are other options for managing drives than just RAID. LVM (logical volume management) is available in Linux: physical drives or partitions are pooled into “volume groups”, and “logical volumes” are then carved out of a volume group and can be grown, shrunk, or removed later. While somewhat similar to software RAID, it’s more flexible with the modification of the volumes. One can even combine the two technologies, creating software RAID arrays and then adding those arrays to a volume group.

Cloud Storage

“What about the cloud?” Cloud storage is essentially the leasing of storage in someone else’s datacenter. Popular cloud storage solutions include Dropbox, Google Drive, OneDrive, and more. Adobe also offers cloud storage with some of its subscription services.

Cloud storage is dependent upon, and restricted by, your internet connectivity and bandwidth. Typically, a local copy of your data is stored and then synchronized with a cloud service; it can take a while for an initial set of data to fully synchronize, but afterwards, only your modifications are pushed to the cloud storage location.
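For a rough sense of how long that initial synchronization can take, here is a small Python sketch. The function name, the 80% efficiency factor, and the example numbers are all assumptions for illustration. Note that internet plans are usually quoted in megabits per second (Mbps), which is one eighth of a megabyte per second.

```python
def upload_hours(data_gb: float, upload_mbps: float, efficiency: float = 0.8) -> float:
    """Hours to upload data_gb over a link rated at upload_mbps (megabits/second).

    efficiency is an assumed fudge factor for protocol overhead and
    other traffic sharing the connection.
    """
    if upload_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("upload_mbps must be > 0 and efficiency in (0, 1]")
    megabits = data_gb * 1000 * 8  # GB -> MB -> megabits
    return megabits / (upload_mbps * efficiency) / 3600


# Hypothetical: 500 GB of existing photos on a 20 Mbps upload link.
print(f"{upload_hours(500, 20):.1f} hours")
```

After that first pass, only new or changed files need to be uploaded, so day-to-day syncing is a small fraction of this.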

Pricing varies by provider. It’s not uncommon for cloud storage to be bundled with other services, like the aforementioned Adobe option. Microsoft offers 1TB of OneDrive storage for Microsoft 365 subscribers (formerly Office 365).

Cloud storage is not meant to be used as direct storage. What I mean is that you wouldn’t (nay, couldn’t) store your images in a cloud storage location and then edit them directly off the cloud storage without any use of local storage.

So What Should I Use?

The answer to this is dependent on your needs, but the most comprehensive answer would be “a little bit of everything”:

An NVMe or SATA SSD should be used for your system / operating system drive;
If your system drive is not large enough, a second NVMe or SATA SSD for applications and file modification (such as photo editing) is a good solution to ensure performance;
If you need storage space, an HDD inside your PC can provide inexpensive “cold storage” for holding files that aren’t regularly accessed or modified;
Using a software RAID configuration with internal drives can improve redundancy or performance of existing drives;
If you need the storage capacity of multiple HDD’s and/or the functionality of multiple users, a NAS is a great solution;
An external USB drive is great for a local backup of your data, to prevent data loss in the event of hardware failure or catastrophe (just don’t forget to grab the drive, if it’s safe to do so, in the event that the physical location is in danger);
Cloud storage can automatically back up critical data to an offsite location, but is limited by your bandwidth and the amount of capacity you’re willing to pay for.

My Data Storage Setup

I have a custom built PC as my primary system; I’ve run Linux (both Ubuntu and Pop!_OS) on it, but I’m currently using Windows 10 simply because neither of the photography editing packages I use is supported on Linux… yet. Even gaming on Linux has almost entirely caught up to Windows. Don’t misunderstand, though: I think Windows 10 is a superb operating system. As is macOS. There are pros and cons to them all, but at the end of the day, just pick the one that gets the job done for you.

My PC has an Asus motherboard, which has two NVMe capable M.2 slots. I have a 2TB NVMe as my “system” drive, which runs the OS and applications, as well as holds my primary “user” data (documents, pictures, etc). It’s an Intel NVMe which was more affordable for its capacity because its write speed isn’t quite as fast as most NVMe drives: instead of averaging 3,500 MB/s, its sustained average write is closer to 1,800 MB/s. Keep in mind, though, that this is only relevant when copying really large chunks of data; when you’re copying (or editing, or modifying) smaller files, then random read/write speeds are more relevant; in that department, the drive performs just as well as the rest.

My second NVMe drive is a 1TB Sabrient, and it’s rated at 3,500 MB/s; I use it as my “media” drive. Yes, I said I stored my Pictures on the primary drive with my user data. That’s correct. I store my Videos folder on here, and I import images onto this drive to be edited by Capture One or Lightroom. The exported files are then stored in my Pictures folder on the primary drive. I’ll explain why I’m doing this in just a second.

For archival storage, I built a custom NAS using “Open Media Vault” and some older PC hardware I had in storage. I’ve added drives to it over the past 18 months or so and it’s now up to 32TB of storage. Yes, 32 terabytes. The drives in the NAS are all 5400 RPM HDD’s; I create small RAID-0 arrays, then add the arrays to a logical volume. This way, if any one drive in a particular array fails, I’ll lose the data in that array, but not the entirety of the data across the entire logical volume. I primarily store media for a private Plex media server on here, but I also have network shares for everyone in the house to have what I’ve coined a “vault” — basically just personal network storage space for each person. There are also shared volumes for files, such as application installs and ISO images of operating systems; and applications, such as a Minecraft or Terraria server.

I use an 8TB external USB drive connected directly to the NAS to back up all of the vault shares, as well as the application and file shares. The media data is not backed up, but it is less of a priority for me.

I use my “vault” on the NAS to store backups, video files, and photos that are not accessed regularly. I have 20 years’ worth of photos and videos available on there. I store my in-camera JPG backups here.

Back on my PC, I have a subscription to Microsoft 365 so I have 1TB of OneDrive storage; it automatically synchronizes my user data (Documents, Pictures, Desktop, etc) to OneDrive. As previously mentioned, after I post process a session of photos, the exported results are placed in my Pictures folder. This is then automatically synchronized to OneDrive. This synchronization is also why I store the RAW files on the second drive: originally, the RAW files were also stored under Pictures; however, that meant they would synchronize with OneDrive as soon as they hit the drive, which would impact my total available space as well as the performance of my system. (OneDrive synchronization is generally rather benign; however, if it’s thrashing gigabytes of RAW files, it can become noticeable.) I could also exclude folders from synchronizing, but that required more micromanagement of the service than I cared to employ.

Inside my Pictures folder, I store the current year’s Finals folder as well as a copy of every prior year’s “photo albums” photos. Each year I try to select the best of the best, photos which best represent the events of that year. If you’ve read my article on Post Processing, these are the images with the purple color label. Ideally I’d have a physical photo book created for each year; in reality, I’ve done this 4 times in 20 years. The idea is that if something happens to me, I want my family to have easy and obvious access to the memories from each year without having to dig through tens of thousands of files in some obscure location on my system.

So in summary, this configuration:

Automatically synchronizes all of the images which are post processed and exported to cloud storage;
Backs up “photo album” images from past years to cloud storage as well;
Separates the images currently being processed onto a separate, high speed storage medium;
Stores archived files on a NAS device so they aren’t consuming faster, more expensive local storage;
Backs up the archived files to an external USB drive (which, by the way, is a scheduled rsync job that keeps multiple copies of the files in the backup).
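The “multiple copies” idea in that last item can be illustrated with a short Python sketch. To be clear, this is not the rsync job itself: it is a simplified stand-in that makes a full, dated copy on each run and prunes the oldest ones, whereas rsync can do the same far more efficiently. The function name, the keep count, and the example paths are all hypothetical.

```python
import shutil
from datetime import date
from pathlib import Path


def _is_snapshot(path: Path) -> bool:
    """True for directories named like YYYY-MM-DD."""
    try:
        date.fromisoformat(path.name)
    except ValueError:
        return False
    return path.is_dir()


def snapshot(source: Path, backup_root: Path, today: date, keep: int = 3) -> Path:
    """Copy source into backup_root/<YYYY-MM-DD>, then keep only the newest `keep` copies."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    target = backup_root / today.isoformat()
    shutil.copytree(source, target, dirs_exist_ok=True)
    # ISO-dated folder names sort chronologically, so the oldest come first.
    snapshots = sorted(p for p in backup_root.iterdir() if _is_snapshot(p))
    for old in snapshots[:-keep]:
        shutil.rmtree(old)
    return target


# Hypothetical usage, run once a day by a scheduler:
# snapshot(Path("/srv/vault"), Path("/mnt/usb-backup"), date.today())
```

Keeping more than one copy matters because a plain mirror will faithfully copy a mistake, such as an accidental deletion, over the only good backup.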

Is this system perfect? Far from it. I’m not backing up my archived files to cloud storage, so if the physical location were to be compromised before I could retrieve the backup drive, the archived data would be lost. I also haven’t fully completed photo album collections from every single year; there’s a span of about 6 years still remaining which are only on the archive. Additionally, if I happen to be working on a current session for an extended period of time (seldom happens, I usually start and finish in one sitting, but not always), and my “media” secondary drive were to fail, those files would be lost. I’d also lose all of the RAW images currently being kept for the calendar year. In both cases, though, the loss would be marginal; I’d likely only lose the 1 active session, and so long as the JPGs and TIFFs were exported, losing the RAW files for completed sessions would almost go unnoticed. Plus, I back up the in-camera JPG’s to the NAS, so if I did lose an active session, at worst I’d have the JPG’s to still work with.

(I will say that if I were working on a paying client’s images, I’d establish a process to keep a copy of the RAW files separate — perhaps on a USB stick or something — for an indeterminate amount of time.)

How Long To Keep RAW Files

I touch on this in the post processing article, but I do not keep my RAW files indefinitely. I keep the in-camera JPGs from the entire session; however, once I post process a session, I delete the RAW files that weren’t selected to be kept. After about 12 months, I’ll delete the remaining RAW files too. The JPGs and TIFF copies will suffice for images older than a year; it’s extremely unlikely I’ll go back and re-edit those files over a year after originally creating them.
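A retention rule like this is easy to check with a script. The sketch below is a dry run only: it lists RAW files in session folders older than the cutoff and deletes nothing. It assumes the year / “YYYY-MM-DD description” folder layout described earlier; the function name, the list of RAW extensions, and the 365-day default are assumptions to adjust for your own camera and habits.

```python
from datetime import date, timedelta
from pathlib import Path

# Assumed set of RAW extensions; adjust for your camera brand.
RAW_SUFFIXES = {".raf", ".cr2", ".cr3", ".nef", ".arw", ".dng"}


def expired_raws(media_root: Path, today: date, max_age_days: int = 365) -> list[Path]:
    """List RAW files in <year>/<YYYY-MM-DD ...> folders dated before the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    expired: list[Path] = []
    for session in sorted(media_root.glob("*/*")):
        if not session.is_dir():
            continue
        try:
            shot_on = date.fromisoformat(session.name[:10])
        except ValueError:
            continue  # folder name does not start with a date; leave it alone
        if shot_on < cutoff:
            expired.extend(
                sorted(
                    p for p in session.rglob("*")
                    if p.is_file() and p.suffix.lower() in RAW_SUFFIXES
                )
            )
    return expired
```

Reviewing the returned list before removing anything is the safer habit; the dates come from the folder names rather than file timestamps, which can change when files are copied between drives.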

This can be a rather dry topic, but it’s one that is critical to ensure your data is protected while also providing maximum performance for your workflow. I hope this is informative. If you have any questions or suggestions, feel free to shoot me an email.

Tags: files, file management, administration, nas, raid, jpg, raw, windows, linux, macOS

Charlie Digiglio