I’m working on a project to back up my family photos from TrueNAS to Blu-Ray disks. I have other, more traditional backups based on restic and zfs send/receive, but I don’t like the fact that I could delete every copy using only the mouse and keyboard from my main PC. I want something that can’t be ransomwared and that I can’t screw up once created.
The dataset is currently about 2TB, and we’re adding about 200GB per year. It’s a lot of disks, but manageably so. I’ve purchased good quality 50GB blank disks and a burner, as well as a nice box and some silica gel packs to keep them cool, dark, dry, and generally protected. I’ll be making one big initial backup, and then I’ll run incremental backups ~monthly to capture new photos and edits to existing ones, at which time I’ll also spot-check a disk or two for read errors using DVDisaster. I’m hoping to get 10 years out of this arrangement, though longer is of course better.
I’ve got most of the pieces worked out, but the last big question I need to answer is which software I will actually use to create the archive files. I’ve narrowed it down to two options: dar and bog-standard GNU tar. Both can create multipart, incremental backups, which is the core capability I need.
Dar Advantages (that I care about):
- This is exactly what it’s designed to do.
- It can detect and tolerate data corruption. (I’ll be adding ECC data to the disks using DVDisaster, but defense in depth is nice.)
- More robust file change detection, it appears to be hash based?
- It allows me to create a database I can use to locate and restore individual files without searching through many disks.
Dar disadvantages:
- It appears to be a pretty obscure, generally inactive project. The documentation looks straight out of the early 2000s and it doesn’t have https. I worry it will go offline, or I’ll run into some weird bug that ruins the show.
- Doesn’t detect renames. Will back up a whole new copy. (Problematic if I get to reorganizing)
- I can’t find a maintained GUI project for it, and my wife ain’t about to learn a CLI. Would be nice if I’m not the only person in the world who could get photos off of these disks.
Tar Advantages (that I care about):
- battle-tested, reliable, not going anywhere
- It’s already installed on every single Linux & Mac PC, and it’s trivial to put on a Windows PC.
- Correctly detects renames, does not create new copies.
- There are maintained GUIs available; non-nerds may be able to access
Tar disadvantages:
- I don’t see an easy way to locate individual files, beyond grepping through `snar` metadata files (that aren’t really meant for that).
- The file change detection logic makes me nervous - it appears to be based on modification time and inode numbers. The photos are in a ZFS dataset on truenas, mounted on my local machine via SMB. I don’t even know what an inode number is, how can I be sure that they won’t change somehow? Am I stuck with this exact NAS setup until I’m ready to make a whole new base backup? This many blu-rays aren’t cheap and burning them will take awhile, I don’t want to do it unnecessarily.
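For anyone else unsure what an inode number is: it is the ID the filesystem assigns to a file, independent of its name. Below is a minimal Python sketch (standard library only) that prints the fields the post is worried about. The helper name `change_key` is made up for illustration. Whether these values stay stable when the dataset is reached over SMB depends on the server and client setup, and this sketch does not settle that; it only gives a way to look.

```python
import os
import tempfile


def change_key(path):
    """Return (device, inode, mtime in ns, size) as reported by the OS for path."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino, st.st_mtime_ns, st.st_size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        old = os.path.join(d, "a.jpg")
        new = os.path.join(d, "b.jpg")
        with open(old, "wb") as f:
            f.write(b"photo")
        before = change_key(old)
        os.rename(old, new)  # rename within the same local filesystem
        after = change_key(new)
        print("before:", before)
        print("after: ", after)
        print("same key after rename:", before == after)
```

Running the same function against a file on the SMB mount before and after a remount or NAS reboot would show whether the numbers move in this particular setup.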
I’m genuinely conflicted, but I’m leaning towards dar. Does anyone else have any experience with this sort of thing? Is there another option I’m missing? Any input is greatly appreciated!
You can’t really easily locate where the last version of the file is located on an append-only media without writing the index in a footer somewhere, and even then if you’re trying to pull an older version you’d still need to traverse the whole media.
That said, you use ZFS, so you can literally just `zfs send` it. ZFS will already know everything that needs to be known, so it’ll be a perfect incremental. But you’d definitely need to restore the entire dataset to pull anything out of it, reapply every incremental one by one, and if just one is unreadable the whole pool is unrecoverable, but so would the tar incrementals. But it’ll be as perfect and efficient as possible, as ZFS knows the exact change set it needs to bundle up. It’s unidirectional, so that’s why you can just `zfs send` into a file and burn it to a CD.
Since ZFS can easily tell you the difference between two snapshots, it also wouldn’t be too hard to make a Python script that writes the full new version of changed files and catalogs what file and what version is on which disc, for a more random access pattern.
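A rough sketch of the first half of such a script, assuming the change list comes from `zfs diff -H` (tab-separated lines: change type, path, and a second path for renames). The function name and the sample paths are hypothetical, and the parser works on text so it can be tried without a pool. One caveat I'm fairly sure of: `zfs diff` escapes some characters in paths, so real output may need unescaping before the paths are usable.

```python
from collections import defaultdict


def parse_zfs_diff(text):
    """Group paths from `zfs diff -H` style output by change type.

    Change types: '-' removed, '+' added, 'M' modified, 'R' renamed.
    Renames are stored as (old_path, new_path) tuples.
    """
    changes = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        kind = fields[0]
        if kind == "R" and len(fields) >= 3:
            changes["R"].append((fields[1], fields[2]))
        else:
            changes[kind].append(fields[1])
    return dict(changes)


if __name__ == "__main__":
    # Hypothetical sample; real text would come from something like
    #   zfs diff -H pool/photos@last-burn pool/photos@today
    sample = (
        "M\t/mnt/photos/2023\n"
        "+\t/mnt/photos/2023/img1.jpg\n"
        "R\t/mnt/photos/old.jpg\t/mnt/photos/new.jpg\n"
        "-\t/mnt/photos/trash.jpg\n"
    )
    for kind, paths in parse_zfs_diff(sample).items():
        print(kind, paths)
```

The `+` and `M` entries are the candidates to write to the next disc; `M` also shows up for directories, so a real script would filter those out. The `R` entries could be recorded in the catalog without burning a second copy.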
But really for Blurays I think I’d just do it the old fashioned way and classify it to fit on a disc and label it with what’s on it, and if I update it make a v2 of it on the next disc.
Ohhh boy, after so many people are suggesting I do simple files directly on the disks I went back and rethought some things. I think I’m landing on a solution that does everything and doesn’t require me to manually manage all these files:
- `fd` (and any number of other programs) can produce lists of files that have been modified since a given date.
- `xorrisofs` can accept lists of files to add to an iso
So if I `fd` a list of new files (or don’t for the first backup), pipe them into `fpart` to chunk them up, and then pass these lists into `xorrisofs` to create ISOs, I’ve solved almost every problem.
Downsides:
- If I `rsync -a` some files into the dataset, which have mtimes older than the last backup, they won’t get slurped up in the next one. Can be solved by checking that all files are already in the existing fpart indices, or by just not doing that.
Honestly those downsides look quite tolerable given the benefits. Is there some software that will produce and track a checksum database?
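To make the "list new files, then chunk them into disc-sized lists" idea concrete, here is a minimal Python stand-in for the two steps. It is not how `fd` or `fpart` work internally; the function names and the simple fill-in-order packing are my own assumptions. The capacity argument is whatever usable size is left on a disc after filesystem overhead and ECC data, which has to be measured rather than taken from this sketch.

```python
import os
import stat


def files_modified_since(root, cutoff_ts):
    """Yield (path, size) for regular files under root with mtime newer than cutoff_ts."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, and other non-regular files
            if st.st_mtime > cutoff_ts:
                yield path, st.st_size


def pack_into_discs(files, capacity):
    """Fill discs in the order files arrive; start a new disc when the next file won't fit."""
    discs, current, used = [], [], 0
    for path, size in files:
        if size > capacity:
            raise ValueError(f"{path} is larger than one disc")
        if current and used + size > capacity:
            discs.append(current)
            current, used = [], 0
        current.append(path)
        used += size
    if current:
        discs.append(current)
    return discs
```

Each inner list could be written to a text file and handed to the ISO tool. Passing a cutoff of `0` treats every file as new, which covers the first full backup. The mtime-based selection has exactly the `rsync -a` blind spot described in the downsides.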
Off to do some testing to make sure these things work like I think they do!
your first two points can be mitigated by using checksums. trivial to name the file after its checksum, but ugly. save checksums separately? save checksums in file metadata (EXIF)? this can be a bit tricky 🤣 I believe zfs already has the checksum, so the job would be to just compare lists.
restoring is as easy, creation gets more complicated and thus prone to errors
I’ve been thinking through how I’d write this. With so many files it’s probably worth using sqlite, and then I can match them up by joining on the hash. Deletions and new files can be found with different join conditions. I found a tool called ‘hashdeep’ that can checksum everything, though for incremental runs I’ll probably skip hashing if the size, times, and filename haven’t changed. I’m thinking nushell for the plumbing? It runs everywhere, though they have breaking changes frequently. Maybe rust?
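A small sketch of the join idea, using Python's built-in `sqlite3` with an in-memory database. The table and function names are invented for illustration, and the hashes are assumed to come from some earlier scan (hashdeep or otherwise) as `(path, hash)` pairs. Duplicate files with identical content would produce extra rows in the rename query, which this sketch does not try to resolve.

```python
import sqlite3


def compare_scans(old_rows, new_rows):
    """Compare two scans of (path, hash) pairs; return new, deleted, and renamed entries."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE old_scan (path TEXT PRIMARY KEY, hash TEXT NOT NULL);
        CREATE TABLE new_scan (path TEXT PRIMARY KEY, hash TEXT NOT NULL);
    """)
    db.executemany("INSERT INTO old_scan VALUES (?, ?)", old_rows)
    db.executemany("INSERT INTO new_scan VALUES (?, ?)", new_rows)

    # Content whose hash was never seen before: new files and edited files.
    new = db.execute("""
        SELECT n.path FROM new_scan n
        LEFT JOIN old_scan o ON o.hash = n.hash
        WHERE o.hash IS NULL ORDER BY n.path
    """).fetchall()

    # Old content that no longer exists under any path (deleted, or replaced by an edit).
    deleted = db.execute("""
        SELECT o.path FROM old_scan o
        LEFT JOIN new_scan n ON n.hash = o.hash
        WHERE n.hash IS NULL ORDER BY o.path
    """).fetchall()

    # Same content, old path gone, new path did not exist before: treat as a rename.
    renamed = db.execute("""
        SELECT o.path, n.path FROM old_scan o
        JOIN new_scan n ON n.hash = o.hash
        WHERE o.path <> n.path
          AND NOT EXISTS (SELECT 1 FROM new_scan x WHERE x.path = o.path)
          AND NOT EXISTS (SELECT 1 FROM old_scan y WHERE y.path = n.path)
        ORDER BY o.path
    """).fetchall()
    db.close()
    return {
        "new": [r[0] for r in new],
        "deleted": [r[0] for r in deleted],
        "renamed": [tuple(r) for r in renamed],
    }
```

Only the `new` list needs disc space; `renamed` can be recorded in the catalog so a reorganized library does not trigger a fresh copy of every moved photo.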
ZFS checksums are done at the block level, and after compression and encryption. I don’t think they’re meant for this purpose.
never heard of nushell, but sounds interesting… but it’s not default anywhere yet. I’d go for bash, perl or maybe python? your comments on zfs make a lot of sense, and invalidate my respective thoughts :D
I only looked how zfs tracks checksums because of your suggestion! Hashing 2TB will take a minute, would be nice to avoid.
Nushell is neat, I’m using it as my login shell. Good for this kind of data-wrangling but also a pre-1.0 moving target.
Woah, that’s cool! I didn’t know you could just `zfs send` anywhere. I suppose I’d have to split it up manually with `split` or something to get 50gb chunks?
Dar has `dar_manager` which you can use to create a database of snapshots and slices that you can use to locate individual files, but honestly if I’m using this backup it’ll almost certainly be a full restore after some cataclysm. If I just want a few files I’ll use one of my other, always-online backups.
Edit: Clicked save before I was finished
I’m more concerned with robustness than efficiency. Dar will warn you about corruption, which should only affect that particular file and not the whole archive. Tar will allow you to read past errors so the whole archive won’t be ruined, but I’m not sure how bad the effects would be. I’m really not a fan of a solution that needs every part of every disk to be read perfectly.
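For reference, cutting a stream into disc-sized pieces is mechanically simple. The sketch below does it in Python for any readable binary stream; the function name, file naming scheme, and buffer size are my own choices, not anything from ZFS or `split`. It also shows why this route is fragile: the pieces only mean something when every one of them is read back and concatenated in order.

```python
import os


def split_stream(stream, out_dir, prefix, chunk_size, buf_size=1 << 20):
    """Write stream into numbered files of at most chunk_size bytes; return their paths."""
    paths = []
    out, written = None, 0
    try:
        while True:
            room = chunk_size - written if out is not None else chunk_size
            data = stream.read(min(buf_size, room))
            if not data:
                break
            if out is None:
                # Start the next piece, e.g. photos.0000, photos.0001, ...
                path = os.path.join(out_dir, f"{prefix}.{len(paths):04d}")
                out = open(path, "wb")
                paths.append(path)
                written = 0
            out.write(data)
            written += len(data)
            if written >= chunk_size:
                out.close()
                out = None
    finally:
        if out is not None:
            out.close()
    return paths
```

The stream could be the stdout pipe of a `zfs send` process started with `subprocess.Popen`, with a snapshot name of your own. A single unreadable piece breaks the concatenated stream from that point on, which is the property being objected to above.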
I could chunk them up manually, but we’re talking about 2TB of lumpy data, spread across hundreds of thousands of files. I’ll definitely need some sort of tooling to track changes, I’m not doing that manually and I bounce around the photo library changing metadata all the time.