Processing scanned images

In 1911 tourist Maud Moreland wrote a travelogue, Through South Westland, lavishly illustrated with photographs. As part of converting this to an ebook, I needed to download some of these scanned images and clean them up, ready to upload to Wikimedia Commons. Here’s my workflow; I’m sure there are better ways of doing this, if you’re a whiz at image editing, but this works for me.

When working with a book that’s been scanned by Google Books or the Internet Archive, there will be a folder among all the download options containing all the page scans, including all the photos, in JPEG-2000 format (.jp2). I generally download the entire folder as a ZIP file and pull out the pages I need. Remember to save a copy of the originals.

I want to use a more meaningful filename schema:
Through South Westland (1916) · Moreland · 344.jpg
so the only part of the filename that changes is the page number, which makes the name less descriptive but the images much easier to place into Wikisource.

On the Mac I select all the files and use File > Rename… to replace the Internet Archive text with my filename, removing the first zero from the page number. We’ll change the files from .jp2 to .jpg when we export them.
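If you'd rather script the rename than use the Finder, here's a minimal Python sketch of the same idea. It assumes the downloaded files end in a four-digit page number, something like scan_0344.jp2; that pattern, the PREFIX value, and the function names are all my own placeholders, so check them against your actual download first.

```python
import re
from pathlib import Path

# Hypothetical prefix for illustration; substitute your own schema.
PREFIX = "Through South Westland (1916) · Moreland · "


def new_name(old_name: str, prefix: str = PREFIX) -> str:
    """Map e.g. 'scan_0344.jp2' to '<prefix>344.jp2'."""
    path = Path(old_name)
    match = re.search(r"(\d+)$", path.stem)
    if match is None:
        raise ValueError(f"no page number at the end of {old_name!r}")
    digits = match.group(1)
    # Drop one leading zero from a four-digit page number (0344 -> 344, 0012 -> 012).
    if len(digits) == 4 and digits.startswith("0"):
        digits = digits[1:]
    # Keep the original extension; .jp2 becomes .jpg later, at export time.
    return f"{prefix}{digits}{path.suffix}"


def rename_scans(folder: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """Rename every .jp2 file in folder; with dry_run=True only report the plan."""
    plan = []
    for path in sorted(Path(folder).glob("*.jp2")):
        target = path.with_name(new_name(path.name))
        plan.append((path.name, target.name))
        if not dry_run:
            path.rename(target)
    return plan
```

Run rename_scans on a copy of the folder with dry_run=True first and read through the plan before letting it touch any files.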

I use Affinity Photo to edit images. I could do a fair bit just in the Preview app that comes with my Mac, but Affinity Photo is a bit more powerful, while costing nowhere near as much as Photoshop.

Make the image fill the screen (Command-0). Rotate it 90 degrees if needed, and crop it down to the photo edge, rotating it manually a little if it’s not straight. Crop out all the text: photo captions are added in Wikisource.

Google likes to add a sepia background to its scans, but this makes the images into RGB files with three 8-bit colour channels. By converting from RGB to greyscale, we cut the file size by two thirds and lose no actual information. You can do this at Document > Convert Format > Grey/8 (which converts the colours to 256 shades of grey, fine for our purposes).
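The two-thirds figure is just channel arithmetic, which this small Python sketch spells out. The 2000 × 3000 pixel page size is a made-up example, and the numbers are for uncompressed pixel data; a compressed file on disk won't shrink by exactly this ratio.

```python
def raw_size_bytes(width: int, height: int, channels: int, bits_per_channel: int = 8) -> int:
    """Uncompressed size of an image's pixel data in bytes."""
    return width * height * channels * bits_per_channel // 8


# Hypothetical page scan of 2000 x 3000 pixels.
rgb = raw_size_bytes(2000, 3000, channels=3)   # three 8-bit colour channels
grey = raw_size_bytes(2000, 3000, channels=1)  # one 8-bit grey channel
saving = 1 - grey / rgb                        # fraction of the data removed
```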

Some of the scans are quite murky, so I adjust the levels in the Adjustment palette > Levels > (Default). You can see that this image doesn’t occupy the full range of tones available, so drag the black and white points in to the edges of the graph to make the darkest pixels black and the lightest ones white. Then move the gamma slider left to brighten the photo a little by bringing up the shadows. Merge the results.
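If you're curious what a levels adjustment is doing to the numbers, here's a rough Python sketch on a handful of 8-bit grey values. This is the general technique, not Affinity Photo's actual maths: the function name is mine, and I'm assuming the convention that a gamma above 1 brightens the midtones and shadows.

```python
def apply_levels(pixels, black, white, gamma=1.0):
    """Stretch 8-bit grey values so `black` maps to 0 and `white` to 255, then apply gamma."""
    if not 0 <= black < white <= 255:
        raise ValueError("need 0 <= black < white <= 255")
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    out = []
    for value in pixels:
        # Clip to the chosen black and white points, then normalise to 0..1.
        t = (min(max(value, black), white) - black) / (white - black)
        # Assumed convention: gamma > 1 lifts shadows and midtones, gamma < 1 darkens them.
        t = t ** (1.0 / gamma)
        out.append(round(t * 255))
    return out
```

A murky scan whose tones only run from 30 to 220 gets stretched to the full 0 to 255 range by apply_levels(pixels, 30, 220), and raising gamma a little brightens everything in between without moving the black and white ends.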

Apply some sharpening with Unsharp Mask, if there are enough hard edges in the photo that might benefit from it. Affinity Photo lets you set a Before/After slider, so you can adjust the intensity and see a preview.

Export it as a JPEG with File > Export. Choosing a High Quality setting won’t harm the image, and more than halves the file size (the full-quality Affinity file is 1.88 MB). Save it into a new folder specifically for fixed images, and leave the original file unchanged.

There are ways of automating most of this workflow: you can run batch jobs in Affinity Photo to convert photos to JPEGs, and record a macro that converts an image to black and white and runs a bit of Unsharp Mask. You’ll still need to crop and tweak your photos by hand though – no substitute for that.

I hope seeing this basic image processing workflow has been helpful, and you picked up something useful. At some point I may cover the next step: a bulk upload to Wikimedia Commons.

Naming photos

Your camera is great, but you shouldn’t let it decide what to call your photos. 

People often send me photographs, either as email attachments or through a site like WeTransfer. To use them in Wikipedia I need to know the photographer, copyright holder (often not the same person), date, and thing or event depicted. Too often though that information’s hidden deep in the metadata, or worn away by repeated handling, and the photo’s named something unhelpful like “DSC5553”.

We’re going to try to give this random assortment of files sensible names.

This blog post won’t help you organise your photos by date, topic, project, or location – you should be using photo management software or digital asset management tools for this already. I just want to help you come up with a meaningful filename. You’ll always have to share photos with media, collaborators, or future-you, so work out a schema for naming those files and put it on a Post It near your computer. Here are some tips to help you develop one.

Some characters are illegal in filenames, or at least risky. Backslash (\), forward slash (/), colon (:), comma (,), square brackets ([ ]), asterisk (*), question mark (?), double quote ("), greater and less than (< >) and the pipe (|) are all going to cause problems.
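If you're renaming files with a script, it's easy to check for these characters automatically. Here's a small Python sketch; the function names are my own, and the character set is just the list above, not any operating system's official rules.

```python
# The characters listed above as likely to cause problems.
PROBLEM_CHARS = set('\\/:,[]*?"<>|')


def find_problem_chars(filename: str) -> list[str]:
    """Return the problem characters present in filename, sorted and without repeats."""
    return sorted({char for char in filename if char in PROBLEM_CHARS})


def clean_filename(filename: str, replacement: str = "") -> str:
    """Replace each problem character with `replacement` (removed by default)."""
    return "".join(replacement if char in PROBLEM_CHARS else char for char in filename)
```

For example, clean_filename('Adams, Lynn: kea?.jpg') gives 'Adams Lynn kea.jpg'.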

Try to use separators rather than blank spaces; spaces in filenames used to be troublesome for computers, and still sometimes cause problems. Underscores (_) or hyphens (-) are easy separators; underscores are invisible when the text is underlined or part of a link, so they aren’t as good as a hyphen. I often use bullets • (option-8 on a Mac, Alt 0149 on Windows) and middle dots · (option-shift-9 on a Mac, Alt 0183 on Windows) as my separators.

Project code. When I was working as a graphic designer, every project in the company had a unique four letter code, and these four letters were used on every document, photo, spreadsheet, and graphic in that project. You could also use this when assembling photos for an ad or blog post. Put the project code first, because you usually want to sort by project name.

Project codes here are WCSO for West Coast Stories Online and WDL for Westland District Library. The number 242780 is a collection ID number.

Date. If you’re not dealing with several projects at once, just keep photos in a project folder and begin the name with the date in YYYY-MM-DD format. You might think you don’t need to add a date to the filename, and can just sort photos by their metadata, but date information can get stripped away or changed through repeated sharing, duplicating, and tweaking. And of course metadata for scanned photos will have the date the scan was made, not the date the photo was actually taken. When using YYYY-MM-DD you’ll need to pad single-digit days and months with a leading zero (so March is 03). If you don’t know the month or day a photo was taken, use xx in its place.

This effort is worth it, because you’ll be able to sort all the photos within a project by date just by alphabetising the filename.

Three of the filenames now have year-month-day dates applied.

Description. What’s depicted, in the fewest words? People’s names should be written lastname firstname to help with sorting (and remember you can’t use commas in filenames).

Just assume the photo is going to be emailed from a random stranger to another random stranger at some point in the future with no explanation or context, and name accordingly.

Your name, or your company name, or the initials of whoever took the photo, just as a return-to-sender tag. This is not necessarily the same as whoever owns the copyright on the photograph, so clearly note the copyright holder, with a copyright symbol © (option-g on a Mac, Alt 0169 on Windows), especially if they’re not you or the photographer.

Lynn Adams is the photographer, but DOC as her employer owns the copyright.

If the image is available under a Creative Commons licence, you may want to include that as well. These licences can be encoded quite concisely: “Creative Commons Attribution Share-Alike 4.0” is usually represented as CC BY-SA 4.0, and you could use CCBYSA4.

Numbering. If you have a batch of photos of exactly the same thing in the same time and place, with the same author and licence, distinguish them with a number at the end. Start with 001, padding out the empty places with zeroes.

Edited versions of a photo should always be renamed. Never change the original photo: back up the originals, then save a copy first or use “Save As…” while you’re editing, and add something to the filename. v2, v3, cropped, and web are good suffixes. You can use “Original” as the suffix to make sure you don’t change it. Using “Final” for the final version is not helpful, as it never turns out to be the final version. If you’re just resizing the photo, add pixel dimensions to the end (yes, they’re visible in the file information, but not when you’re skimming a list of email attachments).
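Putting the pieces together, here's one way the whole schema could be assembled in Python. Treat it as a sketch: the function name, the parameter names, and the sample values in the usage note are all invented for illustration, and the default separator is the spaced middle dot I use in my own filenames.

```python
def build_filename(project, date, description, credit,
                   number=None, suffix=None, ext="jpg", sep=" · "):
    """Join the parts of the schema in order: project, date, description, credit, number, suffix."""
    parts = [project, date, description, credit]
    if number is not None:
        # Pad the sequence number to three places: 1 -> 001.
        parts.append(f"{number:03d}")
    if suffix:
        # Version or edit marker such as 'v2', 'cropped', or 'web'.
        parts.append(suffix)
    # Skip any part left empty, e.g. no project code for a single-project folder.
    return sep.join(part for part in parts if part) + "." + ext
```

Calling build_filename("WCSO", "2019-03-07", "Adams Lynn at Okarito", "© DOC", number=1, suffix="cropped") gives "WCSO · 2019-03-07 · Adams Lynn at Okarito · © DOC · 001 · cropped.jpg", with the project and date at the front and the number and version at the end.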

All that makes for a loooooong filename. But don’t worry too much about filename length: it can be 255 characters – more than an entire tweet. Most operating systems display filenames with ellipses in the middle; this schema puts the most critical information (like project, date, number, and version) at either end. Some systems have limits to the total path length (the list of all directories and subdirectories the file’s inside, plus filename): for OneDrive that’s only 400 characters, and before Windows 10 it was just 260, but even if you use 100-character filenames you’ve got some breathing room.

None of the filenames we created were even 100 characters long.
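If you want to check a long filename against those limits before sharing it, a rough Python sketch might look like this. The function name and the default limits (255 for the name, 260 for the whole path) are taken from the figures above; note that it counts characters, whereas some systems count bytes, so accented letters and symbols may use up the allowance faster than this suggests.

```python
def length_report(folder: str, filename: str, max_name: int = 255, max_path: int = 260) -> dict:
    """Compare a filename and its full path against assumed length limits."""
    # Join folder and filename with a single slash, whatever the folder ends with.
    full_path = folder.rstrip("/\\") + "/" + filename
    return {
        "name_length": len(filename),
        "path_length": len(full_path),
        "name_ok": len(filename) <= max_name,
        "path_ok": len(full_path) <= max_path,
    }
```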

A sensible filename schema is part of being a good digital archivist – and if you’ve ever taken a photo on your camera you want to keep, congratulations: you too are a digital archivist. Name your files accordingly. For more tips on all the other aspects of looking after digital collections, see the National Library of New Zealand’s Caring for Taonga guide.