Processing scanned images

In 1911 tourist Maud Moreland wrote a travelogue, Through South Westland, lavishly illustrated with photographs. As part of converting this to an ebook, I needed to download some of these scanned images and clean them up, ready to upload to Wikimedia Commons. Here’s my workflow; I’m sure there are better ways of doing this, if you’re a whiz at image editing, but this works for me.

When working with a book that’s been scanned by Google Books or the Internet Archive, there will be a folder among all the download options containing all the page scans, including all the photos, in JPEG-2000 format (.jp2). I generally download the entire folder as a ZIP file and pull out the pages I need. Remember to save a copy of the originals.

I want to use a more meaningful filename schema:
Through South Westland (1916) · Moreland · 344.jpg
so the only part of the filename that changes is the page number, which makes the name less descriptive but the images much easier to place into Wikisource.

On the Mac I select all the files and use File > Rename… to replace the Internet Archive text with my filename, removing the first zero from the page number. We’ll change the files from .jp2 to .jpg when we export them.
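
If you'd rather script the rename than use the Finder, here is a minimal Python sketch of the same idea. The Internet Archive filename pattern (`throughsouthwest00more_0344.jp2`) and the function name `new_name` are assumptions for illustration; check what your downloaded files are actually called before adapting it.

```python
import re

# Hypothetical example of an Internet Archive scan name; real names vary by book.
EXAMPLE_SCAN = "throughsouthwest00more_0344.jp2"

def new_name(old_name, prefix="Through South Westland (1916) · Moreland · "):
    """Build the new filename from the four-digit page number in the old one."""
    match = re.search(r"_(\d{4})\.jp2$", old_name)
    if match is None:
        raise ValueError(f"no four-digit page number in {old_name!r}")
    digits = match.group(1)
    # Drop only the first zero, so 0344 becomes 344 and 0044 becomes 044.
    if digits.startswith("0"):
        digits = digits[1:]
    # Keep the .jp2 extension for now; it changes to .jpg on export.
    return f"{prefix}{digits}.jp2"

print(new_name(EXAMPLE_SCAN))
```

To apply it to real files you could loop over `Path(folder).glob("*.jp2")` from `pathlib` and call `path.rename(path.with_name(new_name(path.name)))`, working on a copy of the folder rather than the originals.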

I use Affinity Photo to edit images. I could do a fair bit just in the Preview app that comes with my Mac, but Affinity Photo is a bit more powerful, while costing nowhere near as much as Photoshop.

Make the image fill the screen (Command-0). Rotate it 90 degrees if needed, and crop it down to the photo edge, rotating it manually a little if it’s not straight. Crop out all the text: photo captions are added in Wikisource.

Google likes to add a sepia background to its scans, but this makes the images into RGB files with three 8-bit colour channels. By converting from RGB to greyscale, we cut the file size by two thirds and lose no actual information. You can do this at Document > Convert Format > Grey/8 (which converts the colours to 256 shades of grey, fine for our purposes).
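
The "two thirds" figure is simple arithmetic on the uncompressed pixel data, sketched below. The page dimensions are made up, and the saving in a compressed JPEG will not be exactly the same.

```python
def raw_size_bytes(width, height, channels, bits_per_channel=8):
    """Uncompressed size of an image: one value per channel per pixel."""
    return width * height * channels * bits_per_channel // 8

# Hypothetical page scan dimensions, purely for illustration.
width, height = 2000, 3000

rgb = raw_size_bytes(width, height, channels=3)   # RGB/8: three 8-bit channels
grey = raw_size_bytes(width, height, channels=1)  # Grey/8: one 8-bit channel

saving = 1 - grey / rgb
print(rgb, grey, round(saving, 3))
```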

Some of the scans are quite murky, so I adjust the levels in the Adjustment palette > Levels > (Default). You can see that this image doesn’t occupy the full range of tones available, so drag in the black and white points to the edge of the graph to make the darkest pixels black and the lightest ones white. Then move the gamma slider left to brighten the photo a little by bringing up the shadows. Merge the results.
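
For the curious, a levels adjustment boils down to a small formula applied to every pixel. This is a sketch of the textbook version, not a description of what Affinity Photo does internally; the function name and the sample values (a murky scan whose tones run from 40 to 210) are assumptions.

```python
def apply_levels(value, black=0, white=255, gamma=1.0):
    """Map one 8-bit grey value through a black point, white point and gamma."""
    if white <= black:
        raise ValueError("white point must be above black point")
    # Clip to the chosen range, then stretch that range to 0..1.
    clipped = min(max(value, black), white)
    normalised = (clipped - black) / (white - black)
    # In this sketch a gamma above 1 lifts the midtones and shadows.
    corrected = normalised ** (1.0 / gamma)
    return round(corrected * 255)

# Hypothetical murky scan: darkest pixel 40, lightest pixel 210.
darkest = apply_levels(40, black=40, white=210)
lightest = apply_levels(210, black=40, white=210)
midtone_plain = apply_levels(125, black=40, white=210)
midtone_lifted = apply_levels(125, black=40, white=210, gamma=1.2)
print(darkest, lightest, midtone_plain, midtone_lifted)
```

Dragging in the black and white points stretches the tones to the full 0 to 255 range, and the gamma step then brightens the middle without moving either end.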

Apply some sharpening with Unsharp Mask, if there are enough hard edges in the photo that might benefit from it. Affinity Photo lets you set a Before/After slider, so you can adjust the intensity and see a preview.

Export the image as a JPEG (File > Export). Choosing a High Quality setting won’t harm the image, and more than halves the file size (the full-quality Affinity file is 1.88 MB). Save it into a new folder specifically for fixed images, and leave the original file unchanged.

There are ways of automating most of this workflow: you can run batch jobs in Affinity Photo to convert photos to JPEGs, and record a macro that converts an image to black and white and runs a bit of Unsharp Mask. You’ll still need to crop and tweak your photos by hand though – no substitute for that.

I hope seeing this basic image processing workflow has been helpful, and you picked up something useful. At some point I may cover the next step: a bulk upload to Wikimedia Commons.

Wikisource cheat sheet

There are common bits of formatting you have to do in Wikisource; this is a list of ones beginners will likely come up against. Most take the form of templates, inside {{double curly brackets}}. You can nest one inside the other but do keep count of the number of brackets because if you leave one off weird things will happen.
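
Keeping count of brackets is easy to get wrong by eye, so here is a rough Python sketch of a checker you could run over a page of wikitext before saving. It only counts `{{` and `}}` pairs, and it is a simplification: it knows nothing about triple-brace parameters or `<nowiki>` sections.

```python
def check_braces(wikitext):
    """Return how many {{ are still open at the end (0 means balanced)."""
    depth = 0
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            if depth == 0:
                raise ValueError(f"closing brackets at position {i} have no opener")
            depth -= 1
            i += 2
        else:
            i += 1
    return depth

print(check_braces("{{c|{{larger|Title}}}}"))  # balanced
print(check_braces("{{c|{{larger|Title}}"))    # one template left open
```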

Wikisource measures everything in ems; an em is a unit equal to the point size of the font, about the width of a capital M.

Paragraph formatting and spacing

Effect: Template (Notes)
Line break: <br /> (Don’t do this to make a break between paragraphs; just press Return twice.)
Space: {{em}} (A one-em space between words.)
Bigger space: {{gap}} (A bigger space. Don’t worry about trying to replicate the spacing in a Wikisource scan, though; it’s not the point.)
Paragraph space: {{dhr}} or {{dhr|3em}} (A double-height row, and you can specify the gap. Again, no need to replicate white space in an e-book.)
Right align: {{right|sampletext}} or {{right|sampletext|2em}} (You can specify the distance from the right margin.)
Centre: {{c|sampletext}}

Typography

H⸻H: {{longdash}} (Enclose this in a {{nowrap| }} to stop it breaking across lines.)
Larger: {{larger|Text}} or {{x-larger|Text}} (Goes up to {{xxxx-larger| }}.)
Smaller: {{smaller|Text}} (Down to {{xxxx-smaller| }}.)
Small block: {{smaller block/s}} and {{smaller block/e}} (Put these templates before and after the block of text you want to make smaller.)
Smaller, tighter text: {{fine block/s}} and {{fine block/e}} (As above. Good for quoted text.)
Small caps: {{sc|Heading}} (Note that only lower case letters will be converted to small capitals.)
Hung punctuation: {{fqm|“}} (This floats an initial quotation mark into the left margin, which is good typography.)
Gothic/blackletter font: {{blackletter|Das Text}} (Fraktur fancy text.)
Size, colour etc: {{font-size|130%|style=color:#555051|Text}} (Makes larger grey text.)

Ornamental Rules

Rule: {{rule}} or {{rule|10em}} (Makes a basic black line. You can tweak the length.)
Wiggly line: {{custom rule|w|40|w|40|w|40}} (Three 40-em wiggles, so 120 ems total.)
Row of bullets: {{***|5|4em|char=•}} (Or any line of spaced characters across the page: see help.)
Fancy rule: {{custom rule|sp|40|fy1|40|sp|40}} (See the {{custom rule}} help for how to generate fancy-pants rules.)

Headers

Basic running head: {{rh||BIRTH OF WESTLAND.|4|}} (The pipes separate the left-aligned, centred, and right-aligned parts.)
Left/right running heads: {{rvh|56|Chapter Title|Book Title}} (Puts the page number and title or chapter in the correct spots: see help.)

Words

Typo: {{SIC|typed word|presumed word}} (Use this for actual typos, not for former spellings.)
Start of hyphenated word (ad-): {{hws|ad|admirable}} (Only used if the last word on the page is hyphenated…)
End of hyphenated word (-mirable): {{hwe|mirable|admirable}} (…so the word joins up properly across page breaks.)
Paragraph end coincides with page end: {{nop}} (Use when a paragraph ends at the bottom of the page, to stop it being joined to the next page’s text.)
Reference in text: <ref>Op. cit. p.11</ref> (You need to put {{smallrefs}} in the footer as well.)
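
The two hyphenated-word templates always come as a matched pair, so they are easy to generate together. A small sketch, with a made-up helper name:

```python
def split_word_templates(start, end):
    """Return the {{hws}} and {{hwe}} templates for a word split across two pages."""
    whole = start + end
    hws = "{{hws|" + start + "|" + whole + "}}"  # goes at the foot of the first page
    hwe = "{{hwe|" + end + "|" + whole + "}}"    # goes at the top of the next page
    return hws, hwe

print(split_word_templates("ad", "mirable"))
```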

Poetry

{{block center|
{{smaller|
{{fqm}}And forth there stretched a silent land—<br />
For distance robbed mine ear of sound—<br />
League after league from the near strand<br />
To giant peaks that, band by band,<br />
Marched past the vision’s outmost bound.”
}}
}}

(There are actual <poem></poem> tags, but Wikisource people prefer this method. The <poem> tags work well for formatting a list of people or ingredients, though.)

Tip: apply typographic formatting first, then enclose it in positioning/spacing formatting, like the poem above.

Releasing a book copyright

One of the powerful things an institution can do for a copyright holder is help them share their work with a wider audience using an open licence.

In New Zealand, copyright lasts for 50 years after death; for 50 years anyone wanting to excerpt, digitise, share, or reprint all or part of a book needs permission from – who? Whoever inherited the copyright. Most authors don’t mention their copyrights in their will, and there’s no central register that tells you who the current copyright holder of a work is. Thus many books become “orphan works” after the author dies: somebody owns the copyright, nobody knows who, and so the text can’t be used for anything by anyone for 50 years.

That sounds a bit drastic, but New Zealand doesn’t currently have a “Fair Use” exemption in its copyright law, so apart from “criticism or review” all the following uses would need permission:

  • Quoting a paragraph from the book in a museum label
  • Reprinting a chapter in a free souvenir booklet
  • Reproducing an illustration on a non-profit historical society website
  • Making a digital copy for a library to lend out as an e-book
  • Reading out an excerpt at a funeral

Vonnie Alexander’s 2010 book Gillespies Beach Beginnings is a local history of a small gold-mining settlement in South Westland. Self-published with a tiny print run, only 10 libraries in the world have it, all in New Zealand. As part of our Wikisource project, we wanted to scan it and convert it into an e-book. I contacted Vonnie and asked if she would be willing to license Gillespies Beach Beginnings under an open licence, or even release it to the public domain. It’s important to note that with an open licence the author keeps their copyright; they’re just stipulating how people are allowed to make copies – essentially, giving permission in advance.

Here’s some wording for legal licensing or copyright release you could use with an author (although it’s based on other releases I’ve seen, I hereby release it to the public domain, CC0 1.0). It’s important that, as well as being saved by both parties, the form is saved with the digitised text (and forwarded to the Volunteer Response Team if the file is uploaded to Wikimedia Commons).


I represent to the Westland District Library that I am either a copyright holder of the following work (“the Work”) or their representative, with the right and authorisation to license the Work.

Title:

Identifier (e.g. ISBN):

I also represent that the Work, to the best of my knowledge, does not infringe or violate any rights of others.

I represent that I have obtained all necessary rights to permit the Westland District Library to share the Work, and that any third-party content is clearly identified and acknowledged within the Work.

Pick one:

□ I dedicate the Work to the public domain using the Creative Commons Public Domain Dedication 1.0 (CC0)
I license the Work to the public under the terms of the following licence:
□ Creative Commons Attribution 4.0 (CC BY)
□ Creative Commons Attribution Share-Alike 4.0 (CC BY-SA)
□ Creative Commons Attribution Non-Commercial 4.0 (CC BY-NC)
□ Creative Commons Attribution Non-Commercial Share-Alike 4.0 (CC BY-NC-SA)

Name:
Address:

Email:
Phone:
Signature:

Date:

Naming photos

Your camera is great, but you shouldn’t let it decide what to call your photos. 

People often send me photographs, either as email attachments or through a site like WeTransfer. To use them in Wikipedia I need to know the photographer, copyright holder (often not the same person), date, and thing or event depicted. Too often though that information’s hidden deep in the metadata, or worn away by repeated handling, and the photo’s named something unhelpful like “DSC5553”.

We’re going to try to give this random assortment of files sensible names.

This blog post won’t help you organise your photos by date, topic, project, or location – you should be using photo management software or digital asset management tools for this already. I just want to help you come up with a meaningful filename. You’ll always have to share photos with media, collaborators, or future-you, so work out a schema for naming those files and put it on a Post It near your computer. Here are some tips to help you develop one.


Some characters are illegal in filenames. Backslash (\), forward slash (/), colon, comma, square brackets, asterisk, question mark, double quote ("), greater and less than (< >) and the pipe (|) are all going to cause problems.
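
If you're renaming a lot of files at once, a few lines of Python can swap out the troublesome characters for you. This is a minimal sketch: the character set is just the list above, and the function name and sample filename are made up.

```python
# The characters named in the paragraph above.
PROBLEM_CHARS = set('\\/:,[]*?"<>|')

def safe_filename(name, replacement="-"):
    """Replace each problem character with a harmless separator."""
    return "".join(replacement if ch in PROBLEM_CHARS else ch for ch in name)

print(safe_filename("Lake Kaniere: view [north]?.jpg"))
```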

Try to use separators rather than blank spaces; spaces in filenames used to be troublesome for computers, and still sometimes cause problems. Underscores (_) or hyphens (-) are easy separators; underscores are invisible when the text is underlined or part of a link, so they aren’t as good as a hyphen. I often use bullets • (option-8 on a Mac, Alt 0149 on Windows) and middle dots · (option-shift-9 on a Mac, Alt 0183 on Windows) as my separators.

Project code. When I was working as a graphic designer, every project in the company had a unique four letter code, and these four letters were used on every document, photo, spreadsheet, and graphic in that project. You could also use this when assembling photos for an ad or blog post. Put the project code first, because you usually want to sort by project name.

Project codes here are WCSO for West Coast Stories Online and WDL for Westland District Library. The number 242780 is a collection ID number.

Date. If you’re not dealing with several projects at once, just keep photos in a project folder and begin the name with the date in YYYY-MM-DD. You might think you don’t need to add a date to the filename, and can just sort photos by their metadata, but date information can get stripped away or changed through repeated sharing, duplicating, and tweaking. And of course metadata for scanned photos will have the date the scan was made, not the date the photo was actually taken. When using YYYY-MM-DD you’ll need to add zeroes to the day or month to pad out the places. If you don’t know the month or day a photo was taken, use xx.

This effort is worth it, because you’ll be able to sort all the photos within a project by date just by alphabetising the filename.

Three of the filenames now have year-month-day dates applied.
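
Here is a short sketch of why the zero padding matters. The helper name and the photo descriptions are invented for the example.

```python
def dated_name(year, month, day, description, sep=" · "):
    """Prefix a description with a zero-padded YYYY-MM-DD date; None becomes xx."""
    mm = "xx" if month is None else f"{month:02d}"
    dd = "xx" if day is None else f"{day:02d}"
    return f"{year:04d}-{mm}-{dd}{sep}{description}"

# Made-up photos, listed out of order.
names = [
    dated_name(2021, 10, 3, "library opening.jpg"),
    dated_name(2021, 2, 10, "Wikisource workshop.jpg"),
    dated_name(2020, None, None, "Gillespies Beach.jpg"),
]
in_order = sorted(names)

# Without padding, October sorts before February.
unpadded = sorted(["2021-2-10", "2021-10-3"])
print(in_order, unpadded)
```

One side effect of using xx: it sorts after any digits, so photos with an unknown month end up at the end of their year.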

Description. What’s depicted, in the fewest words? People’s names should be written lastname firstname to help with sorting (and remember you can’t use commas in filenames).

Just assume the photo is going to be emailed from a random stranger to another random stranger at some point in the future with no explanation or context, and name accordingly.

Your name, or your company name, or the initials of whoever took the photo, just as a return-to-sender tag. This is not necessarily the same as whoever owns the copyright on the photograph, so clearly note the copyright holder, with a copyright symbol © (option-g on a Mac, Alt 0169 on Windows), especially if they’re not you or the photographer.

Lynn Adams is the photographer, but DOC as her employer owns the copyright.

If the image is available under a Creative Commons licence, you may want to include that as well. These licences can be encoded quite concisely: “Creative Commons Attribution Share-Alike 4.0” is usually represented as CC BY-SA 4.0, and you could use CCBYSA4.

Numbering. If you have a batch of photos of exactly the same thing in the same time and place, with the same author and licence, distinguish them with a number at the end. Start with 001, padding out the empty places with zeroes.

Edited versions of a photo should always be named – never change the original photo, always back up the originals, save a copy first, or use “Save As…” while you’re editing, and add something to the filename. V2, v3, cropped, web are good suffixes. You can use “Original” as the suffix to make sure you don’t change it. Using “Final” for the final version is not helpful, as it never turns out to be the final version. If you’re just resizing the photo, add pixel dimensions to the end (yes, they’re visible in the file information, but not when you’re skimming a list of email attachments).

All that makes for a loooooong filename. But don’t worry too much about filename length: it can be 255 characters – more than an entire tweet. Most operating systems display filenames with ellipses in the middle; this schema puts the most critical information (like project, date, number, and version) at either end. Some systems have limits to the total path length (the list of all directories and subdirectories the file’s inside, plus filename): for OneDrive that’s only 400 characters, and before Windows 10 it was just 260, but even if you use 100-character filenames you’ve got some breathing room.

None of the filenames we created were even 100 characters long.
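
Putting the pieces together, a schema like this can be written down as a tiny function, which is one way of making sure you apply it consistently. Everything here is an illustrative assumption: the order of the parts, the separator, and the sample values are made up, and the 255 limit is simply the figure quoted above.

```python
MAX_FILENAME_LENGTH = 255  # the limit quoted in the text

def build_filename(project, date, description, credit, number,
                   suffix="", ext="jpg", sep=" · "):
    """Join the parts in sort order: project, date, description, credit, number, suffix."""
    parts = [project, date, description, credit, f"{number:03d}"]
    if suffix:
        parts.append(suffix)
    name = sep.join(parts) + "." + ext
    if len(name) > MAX_FILENAME_LENGTH:
        raise ValueError(f"filename is {len(name)} characters, over {MAX_FILENAME_LENGTH}")
    return name

# All of these values are made up for illustration.
example = build_filename("WCSO", "2021-02-10", "Wikisource workshop",
                         "© Example Trust", 1, suffix="cropped")
print(example, len(example))
```

On some filesystems the limit is counted in bytes rather than characters, so non-ASCII separators such as · may count for more than one; it's worth checking for your own system.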

A sensible filename schema is part of being a good digital archivist – and if you’ve ever taken a photo on your camera you want to keep, congratulations: you too are a digital archivist. Name your files accordingly. For more tips on all the other aspects of looking after digital collections, see the National Library of New Zealand’s Caring for Taonga guide.

Where credit’s due

If we’re using and sharing other people’s copyrighted photos, we should be scrupulous in telling people the source, who owns the copyright, and whether someone else can reuse it. But even museum and library professionals regularly get this wrong. Here’s a quick guide to crediting photos properly. This is not just good practice, it’s the law (specifically, the 1994 Copyright Act).

There are two things you should declare when you use someone else’s photo: the copyright owner (whoever owns the right to make copies of it), and the licence (why you were allowed to make a copy).

COPYRIGHT

The copyright owner is usually the photographer. Sometimes, if they took the photo as part of their job, their employer owns that copyright. Regardless, we have to say who the copyright holder is. If the copyright is held by an organisation, like DOC or Knox College, I personally credit the photographer as well if I know who they are. That isn’t required, it’s just being polite. 

Sometimes the copyright has expired – in New Zealand, that happens 50 years after the photographer’s death, or for photos taken before 1944. If there’s no longer any copyright, you can do what you like with the photo, and there’s no requirement to credit anyone. Some institutions put all sorts of conditions and restrictions on the use of out-of-copyright photographs, but you should usually just ignore those; feel free to be polite and identify the author or source, but you don’t have to.

LICENCE

If the photo is still in copyright, the default licence is All Rights Reserved. That usually means you need the permission of the copyright holder to reuse it; there are only a few exceptions (research, private study, criticism, or review – as I’m doing in the examples below). But even if you don’t need permission to reproduce the photo, you should still credit it properly.

You don’t have to put the full credit in the photo caption, and often publishers don’t. They’ll put the copyright owner’s name in tiny capitals under or beside the image, and in a “photo credits” section say something like: “p 74: © Daisy Smith / All Rights Reserved. Used with permission.” You can do this too.

If there’s no copyright (it’s expired, or the creator has released the photo into the public domain) you should say “No known copyright” or “Public domain” in the credit. You don’t have to, but it’s good practice. When you reproduce a photo, you should also be telling people if or how they can reuse that photo themselves: don’t give them false information!

Some photos are released under a Creative Commons or CC licence: everything in WikiCommons* (which supplies most of Wikipedia’s photos), much of Te Ara, and some of Flickr or DigitalNZ. A Creative Commons image is free for anyone to use, but there are usually one or more conditions. If you found the photo on Wikipedia, you should click through to view it, then click “View on Commons” to see where it’s stored on WikiCommons and what those conditions are. For example, the licence might be CC BY 4.0 (translation: Creative Commons, Attribution). This means you have to credit the copyright holder (usually the photographer). This isn’t optional, or being nice! It’s a legal agreement you’ve entered into by reusing that photograph. Sometimes the photographer only gives their silly Wikipedia username. You still have to credit them.

You also have to state what kind of CC licence the photograph has (so people know what they themselves can use it for), and ideally give a source or link that lets people find the original. This is also part of the legal agreement you’ve entered into! The best way to do this if the photo’s online is a link to its page in WikiCommons. In print media you can just say “WikiCommons” or “Wikimedia Commons”.

If a DOC office emailed you this Brown Teal photo, the credit would look something like “© DOC / CC BY”, which translates as “DOC owns the copyright, and they’re releasing the photo under a Creative Commons Attribution licence”. If you found this photo on Flickr, you could link to it directly, using the licence it says on Flickr: “© DOC / Flickr / CC BY 2.0”. But looking at the photo’s metadata tells us it was taken in 2009 by “Fantommst”, who turns out to be Auckland photographer Lisa Ridings. Presumably DOC commissioned it, or she was working for them. So an even better credit would be “© DOC / Lisa Ridings / Flickr / CC BY 2.0”
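
The credit lines above all follow the same pattern: copyright holder, then photographer, then source, then licence, separated by slashes. A minimal sketch of that pattern as a function (the function and parameter names are made up for illustration):

```python
def credit_line(copyright_holder, licence, photographer=None, source=None):
    """Assemble a credit: © holder / photographer / source / licence."""
    parts = ["© " + copyright_holder]
    # Only name the photographer separately if they aren't the copyright holder.
    if photographer and photographer != copyright_holder:
        parts.append(photographer)
    if source:
        parts.append(source)
    parts.append(licence)
    return " / ".join(parts)

print(credit_line("DOC", "CC BY"))
print(credit_line("DOC", "CC BY 2.0", photographer="Lisa Ridings", source="Flickr"))
```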

SOME REAL LIFE EXAMPLES (TWEAKED TO PROTECT THE OFFENDERS)

This is a Sharon Murdoch cartoon published in Stuff. It’s still very much in copyright, even though there’s a copy hosted in the NLNZ collection. So we’d need to ask for permission to reuse this, and the credit line would be:

© Sharon Murdoch/Stuff, All Rights Reserved


“Wikimedia” definitely didn’t take this photo of the Champs Élysées; when we track it down in Wikimedia Commons, we see that German war correspondent Johannes Jörgensen did, and the German National Archive has specified how they’d like to be credited. So the correct credit, with a link to the original, is: 

Bundesarchiv, Bild 101I-362-2210-05A / Jörgensen / Wikimedia Commons / CC-BY-SA 3.0.


This is a photo of the late entomologist Ray Shannon, from his personal papers, but just being in a photo doesn’t make you the copyright holder. Ray almost certainly didn’t take this photo, and since it’s less than 50 years old it’s definitely copyrighted to someone; we’d need to track them down and get permission before we could use it at all.


The dreaded Mr or Ms “Supplied”, who seems to take most of the photos in newspapers. The photo is actually a selfie by Axel Wilke, and available in WikiCommons, so the Herald should have put:

© Axel Wilke / Commons / CC BY-SA 4.0


TO SUMMARISE

We should be obeying the law and crediting creators properly. When you use someone else’s photo, clearly state whose it is, and what licence it’s released under. Let’s lift our game.


A version of this blog post appeared in the February 2021 edition of the LIANZA journal Library Life.


* Just to clarify some confusing names: Creative Commons is a licensing scheme for copyrighted photographs; Wikimedia Commons or WikiCommons is a website that hosts freely-usable photos. Most of the photos in WikiCommons have a Creative Commons licence.

An introduction to Wikisource

On 10 February 2021 a Wikisource volunteer who edits under the username Beeswaxcandle gave a workshop to a dozen West Coast librarians, museum workers, and history buffs on how Wikisource works and why it might be useful for digital heritage. These are my notes summarising his talk.

Beeswaxcandle has been a Wikisource editor – a wikisourcerer – since 2009. He began as a Wikipedia editor, but when one of his first articles was promptly shredded by other volunteers he turned to the calmer backwater of Wikisource. A musician, one of his biggest projects there has been transcribing the 1900 Grove Dictionary of Music and Musicians, still in progress (he’s up to Schubert).

Wikisource is a free online digital library that anyone can improve. Its logo is an iceberg; much is happening beneath the surface. It was created as a sister project to Wikipedia, and aims to be a reference library of primary source texts. Its value as a repository is not that it contains scanned images of the texts, but that these have been transcribed and proofread by volunteers (at least two per page) so they can be found by search engines.

The great strength of Wikisource is that its transcribed text is backed up by the original scanned pages; anyone can check the source and correct mistakes. It’s useful to compare it to the free ebook library Project Gutenberg. Gutenberg doesn’t include the original scanned text, and will sometimes merge several different editions: there’s no way to check the accuracy of its transcriptions. (Wikisource began back in 2003 as “Project Sourceberg” and numerous Gutenberg transcriptions have been added to Wikisource, even without scanned pages to back them up.)

There are other Wikimedia projects for housing texts that have different goals to Wikisource: Wikimedia Commons can host images or entire PDFs of publications, but it doesn’t transcribe; Wikibooks is a home for annotated publications and study guides; but Wikisource is solely for accurate transcriptions of the text as written, not commentary nor interpretation.

Wikisource is a small community, with only 421 active users in English and 26 admins. There are substantial efforts in French, Russian, and other languages: 98% of French Wikisource works are backed up by scanned pages, compared with 58% in English. The culture seems more laid-back and friendly than Wikipedia, with people working together to finish someone else’s project as a nice gesture. Rates of vandalism seem low, and the community uses watchlists and the list of recent changes to keep tabs on it. There are regular collaborations and “Proofread of the Month” projects – January is “quirky” month.

Author links to Julius von Haast’s Wikisource page

Usually the only blue hyperlinks in a Wikisource text will be Author pages, which display a short bio and a list of publications. Wikipedia-style linking would be seen as commentary. [To me author pages looked like something that could be generated from Wikidata, but Wikisource seems to have an arm’s-length relationship to Wikidata, and each project blames the other. — Mike] Because Wikisource hosts mostly public-domain works, the authors will usually be long-dead, so no page for J.K. Rowling. The rolling cutoff date for the US public domain is 1925: on January 1st this year The Great Gatsby entered the public domain, and very soon after a full transcribed and downloadable version appeared in Wikisource. [During question time there was some discussion about approaching local historians and convincing them to donate their copyrights to the public domain so some of the small local histories could be made available through Wikisource.]

Out of all the texts published in English from Chaucer to 1925, Wikisource holds about 300,000. That includes plenty of out-of-copyright novels, but Wikisource can also host:

From Why the Shoe Pinches  (1861) by Georg Hermann von Meyer

There are also portals, a curated collection on a particular area. The New Zealand portal (a selection from Category:New Zealand) includes legislation, treaties, travel writing, and floras. Some highlights:

One gem is the Letters from New Zealand, 1857–1911 by clergyman Henry W. Harper, who spent 9 years on the West Coast. Advice given to him before moving to Hokitika: “Take an old hand’s advice, don’t be discouraged, and if it rains, let it rain.”

But there is very little New Zealand work, and a need for more volunteers here to get busy expanding and correcting Wikisource’s holdings. There’s also plenty to do sorting and categorising what’s already been done, sourcing better images, finding scanned versions of works already done, and creating author pages.

To add a text to Wikisource, it needs to be scanned at 200 dpi or more for OCR to work, and 250–300 dpi is fine. You can also source already-scanned works from the Internet Archive, the Biodiversity Heritage Library, or the Hathi Trust (which has clearer scans than the Internet Archive). An example text that could be brought into Wikisource would be the Hathi Trust scan of Horatio Gordon Robley’s Pounamu: notes on New Zealand greenstone (1915). I’ve written more on the process of scanning and OCR in another blog post. Once in Wikisource, each page needs to be both proofread and validated by volunteers – and they have to be two different volunteers (also known as Wikisourcerers). As this is happening the work can be transcluded into the main namespace, which is Wikisource jargon for being turned into a live digital document available for download.

George Marriner’s 1908 The Kea: a New Zealand problem, badly OCR’d in the Biodiversity Heritage Library, and a good candidate for Wikisource.

One example of a Wikisource project happened during COVID lockdown in the UK, when the staff of the National Library of Scotland (NLS) were sent home. The NLS had an extensive collection of scanned pamphlets – for example, The surprising adventures, miraculous escapes, and wonderful travels of the renowned Baron Munchausen – which had been (poorly) OCR’d, but needed to be proofread by humans. The resulting Wikiproject, advised by Beeswaxcandle, uploaded thousands of works, and over 1100 were eventually transcluded, by dozens of NLS staff, from April to August 2020. The NLS was then able to take the corrected text and reimport it to their database.

The talk was well received, with plenty of discussion, and may well be a catalyst for West Coast heritage organisations using Wikisource to make their collections more accessible. Watch this space.

Digitising a tiny book

Many books about the West Coast are a) out of print, b) out of copyright, and c) printed in very small runs. Consequently, this history is only available in a few libraries, so only accessible to a few thousand people.

One solution is to reprint these out of copyright books, and a couple of small New Zealand publishers have started doing this; the results are expensive and not very high quality, and the distribution is still quite limited. How could we make this West Coast history available to millions of people, ideally for free?

A simple way to do this would be to digitise the books and release them online, so anyone can download them for free, for reading as an ebook or to print their own copy. The problems are twofold: the work of typing and proofreading, and finding a permanent place to host the result. One solution is to use the Wikimedia Foundation project Wikisource, a repository for free public-domain texts, which volunteers collaboratively transcribe and proofread in their free time. Here’s how it works, using a small book from the Westland District Library collections and easily-available hardware and software.

Hokitika, N.Z. is a 24-page pamphlet containing a talk by Westland County Clerk David Evans on the origins of Hokitika, essays by William Evans (no relation) and journalist Samuel Saunders, and an excerpt from the writings of Julius Von Haast. It was printed as a fund-raising booklet by the Hokitika Guardian in 1921; apart from us, the only libraries in the world which have a copy are in Dunedin, Wellington, and Adelaide. Being pre-1925 means it’s out of copyright in the USA (something that Wikisource requires, because that’s where its servers are hosted).

I started by scanning each page as monochrome text TIFF files at 400 dpi. This didn’t require a fancy scanner, just the library’s multifunction photocopier (an ApeosPort VII) and the free Image Capture that came with my MacBook. A dedicated scanner and software would certainly have sped things up, but they weren’t necessary. It’s important to crop pages evenly and quite close to the text, because a page margin gets added later during printing. I took extra time to clean up the scans in Affinity Photo and adjust the contrast, erase spots and lines, and straighten the columns.

I used Print > PDF in Preview to export the scans as a single PDF; Preview let me reshuffle pages and drop photographic plates into the right place. I could then print the PDF 2-up on A4 using Preview’s booklet layout, and with a long-arm stapler could turn them into an A5 pamphlet (be sure to choose Print Entire Image or the text can be cropped). At this stage we now have a printable PDF which can produce a far better copy of the original than a photocopier could.
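
Preview works out the page order for you, but it helps to know what a booklet layout has to do. For a stapled pamphlet, each A4 sheet carries four pages: the outermost sheet has the last and first pages on one side and the second and second-last on the other, and so on inwards. A sketch of that ordering, assuming the page count is a multiple of four (the function name is made up):

```python
def booklet_order(page_count):
    """Return, for each sheet, ((front left, front right), (back left, back right))."""
    if page_count % 4 != 0:
        raise ValueError("pad the document with blank pages to a multiple of 4")
    sheets = []
    for k in range(page_count // 4):
        front = (page_count - 2 * k, 2 * k + 1)
        back = (2 * k + 2, page_count - 2 * k - 1)
        sheets.append((front, back))
    return sheets

# A 24-page pamphlet needs six sheets; the innermost one holds the centre spread.
for sheet in booklet_order(24):
    print(sheet)
```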

I used Affinity Photo to stitch together a better version of the pamphlet’s cover – without library stickers – and replaced the one in the PDF. Then I uploaded the PDF to Wikimedia Commons, as well as all the illustrations separately, all the typographic ornaments as high-resolution 1-bit (black and white, not greyscale) TIFFs, and the fancy headings (just because I thought they were a bit hilarious). The images, cover, and PDF were then all available in Category:Hokitika, N.Z : the Birth of the Borough (1921) in Commons for anyone to use.

The PDF and all the images need to be uploaded to Wikimedia Commons first, so Wikisource can use them to assemble pages.

With a clean PDF in Wikimedia Commons, I could create an Index in Wikisource, which lists all those pages and shows whether they’ve been proofread or not. This is a critical step when working with volunteers, who can see at a glance which pages to work on. A page in Wikisource is first proofread side by side with its PDF scan, then saved, and finally verified by a different editor, so every page gets checked at least twice. When all the pages are validated the book can be made available as a [Wikisource work](https://en.wikisource.org/wiki/Hokitika,_N.Z.): essentially becoming a long web page with digital text. There’s a certain amount of formatting, and images are inserted in the right places, but the type size and font are up to the reader’s settings or screen reader.

The typesetter at the Hokitika Guardian was determined to use all the fancy fonts and squiggly rules in the type cases.

Rather than transcribe each page by hand, I used Optical Character Recognition (OCR) software to generate a rough draft. There are plenty of free services that will perform OCR on an uploaded image, and the best ones seem to use the software Tesseract. First I tried uploading single pages to PDF24 Tools: a page took 1 minute to OCR, and 7 minutes to manually clean up ready for upload (cleanup is just sorting line ends and column breaks; the proofreading still has to happen). I also tried the Mac and Windows software PDF OCR X; the community edition can only convert one page at a time, but was reasonably speedy and understood two-column pages. OCRing and uploading the 23 pages of text took half an hour. Wonky columns cause big problems for OCR software, so I was glad I spent some time cleaning up the scans first.
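I did the line-end cleanup by hand, but much of it could probably be scripted. Here’s a rough sketch in Python, assuming the OCR output is plain text with a blank line between paragraphs. The function name tidy_ocr_text is hypothetical, and this is not what PDF24 Tools or PDF OCR X do internally.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Rejoin words split across line ends and unwrap lines within paragraphs.

    Hypothetical helper. It assumes paragraphs are separated by blank lines,
    and it will also merge genuine hyphens that happen to fall at a line end
    ("fund-" / "raising" becomes "fundraising"), so proofreading is still needed.
    """
    # Normalise Windows and old Mac line endings.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Join a word hyphenated across a line break: "Hoki-\ntika" -> "Hokitika".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Split into paragraphs on blank lines, then collapse each to one line.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(paragraph.split()) for paragraph in paragraphs]

    # Drop empty paragraphs left by leading or trailing blank lines.
    return "\n\n".join(paragraph for paragraph in cleaned if paragraph)


if __name__ == "__main__":
    sample = (
        "The origins of Hoki-\n"
        "tika are described\n"
        "in this talk.\n"
        "\n"
        "A second para-\n"
        "graph follows.\n"
    )
    print(tidy_ocr_text(sample))
```

Column breaks are harder to script, because the right fix depends on how the OCR software read the page, so I’d still expect to check those by eye.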

As a new transcription the book briefly featured on Wikisource’s home page, just below the intriguing-sounding Dream of a Rarebit Fiend.

Once digitised, proofread, and verified, the book can be downloaded from Wikisource as an EPUB (for all e-readers except the Kindle), PDF, or MOBI (for Kindles).

Proofreading took about 10 minutes per page, and an experienced Wikisource editor, User:Beeswaxcandle, validated each one and handled the formatting of headings, page numbers, and even the fancy page rules. Page formatting is not too complicated, and it’s all documented in the Wikisource Help pages, but I’m now preparing a handout for new proofreaders with a list of tips and the common templates you’d use to validate pages. Someone brand new to Wikisource could start proofreading text right away, and leave the more technical stuff to someone else.

Having the book online as text – not just page images – opens up lots of possibilities. The book can be indexed by Google and the contents more easily discovered (it’s now deposited in the Internet Archive and the Open Library, for example). It can be downloaded as a PDF and read on a tablet, or uploaded to Overdrive to be borrowed from the Westland District Library as an EPUB file – which means it’s now accessible to the sight impaired, who use a screen reader or need to increase font size.

An example of a digitised ebook from Wikisource manually added to the catalogue alongside the physical copies.

The book can be used as a reference in Wikipedia articles and cited in Wikidata; in a future blog post I’ll demonstrate using this book to support a “Streets of Hokitika” Wikidata project.

Digitising Hokitika, N.Z. (1921) has been a proof of concept, and shows there is scope for proofreading and validating longer texts. Short books we can prepare by hand, but longer ones we’ll want to use better scanning technology for. Notably, some West-Coast-related books have already been digitised and OCR’d by the Internet Archive or the Biodiversity Heritage Library, like George Marriner’s 1908 book The Kea : a New Zealand Problem, and could be imported to Wikisource for proofreading right now.

The next step will be to build up a community of Wikisource volunteers, who could be anywhere in the world but ideally here on the West Coast. We now have regular Wikipedia meetups in Hokitika and Greymouth and can suggest proofreading projects to the attendees. In February Wikisource veteran Andrew Wooding is giving a presentation to local GLAM people, and we’ll have a Wikisource workshop at the West Coast WikiCon in Hokitika in March. I’m hoping people interested in genealogy and local heritage, or working in museums, libraries, and archives, will see the potential for making out-of-copyright books much more available.


Many thanks to Sara Thomas from the National Library of Scotland and Andrew Wooding (User:Beeswaxcandle) for all their help with this project.

Arriving on the West Coast

I moved to Hokitika on Sunday 22nd November and started at Westland District Library as a Digital Discovery Librarian the next day. Starting this job marks the end of over two years of having no fixed abode; in June 2018 I left my job as a curator at Whanganui Regional Museum and hit the road as New Zealand Wikipedian at Large, travelling from North Cape to Bluff and helping institutions take Wikipedia seriously. As a roving Wikipedian I went to conferences in Bangkok, Berlin, and Stockholm, lived in Estonia for a month (and in Palmerston North for five months to make up for it) and then in September arrived on the West Coast.

Development West Coast had sponsored me as West Coast Wikipedian at Large to spend six weeks travelling from Westport to Fox Glacier and running workshops for libraries, museums, tourism operators, and the general public. While I was at Westland District Library, the manager Natasha Morris asked me if I’d ever considered relocating to Hokitika. It turned out there was a $59 million COVID relief package allocated to libraries in recognition of their value to communities, administered by the National Library, and Westland District Council had secured funding for two staff positions to run until June 2022.

So for the first time in my life I’m a librarian. I need to learn how to issue, check in, shelve, place holds, and handle overdue fines, but most of my work will be dealing with online sources, photographs, newspapers, and blog posts. As a Digital Discovery Librarian my brief, very broadly, is to help West Coast stories get told online, and empower the people of the Coast with the skills to do that.

Natasha and I are currently brainstorming projects for me to tackle. Some of them will be Wikipedia-based, like making sure there’s a good article about every library and museum on the West Coast—best done by recruiting and training volunteer editors from the community, and supporting them over 18 months so they form a self-sustaining editing community. Some will be working with photo collections, looking at ways to digitise them and make them more widely available and shareable. And some will be working with books: getting some out-of-copyright and out-of-print historical works online. I’m looking forward to working with communities like Ōkārito, Fox Glacier, and Haast, as well as collaborating with librarians in Greymouth and Westport.

Westland District Library • MRD • CC BY

After a week on the job I’ve been joined by Rauhine Coakley, a Community Engagement Librarian supported by the same National-Library-administered fund. So we’ve almost doubled the library team here in Hokitika. I personally think this is the coolest part of the West Coast, a little town that punches above its weight. And it’s close to coastal forest, walking tracks, lakes, and beaches, all of which appeal to my love of getting out into nature and looking at ferns and insects.

The rules for the Hokitika Free Public Library required gentlemen to remove their hats and not spit on the floor.

Over the course of my time as Digital Discovery Librarian I’ll be blogging my progress each month, and sharing quirky and fascinating things I come across. My goal is also to compile useful resources for institutions and individuals wanting to open up their collections and tell stories online. If you want to participate or have ideas for projects, contact me at Mike.Dickison@westlib.co.nz. Kia ora koutou!