Digitising a tiny book

Many books about the West Coast are a) out of print, b) out of copyright, and c) printed in very small runs. Consequently, this history is only available in a few libraries, so only accessible to a few thousand people.

One solution is to reprint these out-of-copyright books, and a couple of small New Zealand publishers have started doing this, but the results are expensive and not very high quality, and the distribution is still quite limited. How could we make this West Coast history available to millions of people, ideally for free?

A simple way to do this would be to digitise the books and release them online, so anyone can download them for free, for reading as an ebook or to print their own copy. The problems are twofold: the work of typing and proofreading, and finding a permanent place to host the result. One solution is to use the Wikimedia Foundation project Wikisource, a repository for free public-domain texts, which volunteers collaboratively transcribe and proofread in their free time. Here’s how it works, using a small book from the Westland District Library collections and easily available hardware and software.

Hokitika, N.Z. is a 24-page pamphlet containing a talk by Westland County Clerk David Evans on the origins of Hokitika, essays by William Evans (no relation) and journalist Samuel Saunders, and an excerpt from the writings of Julius von Haast. It was printed as a fund-raising booklet by the Hokitika Guardian in 1921; apart from us, the only libraries in the world which have a copy are in Dunedin, Wellington, and Adelaide. Being pre-1925 means it’s out of copyright in the USA (something that Wikisource requires, because that’s where its servers are hosted).
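The “pre-1925” cutoff is a moving threshold rather than a fixed date. As a rough sketch, assuming the simplified US rule that a published work stays in copyright through the end of the 95th year after publication (this ignores unpublished works and other special cases), the check might look like this in Python. The function names are invented for illustration.

```python
def us_public_domain_cutoff(current_year: int) -> int:
    """Earliest publication year still in US copyright.

    Assumption: a simplified 95-year term for published works, which
    run out at the end of the 95th calendar year after publication.
    """
    return current_year - 95


def is_out_of_us_copyright(publication_year: int, current_year: int) -> bool:
    """True if a work published in publication_year has aged out."""
    return publication_year < us_public_domain_cutoff(current_year)


if __name__ == "__main__":
    # In 2020 the cutoff is 1925, so a 1921 pamphlet qualifies.
    print(us_public_domain_cutoff(2020))
    print(is_out_of_us_copyright(1921, 2020))
```

Under this assumption the cutoff advances by one year every 1 January, so more books become eligible for Wikisource each year.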

I started by scanning each page as monochrome text TIFF files at 400 dpi. This didn’t require a fancy scanner, just the library’s multifunction photocopier (an ApeosPort VII) and the free Image Capture that came with my MacBook. A dedicated scanner and software would certainly have sped things up, but they weren’t necessary. It’s important to crop pages evenly and quite close to the text, because a page margin gets added later during printing. I took extra time to clean up the scans in Affinity Photo and adjust the contrast, erase spots and lines, and straighten the columns.

I used Print > PDF in Preview to export the scans as a single PDF; Preview let me reshuffle pages and drop photographic plates into the right place. I could then print the PDF 2-up on A4 using Preview’s booklet layout and, with a long-arm stapler, turn the sheets into an A5 pamphlet (be sure to choose Print Entire Image or the text can be cropped). At this stage we have a printable PDF which can produce a far better copy of the original than a photocopier could.
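Preview’s booklet layout works out the page order for you, but the arithmetic behind a stapled 2-up booklet is easy to sketch. The Python below is a minimal illustration, not Preview’s actual code: it assumes a simple saddle-stitched booklet where each sheet is printed on both sides and folded once, and the function name `booklet_sides` is hypothetical.

```python
from typing import List, Optional, Tuple

# One printed side of a sheet: (left page, right page); None means blank.
Side = Tuple[Optional[int], Optional[int]]


def booklet_sides(page_count: int) -> List[Side]:
    """Return the page pairs for each sheet side, outermost sheet first."""
    # Each folded sheet carries four pages, so round up to a multiple of 4.
    padded = -(-page_count // 4) * 4

    def page(n: int) -> Optional[int]:
        # Pages beyond the real page count are blank padding.
        return n if n <= page_count else None

    sides: List[Side] = []
    for sheet in range(padded // 4):
        front = (page(padded - 2 * sheet), page(2 * sheet + 1))
        back = (page(2 * sheet + 2), page(padded - 2 * sheet - 1))
        sides.extend([front, back])
    return sides


if __name__ == "__main__":
    for number, side in enumerate(booklet_sides(24), start=1):
        print(number, side)
```

For a 24-page pamphlet this gives six sheets of A4: the outermost sheet carries pages 24 and 1 on one side and 2 and 23 on the other, and the centre sheet ends with pages 12 and 13 facing each other.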

I used Affinity Photo to stitch together a better version of the pamphlet’s cover – without library stickers – and replaced the one in the PDF. Then I uploaded the PDF to Wikimedia Commons, as well as all the illustrations separately, all the typographic ornaments as high-resolution 1-bit (black and white, not greyscale) TIFFs, and the fancy headings (just because I thought they were a bit hilarious). The images, cover, and PDF were then all available in Category:Hokitika, N.Z : the Birth of the Borough (1921) in Commons for anyone to use.

The PDF and all the images need to be uploaded to Wikimedia Commons first, so Wikisource can use them to assemble pages.

With a clean PDF in Wikimedia Commons, I could create an Index in Wikisource, which lists all those pages and shows whether they’ve been proofread or not. This is a critical step when working with volunteers, who can see at a glance which pages to work on. A page in Wikisource is first proofread side by side with its PDF scan, then saved, and finally validated by a different editor, so every page gets checked at least twice. When all the pages are validated the book can be made available as a [Wikisource work](https://en.wikisource.org/wiki/Hokitika,_N.Z.), essentially becoming a long web page with digital text. There’s a certain amount of formatting, and images are inserted in the right places, but the type size and font are up to the reader’s settings or screen reader.
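The two-step check described above can be modelled in a few lines of Python. This is only a toy model of the workflow as I’ve described it, not how the Wikisource software is implemented; the class, the function names, and the editor names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

STATUSES = ("not proofread", "proofread", "validated")


@dataclass
class Page:
    number: int
    status: str = "not proofread"
    proofread_by: Optional[str] = None
    validated_by: Optional[str] = None


def proofread(page: Page, editor: str) -> None:
    """First pass: compare the text against the scan and save it."""
    if page.status != "not proofread":
        raise ValueError(f"page {page.number} is already {page.status}")
    page.status = "proofread"
    page.proofread_by = editor


def validate(page: Page, editor: str) -> None:
    """Second pass: must be done by someone other than the proofreader."""
    if page.status != "proofread":
        raise ValueError(f"page {page.number} has not been proofread yet")
    if editor == page.proofread_by:
        raise ValueError("validation needs a different editor")
    page.status = "validated"
    page.validated_by = editor


def progress(pages: List[Page]) -> Dict[str, int]:
    """Count pages in each state, like the at-a-glance Index view."""
    counts = {status: 0 for status in STATUSES}
    for page in pages:
        counts[page.status] += 1
    return counts


def ready_to_publish(pages: List[Page]) -> bool:
    """A book is ready only when every one of its pages is validated."""
    return bool(pages) and all(p.status == "validated" for p in pages)
```

The rule that matters is in `validate`: because the validator must differ from the proofreader, every published page has been read by at least two people.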

The typesetter at the Hokitika Guardian was determined to use all the fancy fonts and squiggly rules in the type cases.

Rather than transcribe each page by hand, I used Optical Character Recognition (OCR) software to generate a rough draft. There are plenty of free services that will perform OCR on an uploaded image, and the best ones seem to use the software Tesseract. First I tried uploading single pages to PDF24 Tools: a page took 1 minute to OCR, and 7 minutes to manually clean up ready for upload (cleanup is just sorting line ends and column breaks; the proofreading still has to happen). I also tried the Mac and Windows software PDF OCR X; the community edition can only convert one page at a time, but was reasonably speedy and understood two-column pages. OCRing and uploading the 23 pages of text took half an hour. Wonky columns cause big problems for OCR software, so I was glad I spent some time cleaning up the scans first.
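I did the line-end cleanup by hand, but part of it could be automated. The Python below is a minimal sketch of one way to do that, assuming the OCR output separates paragraphs with blank lines and breaks words across lines with a trailing hyphen; the function name `unwrap_ocr_text` is hypothetical, and the sample sentence is made up.

```python
import re


def unwrap_ocr_text(raw: str) -> str:
    """Join hard-wrapped OCR lines into one line per paragraph."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    cleaned = []
    for para in paragraphs:
        text = ""
        for line in (ln.strip() for ln in para.splitlines()):
            if not line:
                continue
            if text.endswith("-") and line[:1].islower():
                # Word split across a line break: drop the hyphen and rejoin.
                text = text[:-1] + line
            elif text:
                text += " " + line
            else:
                text = line
        cleaned.append(text)
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The type-\nsetter used every\nfancy font.\n\nSecond paragraph."
    print(unwrap_ocr_text(sample))
```

The hyphen rule is deliberately crude: it will also fuse genuine compounds such as “fund-raising” if they happen to break at the hyphen, which is one more reason the proofreading pass can’t be skipped.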

As a new transcription the book briefly featured on Wikisource’s home page, just below the intriguing-sounding Dream of a Rarebit Fiend.

Once digitised, proofread, and validated, the book can be downloaded from Wikisource as an EPUB (all e-readers except the Kindle), PDF, or MOBI (for Kindles).

Proofreading took about 10 minutes per page, and an experienced Wikisource editor, User:Beeswaxcandle, validated each one and handled the formatting of headings, page numbers, and even the fancy page rules. Page formatting is not too complicated, and it’s all documented in the Wikisource Help pages, but I’m now preparing a handout for new proofreaders with a list of tips and the common templates you’d use to validate pages. Someone brand new to Wikisource could start proofreading text right away, and leave the more technical stuff to someone else.
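The 10-minutes-per-page figure gives a rough way to budget proofreading for other books. The Python below is a back-of-envelope sketch only: it assumes the per-page time stays the same for longer books and it leaves out validation, which I didn’t time. The function names, the 200-page book, and the five volunteers are hypothetical.

```python
import math


def proofreading_minutes(pages: int, minutes_per_page: int = 10) -> int:
    """Total proofreading time, assuming a constant time per page."""
    return pages * minutes_per_page


def minutes_per_volunteer(total_minutes: int, volunteers: int) -> int:
    """Share of the work per person if it is split evenly (rounded up)."""
    return math.ceil(total_minutes / volunteers)


def format_duration(minutes: int) -> str:
    """Render a number of minutes as hours and minutes."""
    hours, rest = divmod(minutes, 60)
    return f"{hours} h {rest:02d} min"


if __name__ == "__main__":
    # The 23 text pages of this pamphlet.
    print(format_duration(proofreading_minutes(23)))
    # A hypothetical 200-page book shared between five volunteers.
    share = minutes_per_volunteer(proofreading_minutes(200), 5)
    print(format_duration(share))
```

On these assumptions the pamphlet’s 23 pages come to just under four hours of proofreading, and a 200-page book split five ways would be under seven hours each.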

Having the book online as text – not just page images – opens up lots of possibilities. The book can be indexed by Google and the contents more easily discovered (it’s now deposited in the Internet Archive and the Open Library, for example). It can be downloaded as a PDF and read on a tablet, or uploaded to Overdrive to be borrowed from the Westland District Library as an EPUB file – which means it’s now accessible to the sight impaired, who use a screen reader or need to increase font size. It can be used as a reference in Wikipedia articles and cited in Wikidata; in a future blog post I’ll demonstrate using this book to support a “Streets of Hokitika” Wikidata project.

Digitising Hokitika, N.Z. (1921) has been a proof of concept, and shows there is scope for proofreading and validating longer texts. Short books we can prepare by hand, but longer ones we’ll want to use better scanning technology for. Notably, some West-Coast-related books have already been digitised and OCR’d by the Internet Archive or the Biodiversity Heritage Library, like George Marriner’s 1908 book The Kea : a New Zealand Problem, and could be imported to Wikisource for proofreading right now.

The next step will be to build up a community of Wikisource volunteers, who could be anywhere in the world but ideally here on the West Coast. We now have regular Wikipedia meetups in Hokitika and Greymouth and can suggest proofreading projects to the attendees. In February, Wikisource veteran Andrew Wooding is giving a presentation to local GLAM people, and we’ll have a Wikisource workshop at the West Coast WikiCon in Hokitika in March. I’m hoping people interested in genealogy and local heritage, or working in museums, libraries, and archives, will see the potential for making out-of-copyright books much more available.


Many thanks to Sara Thomas from the National Library of Scotland and Andrew Wooding (User:Beeswaxcandle) for all their help with this project.