Wikisource cheat sheet

There are common bits of formatting you have to do in Wikisource; this is a list of the ones beginners are most likely to come up against. Most take the form of templates, inside {{double curly brackets}}. You can nest one template inside another, but keep count of the brackets: if you leave one off, weird things will happen.

Wikisource measures everything in ems; an em is a unit equal to the point size of the font, about the width of a capital M.
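
As a rough worked example of the em arithmetic (a sketch only, not anything Wikisource itself runs): because an em scales with the font, a gap given in ems has no fixed physical size. The helper name `ems_to_points` is made up for illustration.

```python
def ems_to_points(ems: float, font_size_pt: float) -> float:
    """Convert a length in ems to points for a given font size."""
    # One em equals the point size of the font, so the conversion is a product.
    return ems * font_size_pt


# A 3em gap, as in {{dhr|3em}}, at two different reader font sizes.
for size in (10, 16):
    print(f"3em at {size}pt is {ems_to_points(3, size)}pt")
```

The same template therefore produces a bigger gap for a reader who has increased their font size, which is one reason not to chase the exact spacing of the printed page.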

Paragraph formatting and spacing

Effect | Template | Notes
Line break | <br /> | Don’t do this to make a break between paragraphs, just press Return twice.
Space | {{em}} | A one-em space between words.
Bigger space | {{gap}} | A bigger space. Don’t worry about trying to replicate the spacing in a Wikisource scan, though, it’s not the point.
Paragraph space | {{dhr}} or {{dhr|3em}} | A double-height row, and you can specify the gap. Again, no need to replicate white space in an e-book.
Right align | {{right|sampletext}} or {{right|sampletext|2em}} | You can specify the distance from the right margin.
Centre | {{c|sampletext}} |

Typography

Long dash (H⸻H) | {{longdash}} | Enclose this in a {{nowrap| }} to stop it breaking across lines.
Larger | {{larger|Text}} or {{x-larger|Text}} | Goes up to {{xxxx-larger| }}.
Smaller | {{smaller|Text}} | Down to {{xxxx-smaller| }}.
Small block | {{smaller block/s}} and {{smaller block/e}} | Put these templates before and after the block of text you want to make smaller.
Smaller, tighter text | {{fine block/s}} and {{fine block/e}} | As above. Good for quoted text.
Small caps | {{sc|Heading}} | Note that only lower case letters will be converted to small capitals.
Hung punctuation | {{fqm|“}} | This floats an initial quotation mark into the left margin (which is good typography).
Gothic/blackletter font | {{blackletter|Das Text}} | Fraktur fancy text.
Size, colour etc | {{font-size|130%|style=color:#555051|Text}} | Makes larger grey text.

Ornamental Rules

Rule | {{rule}} or {{rule|10em}} | Makes a basic black line. You can tweak the length.
Wiggly line | {{custom rule|w|40|w|40|w|40}} | Three 40-em wiggles, so 120 ems total.
Row of bullets | {{***|5|4em|char=•}} | Or any line of spaced characters across the page: see help.
Fancy rule | {{custom rule|sp|40|fy1|40|sp|40}} | See the {{custom rule}} help for how to generate fancy-pants rules.

Headers

Basic running head | {{rh||BIRTH OF WESTLAND.|4|}} | The pipes separate the left-aligned, centred, and right-aligned text.
Left/right running heads | {{rvh|56|Chapter Title|Book Title}} | Puts the page number and title or chapter in the correct spots: see help.

Words

Typo | {{SIC|typed word|presumed word}} | Use this for actual typos, not for former spellings.
Start of hyphenated word (ad-) | {{hws|ad|admirable}} | Only used if the last word on the page is hyphenated…
End of hyphenated word (-mirable) | {{hwe|mirable|admirable}} | …so the word joins up properly across page breaks.
Paragraph end coincides with page end | {{nop}} | Use when a paragraph ends at the bottom of the page, to stop it being joined to the next page’s text.
Reference in text | <ref>Op. cit. p.11</ref> | You need to put {{smallrefs}} in the footer as well.
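
To see why the hyphenated-word and {{nop}} templates matter, here is a toy Python model of what happens when two proofread pages are run together into one text. It is only a sketch: the function name and regular expressions are made up for illustration, and it assumes that the start template drops out, the end template supplies the whole word, and pages are otherwise joined with a space. The real work is done by Wikisource's own software, not by anything like this.

```python
import re

# Toy patterns for the three templates (not MediaWiki's real parser).
HWS_AT_END = re.compile(r"\{\{hws\|[^|{}]*\|[^|{}]*\}\}\s*$")
HWE_AT_START = re.compile(r"^\s*\{\{hwe\|[^|{}]*\|([^|{}]*)\}\}")
NOP_AT_END = re.compile(r"\s*\{\{nop\}\}\s*$")


def join_pages(first: str, second: str) -> str:
    """Join two page texts roughly the way the finished work reads."""
    if NOP_AT_END.search(first):
        # {{nop}}: keep the paragraph break instead of running the pages together.
        return NOP_AT_END.sub("", first) + "\n\n" + second.lstrip()
    if HWS_AT_END.search(first) and HWE_AT_START.match(second):
        # The start template contributes nothing; the end template gives the whole word.
        first = HWS_AT_END.sub("", first)
        second = HWE_AT_START.sub(lambda m: m.group(1), second, count=1)
        return first + second
    # Default: the page break becomes an ordinary space.
    return first.rstrip() + " " + second.lstrip()


print(join_pages("It was an {{hws|ad|admirable}}", "{{hwe|mirable|admirable}} plan."))
```

Without the two hyphenation templates the default branch would apply, and the reader would see "ad- mirable" stranded in the middle of a line.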

Poetry

{{block center|
{{smaller|
{{fqm}}And forth there stretched a silent land—<br />
For distance robbed mine ear of sound—<br />
League after league from the near strand<br />
To giant peaks that, band by band,<br />
Marched past the vision’s outmost bound.”
}}
}}

(There are actual <poem></poem> tags, but Wikisource people prefer this method. The <poem> tags work well for formatting a list of people or ingredients, though.)

Tip: apply typographic formatting first, then enclose it in positioning/spacing formatting, like the poem above.

An introduction to Wikisource

On 10 February 2021 a Wikisource volunteer who edits under the username Beeswaxcandle gave a workshop to a dozen West Coast librarians, museum workers, and history buffs on how Wikisource works and why it might be useful for digital heritage. These are my notes summarising his talk.

Beeswaxcandle has been a Wikisource editor – a wikisourcerer – since 2009. He began as a Wikipedia editor, but when one of his first articles was promptly shredded by other volunteers he turned to the calmer backwater of Wikisource. He is a musician, and one of his biggest projects there has been transcribing the 1900 Grove Dictionary of Music and Musicians, which is still in progress (he’s up to Schubert).

Wikisource is a free online digital library that anyone can improve. Its logo is an iceberg; much is happening beneath the surface. It was created as a sister project to Wikipedia, and aims to be a reference library of primary source texts. Its value as a repository is not that it contains scanned images of the texts, but that these have been transcribed and proofread by volunteers (at least two per page) so they can be found by search engines.

The great strength of Wikisource is that its transcribed text is backed up by the original scanned pages; anyone can check the source and correct mistakes. It’s useful to compare it to the free ebook library Project Gutenberg. Gutenberg doesn’t include the original scanned text, and will sometimes merge several different editions: there’s no way to check the accuracy of its transcriptions. (Wikisource began back in 2003 as “Project Sourceberg” and numerous Gutenberg transcriptions have been added to Wikisource, even without scanned pages to back them up.)

There are other Wikimedia projects for housing texts that have different goals to Wikisource: Wikimedia Commons can host images or entire PDFs of publications, but it doesn’t transcribe; Wikibooks is a home for annotated publications and study guides; but Wikisource is solely for accurate transcriptions of the text as written, not commentary nor interpretation.

Wikisource is a small community, with only 421 active users in English and 26 admins. There are substantial efforts in French, Russian, and other languages: 98% of French Wikisource works are backed up by scanned pages, compared with 58% in English. The culture seems more laid-back and friendly than Wikipedia, with people working together to finish someone else’s project as a nice gesture. Rates of vandalism seem low, and the community uses watchlists and the list of recent changes to keep tabs on it. There are regular collaborations and “Proofread of the Month” projects – January is “quirky” month.

Author links to Julius von Haast’s Wikisource page

Usually the only blue hyperlinks in a Wikisource text will be Author pages, which display a short bio and a list of publications. Wikipedia-style linking would be seen as commentary. [To me author pages looked like something that could be generated from Wikidata, but Wikisource seems to have an arm’s-length relationship to Wikidata, and each project blames the other. — Mike] Because Wikisource hosts mostly public-domain works, the authors will usually be long-dead, so no page for J.K. Rowling. The rolling cutoff date for the US public domain is 1925: on January 1st this year The Great Gatsby entered the public domain, and very soon after a full transcribed and downloadable version appeared in Wikisource. [During question time there was some discussion about approaching local historians and convincing them to donate their copyrights to the public domain so some of the small local histories could be made available through Wikisource.]

Out of all the texts published in English from Chaucer to 1925, Wikisource holds about 300,000. That includes plenty of out-of-copyright novels, but Wikisource can also host:

From Why the Shoe Pinches  (1861) by Georg Hermann von Meyer

There are also portals: curated collections on a particular area. The New Zealand portal (a selection from Category:New Zealand) includes legislation, treaties, travel writing, and floras. Some highlights:

One gem is the Letters from New Zealand, 1857–1911 by clergyman Henry W. Harper, who spent 9 years on the West Coast. Advice given to him before moving to Hokitika: “Take an old hand’s advice, don’t be discouraged, and if it rains, let it rain.”

But there is very little New Zealand work, and a need for more volunteers here to get busy expanding and correcting Wikisource’s holdings. There is also plenty to do sorting and categorising what’s already been done, sourcing better images, finding scanned versions of works already done, and creating author pages.

To add a text to Wikisource, it needs to be scanned at a minimum of 200 dpi for OCR to work, but 250–300 is fine. You can also source already-scanned works from the Internet Archive, the Biodiversity Heritage Library, or the Hathi Trust (which has clearer scans than the Internet Archive). An example text that could be brought into Wikisource would be the Hathi Trust scan of Horatio Gordon Robley’s Pounamu: notes on New Zealand greenstone (1915). I’ve written more on the process of scanning and OCR in another blog post. Once in Wikisource, each page needs to be both proofread and validated by volunteers – and they have to be two different volunteers (also known as Wikisourcerers). As this is happening the work can be transcluded into the main namespace, which is Wikisource jargon for being turned into a live digital document available for download.

George Marriner’s 1908 The Kea: a New Zealand problem, badly OCR’d in the Biodiversity Heritage Library, and a good candidate for Wikisource.

One example of a Wikisource project happened during COVID lockdown in the UK, when the staff of the National Library of Scotland (NLS) were sent home. The NLS had an extensive collection of scanned pamphlets – for example, The surprising adventures, miraculous escapes, and wonderful travels of the renowned Baron Munchausen – which had been (poorly) OCR’d, but needed to be proofread by humans. The resulting Wikiproject, advised by Beeswaxcandle, uploaded thousands of works, and over 1100 were eventually transcluded, by dozens of NLS staff, from April to August 2020. The NLS was then able to take the corrected text and reimport it to their database.

The talk was well received, with plenty of discussion, and may well be a catalyst for West Coast heritage organisations using Wikisource to make their collections more accessible. Watch this space.

Digitising a tiny book

Many books about the West Coast are a) out of print, b) out of copyright, and c) printed in very small runs. Consequently, this history is only available in a few libraries, so only accessible to a few thousand people.

One solution is to reprint these out-of-copyright books, and a couple of small New Zealand publishers have started doing this, but the results are expensive and not very high quality, and the distribution is still quite limited. How could we make this West Coast history available to millions of people, ideally for free?

A simple way to do this would be to digitise the books and release them online, so anyone can download them for free, for reading as an ebook or to print their own copy. The problems are twofold: the work of typing and proofreading, and finding a permanent place to host the result. One solution is to use the Wikimedia Foundation project Wikisource, a repository for free public-domain texts, which volunteers collaboratively transcribe and proofread in their free time. Here’s how it works, using a small book from the Westland District Library collections and easily-available hardware and software.

Hokitika, N.Z. is a 24-page pamphlet containing a talk by Westland County Clerk David Evans on the origins of Hokitika, essays by William Evans (no relation) and journalist Samuel Saunders, and an excerpt from the writings of Julius von Haast. It was printed as a fund-raising booklet by the Hokitika Guardian in 1921; apart from us, the only libraries in the world which have a copy are in Dunedin, Wellington, and Adelaide. Being pre-1925 means it’s out of copyright in the USA (something that Wikisource requires, because that’s where its servers are hosted).

I started by scanning each page as monochrome text TIFF files at 400 dpi. This didn’t require a fancy scanner, just the library’s multifunction photocopier (an ApeosPort VII) and the free Image Capture that came with my MacBook. A dedicated scanner and software would certainly have sped things up, but they weren’t necessary. It’s important to crop pages evenly and quite close to the text, because a page margin gets added later during printing. I took extra time to clean up the scans in Affinity Photo and adjust the contrast, erase spots and lines, and straighten the columns.

I used Print > PDF in Preview to export the scans as a single PDF; Preview let me reshuffle pages and drop photographic plates into the right place. I could then print the PDF 2-up on A4 using Preview’s booklet layout, and with a long-arm stapler could turn them into an A5 pamphlet (be sure to choose Print Entire Image or the text can be cropped). At this stage we now have a printable PDF which can produce a far better copy of the original than a photocopier could.
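
Preview's booklet layout works out the page order for you, but it helps to know what that order is when a printout comes out scrambled. The sketch below shows the usual arrangement for a stapled booklet, where each folded sheet carries four pages. It is not Preview's actual code, and the function name is invented for illustration.

```python
def booklet_sheets(page_count: int):
    """Return (front, back) page pairs for each sheet of a stapled booklet.

    Each pair is (left, right) as printed; None marks a blank padding page.
    Sheets are listed from the outermost (cover) to the innermost (centrefold).
    """
    # Pad up to a multiple of four: one folded sheet carries four pages.
    padded = -(-page_count // 4) * 4

    def page(n):
        return n if n <= page_count else None

    sheets = []
    for i in range(padded // 4):
        front = (page(padded - 2 * i), page(1 + 2 * i))
        back = (page(2 + 2 * i), page(padded - 1 - 2 * i))
        sheets.append((front, back))
    return sheets


# A 24-page pamphlet needs six sheets of A4.
for number, (front, back) in enumerate(booklet_sheets(24), start=1):
    print(f"sheet {number}: front {front}, back {back}")
```

The outermost sheet pairs the last page with the first, and the innermost sheet holds the two facing pages in the middle of the booklet.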

I used Affinity Photo to stitch together a better version of the pamphlet’s cover – without library stickers – and replaced the one in the PDF. Then I uploaded the PDF to Wikimedia Commons, as well as all the illustrations separately, all the typographic ornaments as high-resolution 1-bit (black and white, not greyscale) TIFFs, and the fancy headings (just because I thought they were a bit hilarious). The images, cover, and PDF were then all available in Category:Hokitika, N.Z : the Birth of the Borough (1921) in Commons for anyone to use.

The PDF and all the images need to be uploaded to Wikimedia Commons first, so Wikisource can use them to assemble pages.

With a clean PDF in Wikimedia Commons, I could create an Index in Wikisource, which lists all those pages and shows whether they’ve been proofread or not. This is a critical step when working with volunteers, who can see at a glance which pages to work on. A page in Wikisource is first proofread side by side with its PDF scan, then saved, and finally verified by a different editor, so every page gets checked at least twice. When all the pages are validated the book can be made available as a [Wikisource work](https://en.wikisource.org/wiki/Hokitika,_N.Z.): essentially becoming a long web page with digital text. There’s a certain amount of formatting, and images are inserted in the right places, but the type size and font are up to the reader’s settings or screen reader.
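
The two-person rule described above can be sketched as a small toy model. This is not how Wikisource stores page status; the class, method, and user names are all hypothetical, and the sketch only illustrates the rule that a page's validator must be a different editor from its proofreader.

```python
class Page:
    """Toy model of one scanned page's progress through the Index."""

    def __init__(self, number: int):
        self.number = number
        self.proofread_by = None
        self.validated_by = None

    @property
    def status(self) -> str:
        if self.validated_by:
            return "validated"
        if self.proofread_by:
            return "proofread"
        return "not proofread"

    def proofread(self, user: str) -> None:
        self.proofread_by = user

    def validate(self, user: str) -> None:
        if self.proofread_by is None:
            raise ValueError("page must be proofread before it is validated")
        if user == self.proofread_by:
            raise ValueError("validation needs a different editor")
        self.validated_by = user


def all_validated(pages) -> bool:
    """True once every page has been checked twice."""
    return all(p.status == "validated" for p in pages)


page = Page(1)
page.proofread("FirstVolunteer")   # hypothetical usernames
page.validate("SecondVolunteer")
print(page.status, all_validated([page]))
```

An Index page is in effect a display of this status for every page at once, which is what lets volunteers see at a glance where help is needed.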

The typesetter at the Hokitika Guardian was determined to use all the fancy fonts and squiggly rules in the type cases.

Rather than transcribe each page by hand, I used Optical Character Recognition (OCR) software to generate a rough draft. There are plenty of free services that will perform OCR on an uploaded image, and the best ones seem to use the software Tesseract. First I tried uploading single pages to PDF24 Tools: a page took 1 minute to OCR, and 7 minutes to manually clean up ready for upload (cleanup is just sorting line ends and column breaks; the proofreading still has to happen). I also tried the Mac and Windows software PDF OCR X; the community edition can only convert one page at a time, but was reasonably speedy and understood two-column pages. OCRing and uploading the 23 pages of text took half an hour. Wonky columns cause big problems for OCR software, so I was glad I spent some time cleaning up the scans first.

As a new transcription the book briefly featured on Wikisource’s home page, just below the intriguing-sounding Dream of a Rarebit Fiend.

Once digitised, proofread, and verified, the book can be downloaded from Wikisource as an EPUB (all e-readers except the Kindle), PDF, or MOBI (for Kindles).
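
The download links on a Wikisource page point at a separate export tool, so a link for a particular format can also be built by hand. The sketch below assumes the tool lives at ws-export.wmcloud.org and takes `format`, `lang`, and `page` parameters; treat both the address and the parameter names as assumptions to check against the download link on the work's own page.

```python
from urllib.parse import urlencode

# Assumption: the export tool's address and parameter names (verify before relying on them).
EXPORT_TOOL = "https://ws-export.wmcloud.org/"


def export_url(title: str, fmt: str = "epub", lang: str = "en") -> str:
    """Build a download link for a transcluded work in the given format."""
    query = urlencode({"format": fmt, "lang": lang, "page": title})
    return f"{EXPORT_TOOL}?{query}"


# The page title is taken from the work's Wikisource address.
for fmt in ("epub", "pdf", "mobi"):
    print(export_url("Hokitika,_N.Z.", fmt))
```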

Proofreading took about 10 minutes per page, and an experienced Wikisource editor, User:Beeswaxcandle, validated each one and handled the formatting of headings, page numbers, and even the fancy page rules. Page formatting is not too complicated, and it’s all documented in the Wikisource Help pages, but I’m now preparing a handout for new proofreaders with a list of tips and the common templates you’d use to validate pages. Someone brand new to Wikisource could start proofreading text right away, and leave the more technical stuff to someone else.
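
For planning a longer project, the 10-minutes-a-page figure scales in the obvious way. This is back-of-envelope arithmetic only: the function name is invented, the 200-page book is hypothetical, and real pages with tables or footnotes will take longer than plain text.

```python
def proofreading_hours(pages: int, minutes_per_page: float = 10) -> float:
    """Estimate first-pass proofreading time in hours."""
    return pages * minutes_per_page / 60


# The 23 text pages of this pamphlet, then a hypothetical 200-page book.
for pages in (23, 200):
    print(f"{pages} pages: about {proofreading_hours(pages):.1f} hours")
```

Validation by a second editor comes on top of this, so the estimate covers only the first pass.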

Having the book online as text – not just page images – opens up lots of possibilities. The book can be indexed by Google and the contents more easily discovered (it’s now deposited in the Internet Archive and the Open Library, for example). It can be downloaded as a PDF and read on a tablet, or uploaded to Overdrive to be borrowed from the Westland District Library as an EPUB file – which means it’s now accessible to the sight impaired, who use a screen reader or need to increase font size.

An example of a digitised ebook from Wikisource manually added to the catalogue alongside the physical copies.

The book can be used as a reference in Wikipedia articles and cited in Wikidata; in a future blog post I’ll demonstrate using this book to support a “Streets of Hokitika” Wikidata project.

Digitising Hokitika, N.Z. (1921) has been a proof of concept, and shows there is scope for proofreading and validating longer texts. Short books we can prepare by hand, but longer ones we’ll want to use better scanning technology for. Notably, some West-Coast-related books have already been digitised and OCR’d by the Internet Archive or the Biodiversity Heritage Library, like George Marriner’s 1908 book The Kea : a New Zealand Problem, and could be imported to Wikisource for proofreading right now.

The next step will be to build up a community of Wikisource volunteers, who could be anywhere in the world but ideally here on the West Coast. We now have regular Wikipedia meetups in Hokitika and Greymouth and can suggest proofreading projects to the attendees. In February Wikisource veteran Andrew Wooding is giving a presentation to local GLAM people, and we’ll have a Wikisource workshop at the West Coast WikiCon in Hokitika in March. I’m hoping people interested in genealogy and local heritage or working in museums, libraries, and archives will see the potential for making out-of-copyright books much more available.


Many thanks to Sara Thomas from the National Library of Scotland and Andrew Wooding (User:Beeswaxcandle) for all their help with this project.