An introduction to Wikisource

On 10 February 2021 a Wikisource volunteer who edits under the username Beeswaxcandle gave a workshop to a dozen West Coast librarians, museum workers, and history buffs on how Wikisource works and why it might be useful for digital heritage. These are my notes summarising his talk.

Beeswaxcandle has been a Wikisource editor – a wikisourcerer – since 2009. He began as a Wikipedia editor, but when one of his first articles was promptly shredded by other volunteers he turned to the calmer backwater of Wikisource. He is a musician, and one of his biggest projects there has been transcribing the 1900 Grove Dictionary of Music and Musicians, which is still in progress (he’s up to Schubert).

Wikisource is a free online digital library that anyone can improve. Its logo is an iceberg; much is happening beneath the surface. It was created as a sister project to Wikipedia, and aims to be a reference library of primary source texts. Its value as a repository is not that it contains scanned images of the texts, but that these have been transcribed and proofread by volunteers (at least two per page) so they can be found by search engines.

The great strength of Wikisource is that its transcribed text is backed up by the original scanned pages; anyone can check the source and correct mistakes. It’s useful to compare it to the free ebook library Project Gutenberg. Gutenberg doesn’t include the original scanned text, and will sometimes merge several different editions: there’s no way to check the accuracy of its transcriptions. (Wikisource began back in 2003 as “Project Sourceberg” and numerous Gutenberg transcriptions have been added to Wikisource, even without scanned pages to back them up.)

There are other Wikimedia projects for housing texts that have different goals to Wikisource: Wikimedia Commons can host images or entire PDFs of publications, but it doesn’t transcribe; Wikibooks is a home for annotated publications and study guides. Wikisource, by contrast, is solely for accurate transcriptions of the text as written, not commentary or interpretation.

Wikisource is a small community, with only 421 active users in English and 26 admins. There are substantial efforts in French, Russian, and other languages: 98% of French Wikisource works are backed up by scanned pages, compared with 58% in English. The culture seems more laid-back and friendly than Wikipedia, with people working together to finish someone else’s project as a nice gesture. Rates of vandalism seem low, and the community uses watchlists and the list of recent changes to keep tabs on it. There are regular collaborations and “Proofread of the Month” projects – January is “quirky” month.

Author links to Julius von Haast’s Wikisource page

Usually the only blue hyperlinks in a Wikisource text will be Author pages, which display a short bio and a list of publications. Wikipedia-style linking would be seen as commentary. [To me author pages looked like something that could be generated from Wikidata, but Wikisource seems to have an arm’s-length relationship to Wikidata, and each project blames the other. — Mike] Because Wikisource hosts mostly public-domain works, the authors will usually be long-dead, so no page for J.K. Rowling. The rolling cutoff date for the US public domain is 1925: on January 1st this year The Great Gatsby entered the public domain, and very soon after a full transcribed and downloadable version appeared in Wikisource. [During question time there was some discussion about approaching local historians and convincing them to donate their copyrights to the public domain so some of the small local histories could be made available through Wikisource.]

Out of all the texts published in English from Chaucer to 1925, Wikisource holds about 300,000. That includes plenty of out-of-copyright novels, but Wikisource can also host:

From Why the Shoe Pinches  (1861) by Georg Hermann von Meyer

There are also portals, each a curated collection on a particular subject area. The New Zealand portal (a selection from Category:New Zealand) includes legislation, treaties, travel writing, and floras. Some highlights:

One gem is the Letters from New Zealand, 1857–1911 by clergyman Henry W. Harper, who spent 9 years on the West Coast. Advice given to him before moving to Hokitika: “Take an old hand’s advice, don’t be discouraged, and if it rains, let it rain.”

But there is very little New Zealand work, and a need for more volunteers here to get busy expanding and correcting Wikisource’s holdings. There is also plenty to do sorting and categorising what’s already been done, sourcing better images, finding scanned versions of works already transcribed, and creating author pages.

To add a text to Wikisource, it needs to be scanned at a minimum of 200 dpi for OCR to work, and 250–300 is fine. You can also source already-scanned works from the Internet Archive, the Biodiversity Heritage Library, or the Hathi Trust (which has clearer scans than the Internet Archive). An example text that could be brought into Wikisource would be the Hathi Trust scan of Horatio Gordon Robley’s Pounamu: notes on New Zealand greenstone (1915). I’ve written more on the process of scanning and OCR in another blog post. Once in Wikisource, each page needs to be both proofread and validated by volunteers (also known as wikisourcerers), and the two steps have to be done by two different people. While this is happening the work can be transcluded into the main namespace, which is Wikisource jargon for being turned into a live digital document available for download.
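For the programmers reading, the two-pass rule can be sketched in a few lines of Python. This is a toy model of my own and not how Wikisource is actually implemented: the Page class, its fields, and the progress helper are all made-up names, and the only rule it encodes is the one described in the talk, that validation follows proofreading and must be done by a different volunteer.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    """One scanned page of a work (hypothetical model, not Wikisource's own)."""
    number: int
    proofread_by: Optional[str] = None
    validated_by: Optional[str] = None

    def proofread(self, user: str) -> None:
        # First pass: a volunteer corrects the OCR text against the scan.
        self.proofread_by = user

    def validate(self, user: str) -> None:
        # Second pass: only after proofreading, and by someone else.
        if self.proofread_by is None:
            raise ValueError("page must be proofread before it is validated")
        if user == self.proofread_by:
            raise ValueError("validator must differ from the proofreader")
        self.validated_by = user

    @property
    def is_validated(self) -> bool:
        return self.validated_by is not None


def progress(pages: List[Page]) -> Tuple[int, int, int]:
    """Count pages that are (untouched, proofread only, validated)."""
    untouched = sum(1 for p in pages if p.proofread_by is None)
    validated = sum(1 for p in pages if p.is_validated)
    return untouched, len(pages) - untouched - validated, validated
```

A work with three pages, one validated, one proofread, and one untouched, would report progress of (1, 1, 1); trying to validate a page with the same username that proofread it raises an error.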

George Marriner’s 1908 The Kea: a New Zealand problem, badly OCR’d in the Biodiversity Heritage Library, and a good candidate for Wikisource.

One example of a Wikisource project happened during COVID lockdown in the UK, when the staff of the National Library of Scotland (NLS) were sent home. The NLS had an extensive collection of scanned pamphlets – for example, The surprising adventures, miraculous escapes, and wonderful travels of the renowned Baron Munchausen – which had been (poorly) OCR’d, but needed to be proofread by humans. The resulting WikiProject, advised by Beeswaxcandle, uploaded thousands of works, and from April to August 2020 dozens of NLS staff worked through them; over 1,100 were eventually transcluded. The NLS was then able to take the corrected text and reimport it to their database.

The talk was well received, with plenty of discussion, and may well be a catalyst for West Coast heritage organisations using Wikisource to make their collections more accessible. Watch this space.