Historical and present-day Scots and Scottish English corpora
The following lists of Scots and Scottish English language corpora were compiled at the end of 2016. There are three lists in total: Historical Scots, i.e. corpora containing pre-1700 materials; Modern Scots, i.e. corpora containing post-1700 materials; and corpora under construction. None of these lists claims to be exhaustive, nor do they specify the format of each corpus, its degree of linguistic annotation (if any) or any copyright restrictions that may apply. Such information can be obtained via the URL or email address provided.
If you know of other materials that could usefully be added to this list, please send a brief description to AMC@ed.ac.uk.
Historical Scots (pre-1700)
- A Linguistic Atlas of Older Scots (LAOS) is an online linguistic atlas that shows what non-literary written Scots was like between about 1380 and 1500. The atlas has been compiled from some 1,200 documentary texts which have been transcribed and linguistically annotated.
- The Helsinki Corpus of Older Scots covers the period 1450-1700 and supports studies of the last stages of the differentiation of the northern English dialect, the rise of a distinctive Scottish variety of English and the anglicization of Scots. It is available on CD-ROM from:
- The records of the Parliaments of Scotland 1424-1707 (with modern translations) are viewable online.
- The Older Scottish Textual Archive (aka The (Edinburgh) DOST Corpus) is a collection of texts prepared for the Dictionary of the Older Scots Tongue.
- Individual corpus components are also downloadable, e.g.
- The Aberdeen Burgh Records project provides online access to digital copies of Aberdeen’s Council Registers 1398-1511.
- The Breadalbane Collection is an eclectic gathering of 16th-century documents of the Campbells of Glenorchy.
- The Helsinki Corpus of Scottish Correspondence comprises 0.4 million words from the period 1540-1750.
- Helsinki Corpus of Scottish Correspondence (1540-1750) – META-SHARE
Modern Scots (post-1700)
- Glasgow University’s Corpus of Modern Scottish Writing contains some 350 documents from the period 1700-1945.
- The Linguistic Atlas of Scotland (3 volumes, available via EUL) is a comprehensive dialectological study of Lowland Scotland, Orkney and Shetland (and Northern Ireland, Northumberland and Cumberland) by the Linguistic Survey of Scotland. It provides a wealth of word-geographical material and phonological findings as well as detailed cartographic analyses of Scottish dialects.
- A Corpus of Dramatic Texts from Glasgow was compiled in the mid-1980s and comprises a collection of (then) contemporary plays. Access may be granted on written application to its compiler:
- jk at etinu dot com
- Katja Lenz compiled a corpus of twelve post-WWII dramatic texts in Scots, which she would be willing to share with interested researchers on written application.
- katja dot lenz at uni-koeln dot de
- The University of Edinburgh’s School of Scottish Studies has a Sound Archive and a small manuscript collection. It also houses the original field materials from the Linguistic Survey of Scotland.
- Glasgow University’s Scottish Corpus of Texts and Speech contains written and spoken samples of post-1940 Scots and Scottish English.
- The University of Edinburgh’s Phonetics Recording Archive contains recordings made by staff and students from the mid- to late-1900s. Many of the recordings are of Scots and Scottish English speakers.
- The Sounds of the City project at Glasgow University looks into speech sounds of Glasgow past and present based on its ever-expanding corpus of recordings. The corpus is not publicly available but access may be granted on written application.
- The How Stable is the Standard project has produced a corpus of male SSE speakers from the 1970s and 1990s from three age groups. The corpus is not publicly available but access may be granted on written application to:
- The University of Edinburgh’s HCRC Map Task Corpus contains 128 digitally recorded and transcribed unscripted dialogues involving 64 (mostly Glaswegian) undergraduate students at the University of Glasgow.
- Glasgow University can provide access to two Shetland Scots corpora: one of vernacular Shetland Scots gathered from 30 speakers stratified by age and gender, the other a follow-up corpus of bidialectal data. These corpora are not publicly available but access may be granted on written application to:
- Jennifer dot Smith at glasgow dot ac dot uk
- As part of a study of what Polish immigrants do with the variation that exists in the English language around them, speech data were collected from 21 Edinburgh-born and 16 Poland-born adolescents living in Edinburgh.
- The West Fife High Pipe Band Corpus consists of 38 hours of conversation from a group of 54 speakers. The corpus is not publicly available but access may be granted on written application to:
- lynn dot clark at canterbury dot ac dot nz
- Thorsten Brato collected around 40 hours of speech data from c. 100 Aberdonians in 2006/07. Most of the speakers are children and teenagers from different parts of the city and social backgrounds, but there are also recordings with adults. The corpus is not publicly available but access may be granted on written application to:
- Thorsten dot Brato at sprachlit dot uni-regensburg dot de
- The Fisher Speak project investigated lexical attrition in five Scottish East Coast fishing communities. Its data (from dictionaries, wordlists and fieldwork) are not publicly available but access may be granted on written application to:
- millar at abdn dot ac dot uk
- ASPECT (Access to Scottish Parliamentary Election Candidate Materials) is a digital archive of leaflets and newsletters produced by candidates and political parties for the Scottish parliamentary elections in 1999 and 2003.
- The Scots Syntax Atlas presents the results of over 100,000 acceptability judgments from over 500 speakers on over 250 morphosyntactic phenomena. The Atlas also contains a text-to-sound aligned corpus of spoken data totalling 275 hours and over 3 million words.
- A Corpus of 21st Century Scots Texts, created by Chris Gilmour, is a database featuring word frequency statistics from texts written in Scots over the last twenty years. It was created to provide support in determining the appropriate spellings and usage of words in various Scots dialects. Texts were acquired and scraped from websites, journals, social media and books and tagged by regional dialect of Scots.
- Speak for Yersel (2021-22) is a digital resource which uses user input to map the different varieties of Scots spoken across Scotland and people’s attitudes towards these different varieties. The resource also includes a number of activities to explore further the linguistics of Scotland that Speak for Yersel reveals.
Corpora Under Construction
- The FITS project (From Inglis to Scots) is in the final stages of completion and is intended to become an online corpus of spelling-sound correspondences for every form of every item of Germanic origin in the LAOS corpus. (The LAOS corpus is described in the ‘Historical Scots’ list above.) The FITS corpus will examine the impressive range of spelling variants and explicate in unprecedented detail the historical development of each one of these forms from its pre-Scots etymon.
- A Corpus of Nineteenth-century Scottish Correspondence aims to be a 500,000-word corpus of diplomatically transcribed private and business letters.
- The ICE-Scotland project aims to compile by 2018 a 1-million-word corpus of spoken and written 21st-century Scottish English. The corpus will contain the text categories and annotations specified by the parent project (the International Corpus of English) with additional linguistic annotations such as part-of-speech tags and phonetic transcriptions.
- The Aberdeen Corpus of Older Scots (1375-1513, ‘ACOS’) is currently under construction but will provide online access to the Brus and all poems by Dunbar and Douglas. For further information, contact Charles-Henri Discry:
- c dot h dot discry at uu dot nl