Population data linkage at scale
In this project we aim to develop novel algorithms and techniques for linking large databases containing millions of Scottish birth, death, and marriage certificates.
Research focus
The Digitising Scotland project has transcribed all Scottish birth, death, and marriage certificates from 1855 to 1973 into electronic form, and this project builds on that success. While individually these certificates already provide a unique source of information about Scotland’s people and their lives, the full potential of this data collection is realised when individual certificates are linked to reconstruct the life courses of individuals, the pedigrees of families (family trees), or even the full Scottish population over a period of almost 120 years.
We plan to develop novel algorithms and techniques based on innovative blocking and filtering approaches that take temporal and spatial constraints into account (for example, a person can only marry once they have reached a certain age, and the same mother cannot have two births four months apart). These blocking approaches will provide candidate pairs of records that likely refer to the same person. We will then employ unsupervised, machine-learning-based linkage methods to further exploit the different relationships people have in their lives and the uniqueness or ambiguity of their names and addresses, with the aim of identifying links with different levels of confidence.
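To illustrate the idea of blocking combined with temporal constraints, the following is a minimal Python sketch and not the project's actual method. The record fields (id, surname, sex, date), the blocking key, the minimum marriage age of 16, and the 270-day birth gap are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date

MIN_MARRIAGE_AGE = 16   # assumed threshold, for illustration only
MIN_BIRTH_GAP_DAYS = 270  # assumed minimum gap between two non-twin births
TWIN_WINDOW_DAYS = 2      # assumed window within which births count as twins


def blocking_key(record):
    """Hypothetical blocking key: first three surname letters plus sex."""
    return (record["surname"][:3].lower(), record["sex"])


def age_in_years(born, on):
    """Whole years elapsed between a birth date and a later date."""
    had_birthday = (on.month, on.day) >= (born.month, born.day)
    return on.year - born.year - (0 if had_birthday else 1)


def candidate_pairs(births, marriages):
    """Yield (birth id, marriage id) pairs that share a block and
    satisfy the minimum-age-at-marriage constraint."""
    blocks = defaultdict(list)
    for birth in births:
        blocks[blocking_key(birth)].append(birth)
    for marriage in marriages:
        for birth in blocks.get(blocking_key(marriage), []):
            if age_in_years(birth["date"], marriage["date"]) >= MIN_MARRIAGE_AGE:
                yield (birth["id"], marriage["id"])


def births_compatible(first, second):
    """Check whether two birth dates could belong to the same mother:
    either (near-)simultaneous, as for twins, or far enough apart."""
    gap = abs((first - second).days)
    return gap <= TWIN_WINDOW_DAYS or gap >= MIN_BIRTH_GAP_DAYS
```

Only records that share a blocking key are ever compared, which keeps the number of comparisons far below the full cross product of two large registers. The constraint checks then discard pairs that are impossible on temporal grounds before any more expensive comparison is made.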
A major challenge in linking such historical data is the presence of transcription errors in the handwritten certificates, unstandardised addresses, name variations, and missing data. To overcome these challenges we will use fuzzy matching techniques. We will also draw on the domain expertise of historians and demographers to validate the quality of the links we obtain.
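As one simple example of fuzzy matching, the sketch below compares two names using a normalised edit (Levenshtein) distance. This is an illustrative choice and is not necessarily the measure the project will use. The function names and the handling of missing values are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Return a similarity in [0, 1], or None when either value is
    missing, so that missing data is not treated as a mismatch."""
    if not a or not b:
        return None
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return None
    return 1.0 - edit_distance(a, b) / longest
```

With this measure, variants such as "McDonald" and "MacDonald" differ by a single character and so score close to 1, while unrelated names score much lower. Returning None for missing values lets a later step decide how to weigh an absent field instead of counting it as disagreement.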
Data sources
We will use the data produced by the Digitising Scotland project, which contain the Scottish statutory registers of births, deaths and marriages (1855-1973). For more information on these records, visit the Scotland’s People website.
What this will enable researchers to do
We will combine individual linked records into life courses and family pedigrees, which we will ultimately merge into a full reconstructed population. These will provide a new and unique resource for researchers in various disciplines.
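One simple way to turn pairwise links into life courses is to group all records that are connected by a chain of links. The sketch below does this with a union-find structure. It assumes that links are transitive, which accepted links in real data may not always be, and the record identifiers and function name are hypothetical.

```python
def group_linked_records(record_ids, links):
    """Group records into clusters connected by pairwise links.

    record_ids: iterable of record identifiers.
    links: iterable of (id_a, id_b) pairs judged to be the same person.
    Returns a sorted list of sorted clusters.
    """
    record_ids = list(record_ids)
    parent = {record: record for record in record_ids}

    def find(record):
        # Follow parent pointers to the cluster root, shortening the path.
        while parent[record] != record:
            parent[record] = parent[parent[record]]
            record = parent[record]
        return record

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for record in record_ids:
        clusters.setdefault(find(record), []).append(record)
    return sorted(sorted(cluster) for cluster in clusters.values())
```

Given a birth record linked to a marriage record, and that marriage record linked to a death record, all three end up in one cluster that represents a single person's life course. Records with no links remain as clusters of size one.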
Research team
Project Lead: Professor Peter Christen, Eilidh Garrett, Charini Nanayakkara, Lee Williamson and Professor Chris Dibben.