DATA INSIGHTS - Automatic Coding of Occupations: Methods to create the Scottish Historic Population Database (SHPD)

This Data Insights from researcher Dr Richard Tobin explains how the SCADR team are using machine learning to code over 25 million Scottish civil registration vital events records, including births, marriages, deaths and occupations.

Data Insights Overview 
In 1855, a national system of compulsory registration of births, deaths and marriages was introduced in Scotland. These records provide a unique source of information about people and their lives. However, because they were handwritten until 1973, analysing them for research was very difficult and time-consuming. Thanks to the Digitising Scotland project, they have now been transcribed and exist in electronic form, and the work described in this Data Insights develops them further to make them usable for research.
Richard outlines how he and his colleagues at SCADR are working with the Centre for Data Digitisation and Analysis (CDDA) at Queen's University Belfast to translate records from the Digitising Scotland project into standard coding schemes, so that they can be analysed at scale and new meaning and insights can be discovered. The team are using a relatively small subset of the records, manually coded by CDDA, to automatically code the full set by applying Natural Language Processing and Machine Learning techniques. The result will be a research-ready database, the Scottish Historic Population Database (SHPD), that includes coded occupations and causes of death. It will be available to researchers, offering them a fantastic resource of approximately 25.6 million individuals dating back to 1855.
Find out how they are coding the records, how successful training on the dataset has been and the challenges they faced in the full Data Insights.
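To make the general idea concrete, the sketch below shows one simple way a small, manually coded subset can be used to code unseen occupation strings automatically. It is an illustration only and not the SCADR team's pipeline: the classifier (a Naive Bayes model over character trigrams), the class name NaiveBayesCoder, the example strings and the codes AGRI, TEXTILE and MINING are all assumptions made for this example, not part of any real coding scheme.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Lower-case the string and split it into overlapping character n-grams."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesCoder:
    """Hypothetical multinomial Naive Bayes coder with add-one smoothing."""

    def fit(self, strings, codes):
        self.class_counts = Counter(codes)          # records seen per code
        self.feature_counts = defaultdict(Counter)  # n-gram counts per code
        self.vocab = set()
        for text, code in zip(strings, codes):
            grams = char_ngrams(text)
            self.feature_counts[code].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_code, best_score = None, -math.inf
        for code, n_records in self.class_counts.items():
            counts = self.feature_counts[code]
            denom = sum(counts.values()) + vocab_size
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(n_records / total)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_code, best_score = code, score
        return best_code


# Invented stand-in for a manually coded subset (codes are hypothetical labels).
labelled = [
    ("farm servant", "AGRI"),
    ("ploughman", "AGRI"),
    ("farm labourer", "AGRI"),
    ("cotton weaver", "TEXTILE"),
    ("handloom weaver", "TEXTILE"),
    ("wool spinner", "TEXTILE"),
    ("coal miner", "MINING"),
    ("ironstone miner", "MINING"),
]
strings, codes = zip(*labelled)
coder = NaiveBayesCoder().fit(strings, codes)

# Code strings that were never manually coded, including an abbreviated one.
for unseen in ["farm servt", "power loom weaver"]:
    print(unseen, "->", coder.predict(unseen))
```

Character n-grams are used here because transcribed historical records often contain abbreviations and spelling variants, which whole-word matching would miss. A real system would need far more training data, a recognised coding scheme and an evaluation against held-out manually coded records.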
What does this research enable?

Professor Chris Dibben explains:

Using machine learning and natural language processing techniques has been essential for generating data from an information source of some 23 million separate strings of text. It would simply not be economically viable to use human coders. The data produced from these data science approaches to Scotland’s civil registration records will now allow demographic, economic and historic analysis over the nineteenth and twentieth centuries in Scotland. Importantly, it will allow researchers to better understand how the condition of Scotland’s living population is a product of their and their parents’ pasts.

Once complete and available, this resource will enable researchers to answer a range of research questions, such as:

  • Whether people lived in agricultural or urban areas, and whether changes in occupation explain why, by 1901, the majority of the population lived in urban areas
  • Whether people who worked in certain occupations were more susceptible to certain diseases
  • Whether children from working-class families were more likely to die in infancy than those of professional, upper-class parents
  • Whether children of parents who had professional occupations were better educated than children of the same age with working-class parents
  • The relative importance of social class and environment in influencing early age mortality.
Future Projects

Linking this database with the 1936 Birth Cohort will allow us to follow the occupations, marriages, births and deaths of the children who sat the Scottish Mental Survey in 1947 (every Scottish schoolchild born in 1936 sat the same mental ability test).

This article was published on 11 Nov 2021