BLOG - Automating Coding for Large Historical Datasets
Colleagues from our eCohorts strand led an online workshop this month which showcased the innovative coding methods that we are developing.
On 5th May 2020, SCADR hosted an International Workshop on Automated Coding for research teams from the UK, Denmark, Norway and Sweden, held virtually due to the COVID-19 pandemic. The workshop brought together computer scientists, social scientists, historians and historical demographers, each contributing their experience and expertise to discussions of the complexities of applying coding schemes to variables such as occupation or cause of death.
Increased computing power in recent years has meant that historians are now moving from studying single communities or sample populations to exploring patterns and processes amongst the inhabitants of whole regions or countries. This often means making very large sets of data machine readable. The transcription, cleaning and coding of original, often handwritten sources such as census records and church registers are time-consuming, labour-intensive and expensive stages in the research process, and researchers are increasingly seeking ways of automating them.
The aim of the workshop was to discuss the possibility of applying coding schemes that would allow international comparison between the sources available in the participating countries. It was recognised that not only would there be language barriers to overcome, but that even within countries the terms used might vary or change their meaning over time. New terms would emerge and older expressions die out. The way sources recorded information could also vary, and this would have to be taken into account when coding, interpreting and analysing the data.
Teams from the different nations shared presentations and experiences. The UK team members come from the Universities of Edinburgh, St Andrews and Cambridge and have all been involved in the wider Digitising Scotland project, led by Professor Chris Dibben. This work now involves researchers from SCADR who are focusing on making the data research-ready. Richard Tobin introduced his work using natural language processing to assign codes to occupations and causes of death in the data collected through Digitising Scotland.
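The blog does not go into the details of that natural-language-processing work, but the basic idea of mapping transcribed, free-text occupation strings onto a standard coding scheme can be sketched in a few lines. The dictionary, the code values and the matching threshold below are illustrative assumptions only, not the Digitising Scotland pipeline:

```python
# Illustrative sketch only: a toy example of assigning codes to free-text
# occupation strings, not the actual Digitising Scotland pipeline.
from difflib import get_close_matches

# Hypothetical lookup from normalised occupation titles to HISCO-style codes
# (the code values here are made up for illustration).
OCCUPATION_CODES = {
    "coal miner": "71105",
    "domestic servant": "54020",
    "agricultural labourer": "62105",
    "hand loom weaver": "75400",
}

def code_occupation(raw):
    """Return a code for a transcribed occupation string, or None if no match."""
    cleaned = " ".join(raw.lower().split())   # normalise case and whitespace
    if cleaned in OCCUPATION_CODES:           # exact match first
        return OCCUPATION_CODES[cleaned]
    # Fall back to fuzzy matching to absorb spelling variation in the sources.
    matches = get_close_matches(cleaned, list(OCCUPATION_CODES), n=1, cutoff=0.8)
    return OCCUPATION_CODES[matches[0]] if matches else None

print(code_occupation("Coal  Miner"))            # exact match after normalisation
print(code_occupation("agricultural labourr"))   # recovered by fuzzy matching
print(code_occupation("unknown trade"))          # None
```

In practice the teams deal with far larger vocabularies, multiple languages and terms whose meaning shifts over time, which is exactly why the cross-national discussion matters.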
The Danish team was led by Barbara Revuelta Eugercios and Anne Løkke from the BioHistory Research Group and the Centre for Health Research in the Humanities, University of Copenhagen. Both are closely involved in the Link-Lives project, working to create an historical reconstruction of the Danish population. Hilde Sommerseth, Director of the Norwegian Historical Data Centre at The Arctic University of Norway, brought members of the team who work on the Historical Population Register for Norway. The Swedish team was headed by Elisabeth Engberg and Maria Hiltunen Maltesdotter from the Centre for Demographic and Ageing Research at Umeå University, which hosts the Demographic Data Base (DDB).
Across this range of projects, the workshop explored the different programmes and platforms being used and the challenges of working across disciplines and languages. Future workshops are planned to unite the teams’ computer scientists and bring twenty-first-century technology to bear on eighteenth-, nineteenth- and twentieth-century population data in greater detail, with a proposed Hackathon to put the ideas into practice.
If historic information on individuals can be made machine readable, it becomes possible to create very powerful research resources at scale. By automatically linking these records through time, we can reconstruct someone’s life: what jobs they did, where they were living when their first child was born, or who they married. Using data science approaches, these life studies or ‘eCohorts’ can be produced for tens of thousands of individuals. This allows us to explore important questions about how health and wellbeing are affected by life events, studied at a whole-population level.
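As a rough illustration of what linking records through time produces, the sketch below groups records that have already been linked to a shared person identifier and orders them into a simple timeline. The field names and example entries are invented for illustration; real linkage is far more involved, matching names, dates and places across sources.

```python
# Minimal sketch of assembling a life course (an 'eCohort' entry) from records
# already linked to a shared person identifier.
# Field names and example data are invented for illustration.
from collections import defaultdict

records = [
    {"person_id": "p001", "year": 1881, "source": "census",   "detail": "domestic servant, Leith"},
    {"person_id": "p001", "year": 1884, "source": "marriage", "detail": "married a joiner"},
    {"person_id": "p001", "year": 1886, "source": "birth",    "detail": "first child born, Edinburgh"},
    {"person_id": "p002", "year": 1891, "source": "census",   "detail": "coal miner, Fife"},
]

def build_life_courses(linked_records):
    """Group linked records by person and order each group into a timeline."""
    lives = defaultdict(list)
    for rec in linked_records:
        lives[rec["person_id"]].append(rec)
    return {pid: sorted(events, key=lambda r: r["year"]) for pid, events in lives.items()}

for pid, events in build_life_courses(records).items():
    print(pid)
    for event in events:
        print(f"  {event['year']}  {event['source']:<8}  {event['detail']}")
```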
This article was published on 29 May 2020