Why link data sets?

The Institute of Data Science hosted its third research seminar. Eli Sapir, Assistant Professor at the Faculty of Arts and Social Science, talked about his data science research on large-scale data analysis from a social science perspective. The theme of his talk was the importance of linking different surveys in order to gain more meaningful insights and to be able to research bigger topics.

What data is available?

When I think about data being available, kinda “out there”, I think about text on websites, clicks recorded by companies, tweets, Facebook posts, likes and so on. Surveys collected by institutes and made available to others do not often come to mind. However, Eli Sapir began his talk by detailing how much survey and interview data is already available on various websites.

  • Since 1975, Eurobarometer has conducted a thousand interviews in each European country about the motivations, feelings and reactions of selected social groups towards a given subject or concept.
  • The ICPSR hosts several data sets that together tally up to more than 4 million variables.
  • The World Values Survey has conducted research on values in several countries for many years.
  • OECD provides economic data on countries.

That is a lot of data available to social scientists. Just there for us to grab. This data can be combined with other online data, such as social media data collected purposefully by a researcher. This is an approach advocated by Matthew Salganik in his book Bit by Bit (aff. link). I highly recommend this book to anyone planning to do research with online data.

Linking data sources is challenging

While there is a lot of data available, the sources are often not combined. Results from one survey might be used, but not linked with other sources. According to Eli, this happens because of:

  • Lack of harmonization between surveys: Variables are named, operationalized, and measured differently, making it more time-consuming to combine variables from different surveys. There is also the desire to create something better: a more accurate measurement instrument. While these efforts are laudable, they make it harder to link data coming from different surveys.
  • Lack of tools: Most statistical packages used by social scientists work with matrices: observations in rows and variables in columns. When variables from different data sets are combined, the size of the data set can explode, making it impossible to store it or to run models on conventional computers.
  • Lack of training: The training many social scientists receive is still pretty traditional. Classical literature is taught, and classical statistical methods are taught.
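To make the harmonization point a bit more concrete, here is a minimal sketch in Python of what that work can look like in practice. Everything in it is invented for illustration: the two surveys, the variable names (`trust_gov`, `q17_confidence`), the scales, and the values are my assumptions, not something Eli showed. The idea is simply that the same concept has to be renamed and rescaled before records from two surveys can sit in one data set.

```python
# Hypothetical records from two surveys that measure the same concept
# under different names and on different scales (all values invented).
survey_a = [{"country": "NL", "trust_gov": 4}, {"country": "DE", "trust_gov": 2}]
survey_b = [{"country": "NL", "q17_confidence": 7}, {"country": "DE", "q17_confidence": 5}]

# Assumed harmonization rules:
# source variable -> (common name, scale minimum, scale maximum).
RULES = {
    "trust_gov": ("trust", 1, 5),        # 1-5 scale in survey A
    "q17_confidence": ("trust", 0, 10),  # 0-10 scale in survey B
}


def harmonize(record, source):
    """Rename known variables and rescale them linearly to the 0-1 range."""
    out = {"source": source}
    for key, value in record.items():
        if key in RULES:
            name, low, high = RULES[key]
            out[name] = (value - low) / (high - low)
        else:
            out[key] = value  # identifiers such as country stay unchanged
    return out


# Stack the harmonized records from both surveys into one list.
combined = (
    [harmonize(r, "A") for r in survey_a]
    + [harmonize(r, "B") for r in survey_b]
)

for row in combined:
    print(row)
```

Even in this toy case a decision is hidden in the rules: a linear rescaling assumes that a 4 on a five-point scale means the same as a 7.5 on an eleven-point scale, which is exactly the kind of choice that makes harmonization time-consuming.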

While all three arguments are valuable, in my eyes the third one, the lack of training, carries most of the weight. My problem with the training social scientists receive is that it is driven too much by the knowledge and expertise of the teacher. While this might make sense, and it is of course more efficient to teach the topics we know and enjoy, it limits the speed of innovation. We, and this includes all social scientists, are pushing the responsibility to discover new research methods and research topics onto our inexperienced students. I do not advocate eliminating all teaching about classical studies, but a balanced approach between classical methods and new methods is necessary.

Giving researchers the tool to link data sets

Eli’s solution is not to teach researchers how to link different data sets, but to provide them with a tool that links them. The idea behind this is that there are many decisions that need to be made when combining different data sources, and this tool will guide researchers through the decision-making process and provide them with a data set.

Potential questions you need to ask yourself when combining data sets:

  • What variables am I interested in?
  • Do I want to aggregate certain variables across groups?
  • If I aggregate, what aggregation measure do I want?
  • If I aggregate, do I want to weight the aggregation measure (e.g., mean) by the variance in subgroups?

For example, let’s assume I’m researching country-level factors that influence the innovation index of a country, and I have country-level information (e.g., GDP, income per capita, R&D spending, spending on education, infrastructure investment) and citizen-level information (consumer spending, education level, career mobility, income, and gender). As country is the level of my analysis, I need to aggregate the citizen-level information. How do I do that? I can simply take the mean of all the variables that I have. But I can also decide to create a weighted mean, of income for example, and weight it by the proportion of men and women to take the pay gap into account.
Combining data sets isn’t as simple as gluing them together: decisions need to be made, and they have an impact on the outcome.
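Here is a minimal sketch in Python of the two aggregation choices from the income example. The records, the country codes, and the 50/50 population shares are all invented by me, and the weighting scheme (mean income per gender, weighted by each gender's assumed share of the population) is only one possible reading of "weight it by the proportion of men and women". It is not the method of Eli's tool.

```python
from collections import defaultdict
from statistics import mean

# Invented citizen-level records: (country, gender, income).
citizens = [
    ("NL", "m", 50000), ("NL", "m", 54000), ("NL", "m", 52000), ("NL", "f", 44000),
    ("DE", "m", 48000), ("DE", "f", 42000), ("DE", "f", 40000),
]

# Assumed share of each gender in the population (hypothetical values).
POPULATION_SHARE = {"m": 0.5, "f": 0.5}


def simple_mean(records):
    """Country-level income as the plain mean over all sampled citizens."""
    by_country = defaultdict(list)
    for country, _gender, income in records:
        by_country[country].append(income)
    return {country: mean(incomes) for country, incomes in by_country.items()}


def gender_weighted_mean(records, shares):
    """Country-level income as the share-weighted mean of per-gender means.

    Assumes every gender in `shares` appears in every country's sample;
    otherwise the weights used for that country would not sum to one.
    """
    by_group = defaultdict(list)
    for country, gender, income in records:
        by_group[(country, gender)].append(income)
    result = defaultdict(float)
    for (country, gender), incomes in by_group.items():
        result[country] += shares[gender] * mean(incomes)
    return dict(result)


print(simple_mean(citizens))
print(gender_weighted_mean(citizens, POPULATION_SHARE))
```

In this made-up sample the Dutch respondents are mostly men, so the plain mean for NL comes out at 50000 while the weighted mean is 48000. Same data, different country-level variable, and whatever model I run afterwards inherits that choice.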

Why create a tool instead of teaching researchers how to do it?

A tool is only useful if its potential users know why they should be using it and how to use it. I’m confident that Eli’s team will make the how clear: They are knowledgeable about how to combine data sets and have great expertise in guiding students. That should be a good foundation for creating an easy-to-use tool.
My concern is with making researchers aware that the tool exists and, in addition, with highlighting why it should be used. This concern is based on the still strong assumption that most researchers work in silos. Our own domains are more important, and the relevance of other domains is perceived to be low. While interdisciplinary research has increased, I do not think that it has reached a tipping point where it becomes mainstream. Going back to Eli’s arguments for why data sets are not combined, I think that the lack of harmonization and the insufficient training are the points we need to tackle first, before creating tools to combine data sets. At the very least, this should be done concurrently.

I would even go so far as to say that the training of researchers needs to be reconsidered completely to ensure that we, employees at universities who have earned the three magical letters P H D, are developing researchers who are able to conduct meaningful research in tomorrow’s world. We need to provide them with the tools to combine data sets, analyze large amounts of data, and move away from traditional survey design.

In the words of Eli, which Nosh Contractor mentioned already, albeit slightly differently: By giving and teaching individuals tools and research methods, they realize what research questions can be asked.
