The underwear of data science

For those who follow my research series: yes, I’m still analyzing academic vacancies for an upcoming conference, but the ERGM model takes far too long to run, so in frustration I’m doing something simpler until I find a way to do what I want to do. I’m most probably going to focus on Dutch research universities, and maybe work towards a multilevel model (departments nested within universities). But I’m not sure how to enter scientific fields in this model, as they form a cross-cutting level between departments and universities.

While I’m gonna keep on thinking about how to approach this task, enjoy reading this post on open science.

Thursday 25th January 2018 was the pre-DIES symposium at Maastricht University. DIES is the event organized at Maastricht to celebrate the establishment of the university. This year the theme was data science. The Institute of Data Science organized several competitions, some of which I wanted to participate in, but my plate was too full and it slipped my mind.

Due to other commitments, I was only able to attend the keynote during the pre-DIES. While I would have loved to be at the DIES, I had to leave for Switzerland, as I had registered earlier for a workshop organized by women++ on Machine Learning for News: Theory, Applications and Visualization in Python at the 2018 Applied Machine Learning Days.

The underwear of open science

The little bit I saw of the pre-DIES was inspiring. The keynote was given by Carole Anne Goble. Carole is an expert in knowledge and information management and used a sticky metaphor to explain her work. It is the type of work you need but nobody wants to see: underwear work. She takes care of the data infrastructure that others need to do research, analyze data sets and so on. The metaphor was fitting, in the sense that her work is the invisible backbone that makes research possible. It is necessary, but it is often not credited.

With this message in mind, I want to give credit right now where credit is due: my project “The future of business school” is only going to be successful thanks to the Institute of Data Science, specifically the work of Alex Malic.

At IDS they are working on a data infrastructure that will enable the easy (!) storage, analysis and sharing of data for research purposes. I’m really looking forward to the day the infrastructure is up and running. Hopefully this will happen before my contract ends this year. Given the relatively short time I still have at Maastricht University, I might not be able to use it. Nevertheless, it is a pleasure to know that other people won’t have to go through the same troubles as I did in collecting, storing, analyzing and sharing data. Just this weekend, I was struggling with a laptop that is running out of power and space to run the analysis I need for Thursday. It is extremely annoying to not be able to use your computer because it just keeps on calculating.

The roadblocks for open science

What I found intriguing about Carole’s keynote was her explanation of how the current research process is hindering progress in open science. She described research as a two-stage process:

  1. Doing research: designing studies, collecting data, analyzing data and testing hypotheses.
  2. Writing about the research: writing manuscripts, attending conferences, publishing research.

The problem for open science, according to Carole, is that these two stages are disconnected from each other. Of course, the researchers involved in both stages see it as one process. However, if you are reviewing manuscripts, your access to the decisions made and the information produced in the first stage is limited to what the researchers decided (consciously or unconsciously) to include in the ‘method’ section of their papers. This puts a lot of weight on writing a good method section.

Qualitative researchers have taught me to create a “research audit”, which, similar to accounting books, can be reviewed by others. This audit contains the decisions researchers made during the first stage of their project and increases transparency. I had two great colleagues at Northwestern University who further strengthened this mindset.
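For readers who like to see things in code, a research audit could be kept as a simple, append-only decision log. The following is only a minimal Python sketch under assumptions: the `Decision` fields, the `record` helper and the two example entries are invented for illustration and do not come from the keynote or from any specific audit standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Decision:
    """One entry in a research audit (hypothetical structure)."""
    made_on: date
    stage: str       # e.g. "design", "data collection", "analysis"
    decision: str    # what was decided
    rationale: str   # why it was decided


# The audit is an ordered list; entries are appended, never edited.
audit_log = []


def record(made_on, stage, decision, rationale):
    """Append a decision to the audit log and return the new entry."""
    entry = Decision(made_on, stage, decision, rationale)
    audit_log.append(entry)
    return entry


# Hypothetical entries, for illustration only.
record(date(2018, 1, 15), "data collection",
       "Restrict the sample to research universities",
       "Keeps institutions comparable")
record(date(2018, 1, 29), "analysis",
       "Fit a simpler model first",
       "The full model takes too long to run")

for entry in audit_log:
    print(f"{entry.made_on} [{entry.stage}] {entry.decision}: {entry.rationale}")
```

Because each entry is frozen and the log is only appended to, a reviewer can read the decisions in the order they were made, much like going through accounting books.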

Open science and citizen science

The issue of sharing information about our research process reminded me of a TED Radio podcast on citizen science. The key in citizen science is the sharing of information across stakeholders. Instead of competition, collaboration is what advances science. Now, citizen science is not ‘amateur’ science. While not all of the citizens involved in the process have research credentials, such as a Ph.D. degree, they are all driven by an interest in asking the right question and collecting data to answer it. The advantage of citizen scientists is that their promotion and job security do not depend on their research output. This removes a major barrier to collaboration.

It is not only citizen science that benefits from collaboration. The same applies to academic science. It is great to see that more research is done in teams, across institutional and country boundaries. However, the disciplinary divides create hurdles that all scientific teams need to clear.

Open science needs rich metadata

But open science needs more than sharing information for it to take off. Specifically, in Carole’s opinion, open science needs people to write rich metadata. This means investing time in a detailed description of the data that researchers collected: what data was collected and how, what the variables were, how they were measured, which instruments were used and who distributed them. This goes beyond your standard method section. I think the debate about reproducible social science highlights that most descriptions of scientific experiments lack the detail needed for the research to be reproduced. There are too many variables that researchers do not expect to influence the outcome, or do not even consciously consider. For example, could the weather influence your outcome?

The metadata needs to be rich enough for other people to know what information is contained in a file and what questions can be answered using that data set. The bottleneck exists because rivals and competitors could use your data to conduct research and get published. Currently, as most of the academic promotion system is based on the number of publications and the quality of the publication outlets, having competitors publish on your topics of expertise with your data can be bad for promotion.
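To make “rich metadata” a bit more concrete, here is a minimal Python sketch of a metadata record that could be stored next to a data file. It is an illustration under assumptions, not a standard: the `DatasetMetadata` and `Variable` structures, their field names and all example values are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Variable:
    """Describes one column in the data set (hypothetical structure)."""
    name: str
    description: str
    instrument: str  # how the value was measured or collected


@dataclass
class DatasetMetadata:
    """Describes the data set as a whole (hypothetical structure)."""
    title: str
    collected_by: str
    collection_method: str
    collection_period: str
    variables: list = field(default_factory=list)
    # Circumstances that might matter for reproduction, e.g. the weather.
    context_notes: list = field(default_factory=list)


# Hypothetical example values, invented for illustration only.
metadata = DatasetMetadata(
    title="Example vacancy data set",
    collected_by="Research team (names omitted)",
    collection_method="Manual coding of published job advertisements",
    collection_period="2017-09 to 2017-12",
    variables=[
        Variable("university", "Name of the hiring university",
                 "Taken from the advertisement text"),
        Variable("field", "Scientific field of the vacancy",
                 "Coded by hand by one researcher"),
    ],
    context_notes=["Advertisements were collected on weekdays only"],
)

# Serialize to JSON so the description can travel with the data file.
metadata_json = json.dumps(asdict(metadata), indent=2)
print(metadata_json)
```

A record like this answers the two questions raised above: what is in the file, and which questions the data could plausibly answer.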

Open data and a new publishing process

I was once in a meeting where the senior researchers talked about “getting scooped” because someone else had published a paper on a specific topic before them. The discussion then turned to finding weaknesses in that paper, in order to see whether there was still an opening for them to publish their own. I don’t know the end of the story.

Back to the keynote. Towards the end, Carole talked about one new concept: the evolving manuscript. An evolving manuscript is, pretty straightforwardly, a manuscript that keeps on evolving and changing after the original author(s) published it. Those familiar with coding will know GitHub. It’s a platform onto which programmers can upload their code (into “repositories”). If these are public, others can access them. At the time of writing, private repositories cost money, so most of them are public. This means that other users can see your code, contribute to it, improve it, or use it for their own purposes.

Applied to manuscripts, this concept means that researchers would upload their manuscript to a central platform, with their data linked to the paper. From then on, other researchers could access the paper, modify it, conduct other analyses with the data, add research questions to it and so on.

With that thought of others, including the guy who ‘stole’ your idea, potentially working on your paper, I’d like to end this post. Scary? Exciting?

If you have finished reading the post and are still wondering what the picture has to do with its content, I have managed to put you in the same puzzled state I was in on Saturday morning at 8 AM, trying to figure out how that scooter made it down the stairs. Curiosity drives research!
