Publications as data in the age of open science

Pierre Mounier, EHESS, OpenEdition Center (USR 2004, CNRS / EHESS / AMU / Avignon Université) and Didier Torny, CNRS, Institut Interdisciplinaire de l’Innovation (I3, UMR9217, CNRS / Mines ParisTech / Télécom ParisTech / École Polytechnique)

Since the invention of the “journal” form in the seventeenth century, publications have always been used as data by other scientists. As Christine L. Borgman, Professor of Information Science, puts it, “Publication, as the public record of research, is part of a continuous cycle of reading, writing, discussing, searching, investigating, presenting, submitting, and reviewing. No scholarly publication stands alone.” [1] But the ways these publications are mobilized and transformed into data are varied and involve ever more complex infrastructures.

The mere dissemination of scientific writings led, for instance, to the elaboration of the notion of the average. In astronomy, faced with series of divergent observations published in various papers, it became necessary to devise rules for extracting as much information as possible by combining them into a single measure [2]. However, one can argue that this kind of gathering of publications was not yet systematic, in particular because the “research paper” form was not really stabilized before the end of the nineteenth century [3]. From that time on, systematic forms of publication analysis appeared.

Many scientific disciplines have thus developed bibliographic tools to help researchers identify all publications, books and research papers alike, relevant to their topics: Index Chemicus in chemistry, or Année Philologique and Bulletin Annuel de l’Histoire de France in the humanities and social sciences, for instance. To this day, some disciplines have continued the critical analysis of publications from a comparative perspective. This is the case in anthropology, which encourages the transformation of publications into data through the Human Relations Area Files database, which indexes the contents of publications, now down to the exact paragraph, under multiple entries. With tools of this kind, and with later multidisciplinary databases such as PubMed in the biomedical sciences, a first type of systematic reuse of the literature, often referred to as meta-analysis, has developed.

Meta-analysis

The first and most famous example is that of the statistician Karl Pearson, inventor of the correlation coefficient and the chi-square test, who examined the entire literature available in 1904 on inoculation against typhoid [4]. It is in this area of clinical trials that meta-studies, or meta-analyses, developed the most in the second half of the twentieth century. The point was to limit publication bias and to gain greater statistical power for testing published results. This does not mean, however, that it was a mechanical operation: the main issue was the selection criteria for the articles ultimately retained for the statistical calculations, each publication being considered more or less reliable data. This led both to radical criticism of the method as overly selective and to very fine methodological developments, particularly within the Cochrane Collaboration.
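To make the idea of "greater statistical power" concrete, here is a minimal sketch of inverse-variance (fixed-effect) pooling in Python. It is a generic textbook technique, not claimed to be the method of the 1904 study or a Cochrane procedure; the function name `pool_fixed_effect` and all the numbers are hypothetical and chosen for illustration only.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Pool independent study estimates, weighting each by 1 / SE^2."""
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need one standard error per estimate")
    # More precise studies (smaller standard error) get larger weights.
    weights = [1.0 / (se ** 2) for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    # The pooled standard error shrinks as studies are added.
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical (effect estimate, standard error) pairs, invented for illustration.
studies = [(0.30, 0.20), (0.10, 0.25), (0.25, 0.15)]
pooled, pooled_se = pool_fixed_effect(
    [estimate for estimate, _ in studies],
    [se for _, se in studies],
)
print(f"pooled estimate: {pooled:.3f}, pooled standard error: {pooled_se:.3f}")
```

The pooled standard error is smaller than that of any single study, which is where the gain in power comes from. The sketch deliberately leaves out what the text identifies as the hard part: deciding which publications are reliable enough to enter the calculation.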

In the twenty-first century, two other forms of systematic use of publications as data emerged, which stand in opposite relations to primary publications. The first developed out of a crisis of confidence in the experimental sciences and aims to reproduce the experiments and results described in articles; the second, on the contrary, literally takes the data inside the articles to produce new knowledge. Both rely on more open information infrastructures, in particular on content made available in structured formats (XML, HTML), so that the elements within a document (illustrations, tables, bibliographical references, quotations) are not only readable but also usable with digital tools.

Reproducibility studies

From the 1970s onward, public cases of scientific fraud, based either on plagiarism or on the fabrication of data, mobilized people well beyond academic communities [5]. New forms of self-regulation were gradually invented and implemented, concerning both authorship and the accountability of the signatories of articles [6]. This did not prevent the emergence of new cases publicized worldwide, such as those of the physicist Schön [7] or the “cloner” biologist Hwang. A very famous article by John Ioannidis concludes that most published articles are distorted by the accumulation of various identified biases, and that it is therefore to be expected that their results are not reproduced, or only poorly [8]. While some, in particular in science studies, focus on the causes of this phenomenon (publication frenzy, career management, journal biases…), others attempt to measure it. In that respect, the Reproducibility Initiative, first launched by the Public Library of Science (PLOS), aims to fund and publish studies specifically dedicated to reproducing previously published experiments. Collective initiatives have thus been able to show that a majority of studies in psychology could not be reproduced [9].

The massive aggregation of results

In contrast to this movement, which asks for an article to be redone from the whole set of elements it describes (data, methods, hypotheses…), the development of text and data mining has led to the systematic exploration of the contents of publications in order to extract individual data points, rather than focusing on the “results” put forward by the authors of the primary publications. To take a simple example, a growing literature focuses on p-curves, that is, the distribution of p-values [10] in a journal, a discipline, etc. These values were first collected manually, then extracted by computer scripts; studies have shown that there were probably massive experimental biases, if not arrangements with the data, aimed at crossing the commonly agreed significance threshold [11].
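The kind of script mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the studies cited: the regular expression, the function names `extract_p_values` and `count_around_threshold`, and the sample sentence are all hypothetical, and real articles report p-values in many more formats than this pattern covers.

```python
import re

# Matches reports such as "p = 0.049", "p<.01" or "P = .2".
# Assumption: values are written as decimals below 1, with an explicit operator.
P_VALUE = re.compile(r"\bp\s*([<=>])\s*(0?\.\d+)", re.IGNORECASE)


def extract_p_values(text):
    """Return (operator, value) pairs for every p-value report found in text."""
    return [(op, float(value)) for op, value in P_VALUE.findall(text)]


def count_around_threshold(p_values, threshold=0.05, width=0.01):
    """Count exactly reported p-values just below and just above the threshold."""
    exact = [value for op, value in p_values if op == "="]
    below = sum(1 for v in exact if threshold - width <= v < threshold)
    above = sum(1 for v in exact if threshold <= v < threshold + width)
    return below, above


# Invented sample text, for illustration only.
sample = (
    "The effect was significant (p = 0.049). A second test gave p<.01, "
    "a third p = 0.048, and the control comparison was not significant "
    "(P = 0.051; p = .20)."
)
found = extract_p_values(sample)
below, above = count_around_threshold(found)
print(found)
print(f"just below 0.05: {below}, just above 0.05: {above}")
```

Over a large corpus, a marked excess of values just below the threshold compared with just above it is the kind of pattern that p-curve studies examine. Running such a script at scale presupposes exactly what the text stresses: articles available in structured, minable formats.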

Other, more constructive examples exist, in particular the production of stable and shared ontologies, for example in molecular biology [12]. The issues of the right to mine texts, the format of the article, the encoding of data in the publication, and the distribution license are crucial to these approaches. This movement is quite distinct from that of publishing the data behind every article, since it makes it possible, on the one hand, to revisit the entire published literature with an approach different from that of scientometrics [13] and, on the other hand, to rely on the constrained formats of publications rather than dive into the boundless oceans of research data.

Infrastructures in the service of better cumulativeness in SSH

To carry this type of approach over to other forms of use of publications as data, opening new research fronts in SSH requires a new generation of tools that make it possible to discover all available resources by creating intellectual relationships across disciplinary, linguistic, temporal and format barriers. In a nutshell, the aim is to offer the exploratory and analytical tools of “distant reading” that SSH need in order to rediscover, reinterpret and remobilize, from a fresh perspective, a corpus that is already available but completely fragmented [14].

This requires fine-grained and partially automated indexing of contents, according to the various scientific indexes used by the communities, which makes it possible to identify data embedded within publications, to link data and publications, and to provide researchers with rich and verified information about the contents, through metadata and metrics that qualify that information. The paradigm of open science also implies a service open to all, free from any technical, financial, disciplinary or institutional barrier, and indexed resources that are likewise open to all [15]. The handling of the tools of writing, editing and publication, that is, of the material inscription of scientific knowledge, cannot be delegated to operators outside the academic community, as is the case today, but should be reintegrated into the research process itself.

Text originally published in French in January 2019 in La Lettre de l’InSHS.

[1] Borgman C. L. 2007, Scholarship in the Digital Age: Information, Infrastructure, and the Internet, The MIT Press (Chapter 4: The Continuity of Scholarly Communication).

[2] Desrosières A. 1993, La politique des grands nombres: histoire de la raison statistique, La Découverte.

[3] Csiszar A. 2018, The Scientific Journal: Authorship and the Politics of Knowledge in the Nineteenth Century, University of Chicago Press.

[4] Simpson R. J. S. & Pearson K. 1904, Report on certain enteric fever inoculation statistics, in The British Medical Journal: 1243-1246.

[5] Broad W. & Wade N. 1982, Betrayers of the Truth, Simon & Schuster.

[6] Pontille D. & Torny D. 2012, Behind the scenes of scientific articles: define the categories of fraud and regulate business, in Revue d’Epidémiologie et de Santé Publique, 60(6): 481-481.

[7] Reich E. S. 2009, Plastic fantastic: How the biggest fraud in physics shook the scientific world, Macmillan.

[8] Ioannidis J. P. A. 2005, Why Most Published Research Findings Are False, in PLOS Medicine 2(8): e124.

[9] Open Science Collaboration 2015, Estimating the reproducibility of psychological science, in Science 349(6251), aac4716.

[10] Probability, for a given statistical model and under the null hypothesis, of obtaining a value at least as extreme as the one observed.

[11] Bruns S. B. & Ioannidis J. P. A. 2016, P-curve and p-hacking in observational research, in PLOS ONE 11(2), e0149144.

[12] Kõljalg U., Nilsson R. H., Abarenkov K., Tedersoo L., Taylor A. F., Bahram M. & Douglas B. 2013, Towards a unified paradigm for sequence-based identification of fungi, in Molecular Ecology 22(21): 5271-5277.

[13] Scientometrics focuses on the keywords of the articles, the authors and their institutions.

[14] Schonfeld R. 2018, One Platform to Rule Them All?, in The Scholarly Kitchen.

[15] The current Isidore platform, which combines text and data mining (TDM) with manual checks, is a prototype of these tools. See on this topic: Dumouchel S. 2018, The EOSC as a knowledge marketplace: the example of ISIDORE: A virtuous data circle for users and providers, EUDAT conference: Putting the EOSC vision into practice, Porto, Portugal.