Regensburg 2019 – scientific programme
SOE: Physics of Socio-economic Systems Division (Fachverband Physik sozio-ökonomischer Systeme)
SOE 16: Networks and Systemic Risks (joint SOE/DY)
SOE 16.4: Talk
Thursday, April 4, 2019, 10:30–10:45, H17
Reproducibility in Statistical Analysis of Natural Language: the case of Project Gutenberg
•Francesc Font-Clos (1) and Martin Gerlach (2)
(1) Center for Complexity and Biosystems, University of Milan, Italy
(2) Department of Chemical and Biological Engineering, Northwestern University, USA
Data from Project Gutenberg (PG) has been extremely popular in the statistical analysis of language for more than 25 years. However, in contrast to other fields, no standardized, consensual version of the dataset exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail).
To address these shortcomings, we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10^9 word-tokens. We publish our methodology in detail, the code to download and process the data, and the corpus itself at three different levels of granularity. In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
Manuscript: arxiv.org/abs/1812.08092
Code: github.com/pgcorpus/gutenberg
Data: zenodo.org/record/2422561
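
Illustration: the abstract mentions a corpus published at three levels of granularity without naming them. The sketch below shows one plausible reading, assuming the levels are cleaned text, a token sequence, and word counts. The function name `three_levels` and the regex tokenizer are hypothetical and are not taken from the SPGC code; they only show how each level can be derived from the previous one.

```python
import re
from collections import Counter


def three_levels(raw_text):
    """Hypothetical helper: derive three granularity levels from one book.

    Assumed levels (not confirmed by the abstract): text, tokens, counts.
    """
    # Level 1: the text itself, with runs of whitespace collapsed.
    text = " ".join(raw_text.split())
    # Level 2: lowercase alphabetic tokens. This regex is a simplistic
    # stand-in tokenizer, not the one used by the SPGC pipeline.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Level 3: how often each word type occurs.
    counts = Counter(tokens)
    return text, tokens, counts


sample = "The quick brown fox.\nThe lazy dog!"
text, tokens, counts = three_levels(sample)
print(text)
print(tokens)
print(counts.most_common(3))
```

Each level discards information from the one before it (layout, then word order), which is why a fixed, documented pipeline matters for reproducibility: two studies that tokenize differently will report different counts for the same book.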