Regensburg 2019 – Scientific Programme

SOE: Physics of Socio-economic Systems Division

SOE 16: Networks and Systemic Risks (joint SOE/DY)

SOE 16.4: Talk

Thursday, 4 April 2019, 10:30–10:45, H17

Reproducibility in Statistical Analysis of Natural Language: the case of Project Gutenberg — •Francesc Font-Clos¹ and Martin Gerlach² — ¹Center for Complexity and Biosystems, University of Milan, Italy — ²Department of Chemical and Biological Engineering, Northwestern University, USA

Data from Project Gutenberg (PG) have been extremely popular in the statistical analysis of language for more than 25 years. However, in contrast to other fields, no standardized, consensus version of the dataset exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail).

To address these shortcomings, we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10⁹ word-tokens. We publish our methodology in detail, the code to download and process the data, and the corpus itself at three different levels of granularity. In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
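
As an illustration of what several levels of granularity can mean for a text corpus, the following minimal Python sketch derives a token sequence and word counts from a raw text. The three levels shown here (raw text, tokens, counts), the function names `to_tokens` and `to_counts`, and the tokenization rule are assumptions made for illustration; they are not taken from the SPGC code, which should be consulted for the actual pipeline.

```python
import re
from collections import Counter


def to_tokens(text: str) -> list[str]:
    """Lowercase the text and return its runs of letters as word-tokens.

    Hypothetical tokenization rule: digits, punctuation and underscores
    are treated as separators and dropped.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def to_counts(tokens: list[str]) -> Counter:
    """Collapse a token sequence into word frequencies (word order is lost)."""
    return Counter(tokens)


# Level 1 (assumed): the raw text of a book, here a toy example.
raw_text = "The quick brown fox. The lazy dog!"

# Level 2 (assumed): the ordered sequence of word-tokens.
tokens = to_tokens(raw_text)

# Level 3 (assumed): the number of occurrences of each word.
counts = to_counts(tokens)

print(tokens)
print(counts.most_common(3))
```

Each level discards information relative to the previous one, so fixing the rule for every step is what makes analyses by different groups comparable.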

Manuscript: arxiv.org/abs/1812.08092

Code: github.com/pgcorpus/gutenberg

Data: zenodo.org/record/2422561