Regensburg 2019 – scientific programme

SOE: Physics of Socio-economic Systems Division

SOE 16: Networks and Systemic Risks (joint SOE/DY)

SOE 16.4: Talk

Thursday, April 4, 2019, 10:30–10:45, H17

Reproducibility in Statistical Analysis of Natural Language: the case of Project Gutenberg

•Francesc Font-Clos (1) and Martin Gerlach (2)
(1) Center for Complexity and Biosystems, University of Milan, Italy
(2) Department of Chemical and Biological Engineering, Northwestern University, USA

Data from Project Gutenberg (PG) have been extremely popular in the statistical analysis of language for more than 25 years. However, in contrast to other fields, no standardized, agreed-upon version of the dataset exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail).

To address these shortcomings, we present the Standardized Project Gutenberg Corpus (SPGC), an open-science approach to a curated version of the complete PG data, containing more than 50,000 books and more than 3 × 10⁹ word-tokens. We publish our methodology in detail, the code to download and process the data, and the corpus itself at three levels of granularity. In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
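
As a rough illustration of what levels of granularity can mean for a text corpus, the sketch below takes one text through three representations: the raw string, a sequence of tokens, and a table of word counts. The sample sentence, the regular-expression tokenizer, and the function name `tokenize` are assumptions made for illustration only. This is not the SPGC pipeline, whose actual processing is documented in the manuscript and code listed below.

```python
import re
from collections import Counter

# Hypothetical stand-in for the text of one book; not taken from the corpus.
raw_text = "The whale, the white whale! The sea."


def tokenize(text):
    """Lowercase the text and return each run of the letters a-z as one token.

    This is a deliberately simple assumption; real tokenizers treat
    punctuation, digits, and non-English characters more carefully.
    """
    return re.findall(r"[a-z]+", text.lower())


# Level 1: the raw text itself (raw_text).
# Level 2: the ordered sequence of word-tokens.
tokens = tokenize(raw_text)

# Level 3: how often each distinct word occurs, with word order discarded.
counts = Counter(tokens)

print(len(tokens), "tokens,", len(counts), "distinct words")
print(counts.most_common(2))
```

Each level discards information relative to the previous one: the token sequence drops punctuation and case, and the counts drop word order. Fixing such choices once is what lets different studies work from identical input.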

Manuscript: arxiv.org/abs/1812.08092

Code: github.com/pgcorpus/gutenberg

Data: zenodo.org/record/2422561
