PLOS

Modeling Statistical Properties of Written Text

Overview of attention for article published in PLOS ONE, April 2009

Mentioned by

2 X users
1 Facebook page

Citations

90 Dimensions

Readers on

119 Mendeley
5 CiteULike

Title: Modeling Statistical Properties of Written Text
Published in: PLOS ONE, April 2009
DOI: 10.1371/journal.pone.0005372
Authors: M. Ángeles Serrano, Alessandro Flammini, Filippo Menczer

Abstract

Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts and identify dynamic word ranking and memory across documents as key mechanisms explaining the non-trivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science and linguistics.
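
The two laws named in the abstract can be measured on any tokenized text. The following is a minimal sketch, not the authors' generative model: it only computes the rank-frequency list that Zipf's law describes and the vocabulary-growth curve that Heaps' law describes. The function names (`tokenize`, `rank_frequency`, `vocabulary_growth`, `loglog_slope`) and the whitespace-based tokenization are assumptions made for illustration.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"  # assumed punctuation set; real corpora need a proper tokenizer


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in stripped if w]


def rank_frequency(tokens):
    """Word counts sorted in descending order; index i holds the count at rank i + 1.

    Zipf's law says this count decays roughly as a power of the rank.
    """
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token.

    Heaps' law says this grows sublinearly with the number of tokens read.
    """
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x): a rough exponent estimate."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


if __name__ == "__main__":
    tokens = tokenize("The cat and the dog and the bird.")
    freqs = rank_frequency(tokens)
    growth = vocabulary_growth(tokens)
    ranks = range(1, len(freqs) + 1)
    lengths = range(1, len(growth) + 1)
    print("Zipf slope (expected negative):", loglog_slope(ranks, freqs))
    print("Heaps slope (below 1 means sublinear):", loglog_slope(lengths, growth))
```

A toy sentence is far too short to show either law; the slopes only become meaningful on a large corpus, and a plain log-log fit is a crude estimator even then.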

X Demographics

The data shown below were collected from the profiles of 2 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 119 Mendeley readers of this research output.

Geographical breakdown

Country         Count   As %
Germany         4       3%
United States   4       3%
Switzerland     3       3%
Philippines     2       2%
China           2       2%
Vietnam         1       <1%
Australia       1       <1%
Italy           1       <1%
Argentina       1       <1%
Other           3       3%
Unknown         97      82%

Demographic breakdown

Readers by professional status          Count   As %
Student > Ph. D. Student                26      22%
Researcher                              26      22%
Student > Master                        16      13%
Professor                               9       8%
Professor > Associate Professor         9       8%
Other                                   23      19%
Unknown                                 10      8%

Readers by discipline                   Count   As %
Computer Science                        31      26%
Physics and Astronomy                   17      14%
Social Sciences                         11      9%
Agricultural and Biological Sciences    9       8%
Linguistics                             8       7%
Other                                   33      28%
Unknown                                 10      8%