Report for: Probing the Statistical Properties of Unknown Texts: Application to the Voynich Manuscript

Title	Probing the Statistical Properties of Unknown Texts: Application to the Voynich Manuscript
Published in	PLOS ONE, July 2013
DOI	10.1371/journal.pone.0067310
Pubmed ID	23844002
Authors	Diego R. Amancio, Eduardo G. Altmann, Diego Rybski, Osvaldo N. Oliveira, Luciano da F. Costa
Abstract	While the use of statistical physics methods to analyze large corpora has been useful to unveil many patterns in texts, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., written in an unknown alphabet) is compatible with a natural language and to which language it could belong. The approach is based on three types of statistical measurements, i.e. obtained from first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependency of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for keywords of the Voynich Manuscript which could be helpful in the effort of deciphering it. Because we were able to identify statistical measurements that are more dependent on the syntax than on the semantics, the framework may also serve for text analysis in language-dependent applications.
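The abstract's comparison of real texts with their shuffled versions can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a simple undirected word-adjacency network (consecutive distinct words are linked) and computes only degree and degree assortativity. The function names, the whitespace tokenization, and the absence of any preprocessing are all assumptions made for the example; selectivity is omitted because the abstract does not define it.

```python
import random
from collections import defaultdict


def adjacency_network(words):
    """Undirected network linking each pair of adjacent, distinct words.

    Assumption: no lemmatization or stopword removal is applied here.
    """
    neighbors = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def degrees(neighbors):
    """Number of distinct neighbors of each word."""
    return {word: len(ns) for word, ns in neighbors.items()}


def assortativity(neighbors):
    """Pearson correlation between the degrees at the two ends of each edge."""
    deg = degrees(neighbors)
    xs, ys = [], []
    for a, ns in neighbors.items():
        for b in ns:  # every undirected edge is visited in both directions
            xs.append(deg[a])
            ys.append(deg[b])
    n = len(xs)
    if n == 0:
        return 0.0
    mean = sum(xs) / n  # xs and ys share the same mean by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var > 0 else 0.0


def shuffled(words, seed=0):
    """Copy of the word list in random order (word frequencies are preserved)."""
    rng = random.Random(seed)
    out = list(words)
    rng.shuffle(out)
    return out


if __name__ == "__main__":
    text = "a b a c a d".split()  # toy input; a star-shaped network around "a"
    print(assortativity(adjacency_network(text)))            # -1.0
    print(assortativity(adjacency_network(shuffled(text))))  # shuffled baseline
```

In practice one would average the shuffled baseline over many seeds and compare it with the value for the original word order; a measure that differs clearly between the two is sensitive to word order rather than to word frequencies alone.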


X Demographics

The data shown below were collected from the profiles of 11 X users who shared this research output.

Geographical breakdown

Country	Count	As %
India	1	9%
United States	1	9%
Germany	1	9%
United Kingdom	1	9%
Austria	1	9%
Norway	1	9%
Philippines	1	9%
Unknown	4	36%

Demographic breakdown

Type	Count	As %
Members of the public	7	64%
Scientists	4	36%

Mendeley readers

The data shown below were compiled from readership statistics for 66 Mendeley readers of this research output.

Geographical breakdown

Country	Count	As %
Germany	3	5%
Brazil	2	3%
Italy	1	2%
Canada	1	2%
Belarus	1	2%
Unknown	58	88%

Demographic breakdown

Readers by professional status	Count	As %
Researcher	14	21%
Professor > Associate Professor	10	15%
Student > Master	8	12%
Professor	6	9%
Student > Bachelor	5	8%
Other	18	27%
Unknown	5	8%

Readers by discipline	Count	As %
Physics and Astronomy	13	20%
Computer Science	11	17%
Social Sciences	7	11%
Agricultural and Biological Sciences	4	6%
Medicine and Dentistry	4	6%
Other	19	29%
Unknown	8	12%
