preprints

Classification errors distort findings in automated speech processing: examples and solutions from child-development research

L. Gautheron, E. Kidd, A. Malko, M. Lavechin, and A. Cristia

Jul 2025

Language acquisition, Machine Learning, Statistics

With the advent of wearable recorders, scientists are increasingly turning to automated methods of analysis of audio and video data in order to measure children’s experience, behavior, and outcomes, with a sizable literature employing long-form audio-recordings to study language acquisition. While numerous articles report on the accuracy and reliability of the most popular automated classifiers, less has been written on the downstream effects of classification errors on measurements and statistical inferences (e.g., the estimate of correlations and effect sizes in regressions). This paper proposes a Bayesian approach to study the effects of algorithmic errors on key scientific questions, including the effect of siblings on children’s language experience and the association between children’s production and their input. In both the most commonly used system, LENA, and an open-source alternative (the Voice Type Classifier from the ACLEW system), we find that classification errors can significantly distort estimates. For instance, automated annotations underestimated the negative effect of siblings on adult input by 20–80%, potentially placing it below statistical significance thresholds. We further show that a Bayesian calibration approach for recovering unbiased estimates of effect sizes can be effective and insightful, but does not provide a fool-proof solution. Both the issue reported and our solution may apply to any classifier involving event detection and classification with non-zero error rates.
Dilemmas and trade-offs in the diffusion of conventions

Lucas Gautheron

Jul 2025

Collective cognition, Inverse problems, Statistics

Outside ideal settings, conventions are shaped by heterogeneous competing processes that can challenge the emergence of universal norms. This paper identifies three trade-offs challenging the diffusion of conventions and explores each of them empirically using observational behavioral data. The first trade-off (I) concerns the imperatives of social, sequential, and contextual consistency that individuals must balance when choosing between competing conventions. The second trade-off (II) involves the balance between local and global coordination, depending on whether individuals coordinate their behavior via interactions throughout a social network or external factors transcending the network. The third trade-off (III) is the balance between decision optimality (e.g., collective satisfaction) and decision costs when collectives with conflicting preferences choose one convention. We develop a utilitarian account of conventions which we translate into a broadly applicable statistical physics framework for measuring each of these trade-offs. We then apply this framework to a sign convention in physics using textual and network data. Our analysis suggests that the purpose of conventions may exceed coordination, and that multiple infrastructures (including prior cultural traits and social networks) concurrently shape individual preferences towards conventions. Additionally, we confirm the role of seniority in resolving conflicting preferences in collaborations, resulting in suboptimal outcomes.

publications

Balancing specialization and adaptation in a transforming scientific landscape

Lucas Gautheron

EPJ Data Science, Jan 2025

Collective cognition, Natural language processing, Networks, Statistics, Inverse problems

How do scientists navigate between the need to capitalize on their prior knowledge through specialization, and the urge to adapt to evolving research opportunities? Drawing from diverse perspectives on adaptation, this paper proposes an unsupervised Bayesian approach, motivated by Optimal Transport, to modeling the evolution of scientists’ research portfolios in response to transformations in their field. The model relies on 186,162 scientific abstracts and authorship data to evaluate the influence of intellectual, social, and institutional resources on scientists’ trajectories within a cohort of 2,195 high-energy physicists between 2000 and 2019. Using Inverse Optimal Transport, the reallocation of research efforts is shown to be shaped by learning costs, thus enhancing the utility of the scientific capital disseminated among scientists. Two dimensions of social capital, namely “diversity” and “power”, have opposite associations with the magnitude of change in scientists’ research interests: while “diversity” disrupts and expands research interests, “power” is associated with more stable research agendas. Social capital plays a more crucial role in shifts between cognitively distant research areas. More generally, this work suggests new approaches for understanding, measuring and modeling collective adaptation using Optimal Transport.
Establishing the reliability of metrics extracted from long-form recordings using LENA and the ACLEW pipeline

Alejandrina Cristia, Lucas Gautheron, Zixing Zhang, Björn Schuller, Camila Scaff, and 11 more authors

Behavior Research Methods, Sep 2024

Language acquisition, Statistics

Long-form audio recordings are increasingly used to study individual variation, group differences, and many other topics in theoretical and applied fields of developmental science, particularly for the description of children’s language input (typically speech from adults) and children’s language output (ranging from babble to sentences). The proprietary LENA software has been available for over a decade, and with it, users have come to rely on derived metrics like adult word count (AWC) and child vocalization counts (CVC), which have also more recently been derived using an open-source alternative, the ACLEW pipeline. Yet, there is relatively little work assessing the reliability of long-form metrics in terms of the stability of individual differences across time. Filling this gap, we analyzed eight spoken-language datasets: four from North American English-learning infants, and one each from British English-, French-, American English-Spanish, and Quechua-Spanish-learning infants. The audio data were analyzed using two types of processing software: LENA and the ACLEW open-source pipeline. When all corpora were included, we found relatively low to moderate reliability (across multiple recordings, the intraclass correlation coefficient attributed to child identity (Child ICC) was <50% for most metrics). There were few differences between the two pipelines. Exploratory analyses suggested some differences as a function of child age and corpora. These findings suggest that, while reliability is likely sufficient for various group-level analyses, caution is needed when using either LENA or ACLEW tools to study individual variation. We also encourage improvement of extant tools, specifically targeting accurate measurement of individual variation.
How research programs come apart: The example of supersymmetry and the disunity of physics

Lucas Gautheron and Elisa Omodei

Quantitative Science Studies, Dec 2023

Collective cognition, Natural language processing, Networks

According to Peter Galison, the coordination of different “subcultures” within a scientific field happens through local exchanges within “trading zones.” In his view, the workability of such trading zones is not guaranteed, and science is not necessarily driven towards further integration. In this paper, we develop and apply quantitative methods (using semantic, authorship, and citation data from scientific literature), inspired by Galison’s framework, to the case of the disunity of high-energy physics. We give prominence to supersymmetry, a concept that has given rise to several major but distinct research programs in the field, such as the formulation of a consistent theory of quantum gravity or the search for new particles. We show that “theory” and “phenomenology” in high-energy physics should be regarded as distinct theoretical subcultures, between which supersymmetry has helped sustain scientific “trades.” However, as we demonstrate using a topic model, the phenomenological component of supersymmetry research has lost traction and the ability of supersymmetry to tie these subcultures together is now compromised. Our work supports the view that even fields with an initially strong sentiment of unity may eventually generate diverging research programs, and it demonstrates the fruitfulness of the notion of trading zones for informing quantitative approaches to scientific pluralism.
Vocal input and output among infants in a multilingual context: Evidence from long-form recordings in Vanuatu

Alejandrina Cristia, Lucas Gautheron, and Heidi Colleran

Developmental Science, Feb 2023

Language acquisition, Statistics

What are the vocal experiences of children growing up on Malakula island, Vanuatu, where multilingualism is the norm? Long-form audio-recordings captured spontaneous speech behavior by, and around, 38 children (5–33 months, 23 girls) from 11 villages. Automated analyses revealed most children’s vocal input came from female adults and other children’s voices, with small contributions from male adult voices. The greatest changes with age involved an increase in the input vocalizations from other children. Total input (collapsing across child-directed and overheard speech, and across languages) was ∼11 min per hour, which was at least 5 min (31%) lower than that found in other populations studied using comparable methods in previous literature, as well as in archival American data analyzed with the same algorithm. In contrast, children’s own vocalization counts were two to four times higher than previous reports for North-American English-learning monolingual infants at matched ages, and comparable to estimates from archival American data, consistent with a resilient language-learning cognitive system for this aspect of vocal development. The strongest association between input and output was with vocalizations by other children, rather than those by adults, which is consistent with research in anthropology but less so with current theoretical trends in developmental psychology. These results invite further research in populations that are under-represented in developmental science.
Managing, storing, and sharing long-form recordings and their annotations

Lucas Gautheron, Nicolas Rochat, and Alejandrina Cristia

Language Resources and Evaluation, Feb 2022

Language acquisition, Software

The technique of long-form recordings via wearables is gaining momentum in different fields of research, notably linguistics and neurology. This technique, however, poses several technical challenges, some of which are amplified by the peculiarities of the data, including their sensitivity and their volume. In this paper, we begin by outlining key problems related to the management, storage, and sharing of the corpora that emerge when using this technique. We continue by proposing a multi-component solution to these problems, specifically in the case of daylong recordings of children. As part of this solution, we release ChildProject, a Python package for performing the operations typically required by such datasets and for evaluating the reliability of annotations using a number of measures commonly used in speech processing and linguistics. This package builds upon an annotation management system, which allows the importation of annotations from a wide range of existing formats, as well as upon data validation procedures, which assert the conformity of the data, or, alternatively, produce detailed and explicit error reports. Our proposal could be generalized to populations other than children and beyond linguistics.
Reverse Engineering Language Acquisition with Child-Centered Long-Form Recordings

Marvin Lavechin, Maureen Seyssel, Lucas Gautheron, Emmanuel Dupoux, and Alejandrina Cristia

Annual Review of Linguistics, Jan 2022

Language acquisition, Literature review

Language use in everyday life can be studied using lightweight, wearable recorders that collect long-form recordings—that is, audio (including speech) over whole days. The hardware and software underlying this technique are increasingly accessible and inexpensive, and these data are revolutionizing the language acquisition field. We first place this technique into the broader context of the current ways of studying both the input being received by children and children’s own language production, laying out the main advantages and drawbacks of long-form recordings. We then go on to argue that a unique advantage of long-form recordings is that they can fuel realistic models of early language acquisition that use speech to represent children’s input and/or to establish production benchmarks. To enable the field to make the most of this unique empirical and conceptual contribution, we outline what this reverse engineering approach from long-form recordings entails, why it is useful, and how to evaluate success.

talks

invited talks

A statistical physics approach to dilemmas and trade-offs challenging the diffusion of conventions

Lucas Gautheron

Mar 2025

Institut Jean-Nicod, École Normale Supérieure, Paris
“When her family finds you are using the wrong metric…”: dilemmas and trade-offs in the diffusion of scientific conventions

Lucas Gautheron

Dec 2024

Department of Philosophy, London School of Economics, UK
Correlational analyses of the effects of caregivers’ speech behaviour on children’s speech production

Lucas Gautheron

Jun 2024

Department of Linguistics, UCLA, Los Angeles, United States
Algorithmic bias in correlational analyses of the effects of caregivers’ speech behaviour on children’s speech production

Lucas Gautheron

May 2024

Daylong Audio Recordings of Children’s Linguistic Environments (DARCLE), online
Balancing Specialization and Adaptation in a Transforming Scientific Landscape: Modelling scientists’ behavior with Natural Language Processing

Lucas Gautheron

Nov 2023

NLP Seminar, LATTICE, Montrouge, France
Too beautiful to be false, or too beautiful to be true: supersymmetry and the future of high-energy physics

Lucas Gautheron

May 2023

2JM seminar, Sciences Po, Paris, France

contributed talks

Dilemmas and trade-offs in the diffusion of conventions

Lucas Gautheron

Jul 2025

11th International Conference on Computational Social Science, Norrköping, Sweden
Plural pursuit across scales

Lucas Gautheron and Mike D. Schneider

Apr 2025

Workshop on Large Language Models for the History, Philosophy, and Sociology of Science, TU Berlin
Social dilemmas in high-energy physics

Lucas Gautheron

Sep 2024

Workshop “Methodological Transformations in Fundamental Physics”, Wuppertal, Germany
A dialogue between philosophy of science and computational studies of science illuminates the crisis of fundamental physics

Lucas Gautheron

May 2024

Workshop “Philosophy of Science meets Quantitative Science Studies”, Turin, Italy
From Colliders to Cosmos: Dynamics of Cooperation and Collaboration in High-Energy Physics

Lucas Gautheron

Sep 2023

Summer school “Collaboration and Interdisciplinarity in Science and Technology”, Wuppertal, Germany
Probing Socio-Epistemic Dynamics in High-Energy Physics Using the Inspire HEP Database

Lucas Gautheron

May 2023

Online conference “Big Data & History and Philosophy of Science”
La désunité de la physique des hautes-énergies (The Disunity of High-Energy Physics)

Lucas Gautheron

Apr 2023

XIVe Congrès de la Société française d’histoire des sciences et des techniques: symposium “La physique de l’après Seconde guerre mondiale, entre ruptures et continuités”, Bordeaux, France
The many faces of supersymmetry: Supersymmetry across subcultures of High-Energy Physics, 1971–2019

Lucas Gautheron

Nov 2022

History of Science Society Annual Meeting, Chicago, United States
Who trusts supersymmetry? Probing quantitative methods for investigating research orientations in High-Energy Physics

Lucas Gautheron

Mar 2022

4th International Spring School of the Epistemology of the Large Hadron Collider: The History, Philosophy and Sociology of Large Scale Experiments, Wuppertal, Germany

posters

Balancing Specialization and Adaptation in a Transforming Scientific Landscape

Lucas Gautheron

Jul 2024

10th International Conference on Computational Social Science (IC2S2), Philadelphia, United States

news media

press

[DATA] Les morts invisibles du coronavirus : la vérité derrière les chiffres officiels (The Hidden Toll of Coronavirus: Revealing the Truth Beyond Official Numbers)

Lucas Gautheron and Chloé Gence

Le Média, 2020

Epidemics, Statistics

In this article, we compared mortality data with the official death toll attributed to Covid. We showed that the number of deaths attributed to Covid significantly underestimated the actual number of deaths. We then showed that the government later reduced the discrepancy by accounting for Covid-related deaths occurring in nursing homes, but that there remained an unaccounted-for excess mortality in deaths occurring at home that could be attributed to Covid. This article has been cited in research papers.
[DATA] Lutte contre le COVID-19 : oui, la lenteur de l’État français a tué (Fatal Sluggishness: How France’s COVID-19 Response Cost Lives)

Lucas Gautheron and Chloé Gence

Le Média, 2020

Epidemics, Statistics

In this article, we used data from the Oxford Government Response Tracker to show that France took containment measures against Covid relatively late compared to other countries, given the timing of the epidemic. We also estimated how many deaths could have been avoided if certain measures had been taken a few days earlier, by adapting a simulation from Imperial College.
Élections européennes : un vote de classe avant tout (European Elections: It’s Always About Class!)

Théophile Kouamouo and Lucas Gautheron

Le Média, 2019

Politics, Statistics

We analyzed voter trajectories between the French presidential and European elections using a Bayesian ecological inference model. We assessed how these trajectories were influenced by socio-economic factors. This revealed, among other things, the rallying of the right-wing bourgeoisie behind Macron.
Référendum ADP : les médias au service du pouvoir (Silencing Democracy: Media Blackout on the ADP Privatization Referendum)

Lucas Gautheron

Le Média, 2019

Politics, Data mining

In this article, I analyzed how signatures on a nationwide constitutional petition correlated with various socioeconomic/political variables in French cities. Low education turned out to be a strongly negative factor. I explored the possibility that this could be the result of poor media coverage. I then measured media coverage of the petition by applying a speech-to-text model to public television news archives. The article has been cited in a book, in research papers, and in an appeal to the French Constitutional Court to increase media coverage of these petitions.
Des chiffres pour appréhender l’anti-Mélenchonisme de la presse (Figures for Understanding the Press’s Anti-Mélenchon Bias)

Lucas Gautheron

Marianne, 2017

Politics, Data mining

In this paper, I applied sentiment analysis and emotion detection to press articles and illustrations to explore differences in the journalistic treatment of various political figures.