Publications
You can find my articles on my Google Scholar profile. Publication records include self-assessed CRediT statements characterizing my areas of contribution.
Politics
Data mining
Statistical and Bayesian Inference
Epidemics
Language acquisition
Literature review
Data management
Software
Science and Collective Intelligence
Natural language processing
Networks
Inverse problems
Preprints
Peer-reviewed publications
- Gautheron L. “Balancing Specialization and Adaptation in a Transforming Scientific Landscape” in EPJ Data Science, 2024. Abstract: How do scientists navigate between the need to capitalize on their prior knowledge through specialization, and the urge to adapt to evolving research opportunities? Drawing from diverse perspectives on adaptation, this paper proposes an unsupervised Bayesian approach motivated by Optimal Transport of the evolution of scientists' research portfolios in response to transformations in their field. The model relies on $186,162$ scientific abstracts and authorship data to evaluate the influence of intellectual, social, and institutional resources on scientists' trajectories within a cohort of $2,195$ high-energy physicists between 2000 and 2019. Using Inverse Optimal Transport, the reallocation of research efforts is shown to be shaped by learning costs, thus enhancing the utility of the scientific capital disseminated among scientists. Two dimensions of social capital, namely "diversity" and "power", have opposite associations with the magnitude of change in scientists' research interests: while "diversity" disrupts and expands research interests, "power" is associated with more stable research agendas. Social capital plays a more crucial role in shifts between cognitively distant research areas. More generally, this work suggests new approaches for understanding, measuring and modeling collective adaptation using Optimal Transport.
- Cristia A., Gautheron L., Zhang Z., Schuller B., Scaff C., Rowland C., Räsänen O., Peurey L., Lavechin M., Havard W., Fausey C., Cychosz M., Bergelson E., Anderson H., Al N., Soderstrom M. “Establishing the reliability of metrics extracted from long-form recordings using LENA and the ACLEW pipeline” in Behavior Research Methods, 2024 (Data curation, Software, Writing – review & editing). Abstract: Long-form audio recordings are increasingly used to study individual variation, group differences, and many other topics in theoretical and applied fields of developmental science, particularly for the description of children's language input (typically speech from adults) and children's language output (ranging from babble to sentences). The proprietary LENA software has been available for over a decade, and with it, users have come to rely on derived metrics like adult word count (AWC) and child vocalization counts (CVC), which have also more recently been derived using an open-source alternative, the ACLEW pipeline. Yet, there is relatively little work assessing the reliability of long-form metrics in terms of the stability of individual differences across time. Filling this gap, we analyzed eight spoken-language datasets: four from North American English-learning infants, and one each from British English-, French-, American English-Spanish, and Quechua-Spanish-learning infants. The audio data were analyzed using two types of processing software: LENA and the ACLEW open-source pipeline. When all corpora were included, we found relatively low to moderate reliability (across multiple recordings, intraclass correlation coefficient attributed to the child identity (Child ICC), was <50% for most metrics). There were few differences between the two pipelines. Exploratory analyses suggested some differences as a function of child age and corpora. These findings suggest that, while reliability is likely sufficient for various group-level analyses, caution is needed when using either LENA or ACLEW tools to study individual variation. We also encourage improvement of extant tools, specifically targeting accurate measurement of individual variation.
- Gautheron L., Omodei E. “How research programs come apart: The example of supersymmetry and the disunity of physics” in Quantitative Science Studies, 2023 (Conceptualization, Methodology, Software, Formal analysis, Data Curation, Writing - Original Draft, Visualization). Abstract: According to Peter Galison, the coordination of different “subcultures” within a scientific field happens through local exchanges within “trading zones.” In his view, the workability of such trading zones is not guaranteed, and science is not necessarily driven towards further integration. In this paper, we develop and apply quantitative methods (using semantic, authorship, and citation data from scientific literature), inspired by Galison’s framework, to the case of the disunity of high-energy physics. We give prominence to supersymmetry, a concept that has given rise to several major but distinct research programs in the field, such as the formulation of a consistent theory of quantum gravity or the search for new particles. We show that “theory” and “phenomenology” in high-energy physics should be regarded as distinct theoretical subcultures, between which supersymmetry has helped sustain scientific “trades.” However, as we demonstrate using a topic model, the phenomenological component of supersymmetry research has lost traction and the ability of supersymmetry to tie these subcultures together is now compromised. Our work supports that even fields with an initially strong sentiment of unity may eventually generate diverging research programs and demonstrates the fruitfulness of the notion of trading zones for informing quantitative approaches to scientific pluralism.
- Cristia A., Gautheron L., Colleran H. “Vocal input and output among infants in a multilingual context: Evidence from long-form recordings in Vanuatu” in Developmental Science, 2023 (Data Curation, Formal analysis, Writing - Review & Editing). Abstract: What are the vocal experiences of children growing up on Malakula island, Vanuatu, where multilingualism is the norm? Long-form audio-recordings captured spontaneous speech behavior by, and around, 38 children (5–33 months, 23 girls) from 11 villages. Automated analyses revealed most children's vocal input came from female adults and other children's voices, with small contributions from male adult voices. The greatest changes with age involved an increase in the input vocalizations from other children. Total input (collapsing across child-directed and overheard speech, and across languages) was ∼11 min per hour, which was at least 5 min (31%) lower than that found in other populations studied using comparable methods in previous literature, as well as in archival American data analyzed with the same algorithm. In contrast, children's own vocalization counts were two to four times higher than previous reports for North-American English-learning monolingual infants at matched ages, and comparable to estimates from archival American data, consistent with a resilient language-learning cognitive system for this aspect of vocal development. The strongest association between input and output was with vocalizations by other children, rather than those by adults, which is consistent with research in anthropology but less so with current theoretical trends in developmental psychology. These results invite further research in populations that are under-represented in developmental science.
- Gautheron L., Rochat N., Cristia A. “Managing, storing, and sharing long-form recordings and their annotations” in Language Resources and Evaluation, 2022 (Conceptualization, Software, Writing - Original Draft). Abstract: The technique of long-form recordings via wearables is gaining momentum in different fields of research, notably linguistics and neurology. This technique, however, poses several technical challenges, some of which are amplified by the peculiarities of the data, including their sensitivity and their volume. In this paper, we begin by outlining key problems related to the management, storage, and sharing of the corpora that emerge when using this technique. We continue by proposing a multi-component solution to these problems, specifically in the case of daylong recordings of children. As part of this solution, we release ChildProject, a Python package for performing the operations typically required by such datasets and for evaluating the reliability of annotations using a number of measures commonly used in speech processing and linguistics. This package builds upon an annotation management system, which allows the importation of annotations from a wide range of existing formats, as well as upon data validation procedures, which assert the conformity of the data, or, alternatively, produce detailed and explicit error reports. Our proposal could be generalized to populations other than children and beyond linguistics.
- Lavechin M., Seyssel M., Gautheron L., Dupoux E., Cristia A. “Reverse Engineering Language Acquisition with Child-Centered Long-Form Recordings” in Annual Review of Linguistics, 2022 (Writing - Review & Editing). Abstract: Language use in everyday life can be studied using lightweight, wearable recorders that collect long-form recordings—that is, audio (including speech) over whole days. The hardware and software underlying this technique are increasingly accessible and inexpensive, and these data are revolutionizing the language acquisition field. We first place this technique into the broader context of the current ways of studying both the input being received by children and children's own language production, laying out the main advantages and drawbacks of long-form recordings. We then go on to argue that a unique advantage of long-form recordings is that they can fuel realistic models of early language acquisition that use speech to represent children's input and/or to establish production benchmarks. To enable the field to make the most of this unique empirical and conceptual contribution, we outline what this reverse engineering approach from long-form recordings entails, why it is useful, and how to evaluate success.
Peer-reviewed conference proceedings
- Gautheron L., Lavechin M., Riad R., Scaff C., Cristia A. “Longform recordings: Opportunities and challenges” in the proceedings of LIFT 2020 - 2èmes journées scientifiques du Groupement de Recherche "Linguistique informatique, formelle et de terrain", 2020 (Writing - Original Draft)
Talks
Invited Talks
- Gautheron L. “"When her family finds you are using the wrong metric…": dilemmas and trade-offs in the diffusion of scientific conventions” in Department of Philosophy, Logic and Scientific Method, London School of Economics, United Kingdom, 2024
- Gautheron L. “Correlational analyses of the effects of caregivers’ speech behaviour on children’s speech production” in Department of Linguistics, UCLA, Los Angeles, CA, United States, 2024
- Gautheron L. “Algorithmic bias in correlational analyses of the effects of caregivers' speech behaviour on children's speech production” in Daylong Audio Recordings of Children's Linguistic Environments (DARCLE), online, 2024
- Gautheron L. “Balancing Specialization and Adaptation in a Transforming Scientific Landscape: Modelling scientists' behavior with Natural Language Processing” in NLP Seminar, LATTICE, Montrouge, France, 2023
- Gautheron L. “Too beautiful to be false, or too beautiful to be true: supersymmetry and the future of high-energy physics” in 2JM seminar, Sciences Po, Paris, France, 2023
- Gautheron L., Cristia A. “A Python package for long-form recordings and their annotations” in Bergelson Lab, Duke University, NC, United States [online], 2021
Contributed Talks
- Gautheron L. “Social dilemmas in high-energy physics” in Workshop "Methodological Transformations in Fundamental Physics", Wuppertal, Germany, 2024
- Gautheron L. “A dialogue between philosophy of science and computational studies of science illuminates the crisis of fundamental physics” in Workshop "Philosophy of Science meets Quantitative Science Studies", Turin, Italy, 2024
- Gautheron L. “From Colliders to Cosmos: Dynamics of Cooperation and Collaboration in High-Energy Physics” in Summer school "Collaboration and Interdisciplinarity in Science and Technology", Wuppertal, Germany, 2023
- Gautheron L. “Probing Socio-Epistemic Dynamics in High-Energy Physics Using the Inspire HEP Database” in Conference "Big Data & History and Philosophy of Science", 2023
- Gautheron L. “La désunité de la physique des hautes-énergies” in XIVe Congrès de la Société française d'histoire des sciences et des techniques: symposium "La physique de l'après Seconde guerre mondiale, entre ruptures et continuités", Bordeaux, France, 2023
- Gautheron L. “Who trusts supersymmetry? Probing quantitative methods for investigating research orientations in High-Energy Physics” in 4th International Spring School of the Epistemology of the Large Hadron Collider: The History, Philosophy and Sociology of Large Scale Experiments, Wuppertal, Germany, 2022
Selected press articles
- Gautheron L., Gence C. “[DATA] Lutte contre le COVID-19 : oui, la lenteur de l’État français a tué (Fatal Sluggishness: How France's COVID-19 Response Cost Lives)” in Le Média, 2020. Abstract: In this article, we used data from the Oxford Government Response Tracker to show that France took containment measures against Covid relatively late compared to other countries, given the timing of the epidemic. We also estimated how many deaths could have been avoided if certain measures had been taken a few days earlier, by adapting a simulation from Imperial College.
- Gautheron L., Gence C. “[DATA] Les morts invisibles du coronavirus : la vérité derrière les chiffres officiels (The Hidden Toll of Coronavirus: Revealing the Truth Beyond Official Numbers)” in Le Média, 2020. Abstract: In this article, we compared mortality data with the official death toll attributed to Covid. We showed that the number of deaths attributed to Covid significantly underestimated the actual number of deaths. We then showed that the government later reduced the discrepancy by accounting for Covid-related deaths occurring in nursing homes, but that there remained an unaccounted-for excess mortality in deaths occurring at home that could be attributed to Covid. This article has been cited in research papers.
- Kouamouo T., Gautheron L. “Élections européennes : un vote de classe avant tout (European Elections: It's Always About Class!)” in Le Média, 2019. Abstract: We analyzed voter trajectories between the French presidential and European elections using a Bayesian ecological inference model. We assessed how these trajectories were influenced by socio-economic factors. This revealed, among other things, the rallying of the right-wing bourgeoisie behind Macron.
- Gautheron L. “Référendum ADP : les médias au service du pouvoir (Silencing Democracy: Media Blackout on the ADP Privatization Referendum)” in Le Média, 2019. Abstract: In this article, I analyzed how signatures on a nationwide constitutional petition correlated with various socioeconomic/political variables in French cities. Low education turned out to be a strongly negative factor. I explored the possibility that this could be the result of poor media coverage. I then measured media coverage of the petition by applying a speech-to-text model to public television news archives. The article has been cited in a book, in research papers, and in an appeal to the French Constitutional Court to increase media coverage of these petitions.
- Gautheron L. “Des chiffres pour appréhender l'anti-Mélenchonisme de la presse” in Marianne, 2017. Abstract: In this article, I applied sentiment analysis and emotion detection to press articles and illustrations to explore differences in the journalistic treatment of various political figures.