Posts by Collection
publications
Abstract: In this paper, I applied sentiment analysis and emotion detection to press articles and illustrations to explore differences in the journalistic treatment of various political figures.
Abstract: In this article, I analyzed how signatures on a nationwide constitutional petition correlated with various socioeconomic/political variables in French cities. Low education levels turned out to be strongly associated with fewer signatures. I explored the possibility that this could be the result of poor media coverage. I then measured media coverage of the petition by applying a speech-to-text model to public television news archives. The article has been cited in a book, in research papers, and in an appeal to the French Constitutional Court to increase media coverage of these petitions.
Abstract: We analyzed voter trajectories between the French presidential and European elections using a Bayesian ecological inference model. We assessed how these trajectories were influenced by socio-economic factors. This revealed, among other things, the rallying of the right-wing bourgeoisie behind Macron.
Abstract: In this paper, we compared mortality data with the official death toll attributed to Covid. We showed that the number of deaths attributed to Covid significantly underestimated the actual number of deaths. We then showed that the government later reduced the discrepancy by accounting for Covid-related deaths occurring in nursing homes, but that there remained an unaccounted-for excess mortality in deaths occurring at home that could be attributed to Covid. This article has been cited in research papers.
Abstract: In this article, we used data from the Oxford Government Response Tracker to show that France took containment measures against Covid relatively late compared to other countries, given the timing of the epidemics. We also estimated how many deaths could have been avoided if certain measures had been taken a few days earlier, by adapting a simulation from Imperial College London.
Abstract: Language use in everyday life can be studied using lightweight, wearable recorders that collect long-form recordings—that is, audio (including speech) over whole days. The hardware and software underlying this technique are increasingly accessible and inexpensive, and these data are revolutionizing the language acquisition field. We first place this technique into the broader context of the current ways of studying both the input being received by children and children's own language production, laying out the main advantages and drawbacks of long-form recordings. We then go on to argue that a unique advantage of long-form recordings is that they can fuel realistic models of early language acquisition that use speech to represent children's input and/or to establish production benchmarks. To enable the field to make the most of this unique empirical and conceptual contribution, we outline what this reverse engineering approach from long-form recordings entails, why it is useful, and how to evaluate success.
Abstract: The technique of long-form recordings via wearables is gaining momentum in different fields of research, notably linguistics and neurology. This technique, however, poses several technical challenges, some of which are amplified by the peculiarities of the data, including their sensitivity and their volume. In this paper, we begin by outlining key problems related to the management, storage, and sharing of the corpora that emerge when using this technique. We continue by proposing a multi-component solution to these problems, specifically in the case of daylong recordings of children. As part of this solution, we release ChildProject, a Python package for performing the operations typically required by such datasets and for evaluating the reliability of annotations using a number of measures commonly used in speech processing and linguistics. This package builds upon an annotation management system, which allows the importation of annotations from a wide range of existing formats, as well as upon data validation procedures, which assert the conformity of the data, or, alternatively, produce detailed and explicit error reports. Our proposal could be generalized to populations other than children and beyond linguistics.
Abstract: What are the vocal experiences of children growing up on Malakula island, Vanuatu, where multilingualism is the norm? Long-form audio recordings captured spontaneous speech behavior by, and around, 38 children (5–33 months, 23 girls) from 11 villages. Automated analyses revealed most children's vocal input came from female adults and other children's voices, with small contributions from male adult voices. The greatest changes with age involved an increase in the input vocalizations from other children. Total input (collapsing across child-directed and overheard speech, and across languages) was ∼11 min per hour, which was at least 5 min (31%) lower than that found in other populations studied using comparable methods in previous literature, as well as in archival American data analyzed with the same algorithm. In contrast, children's own vocalization counts were two to four times higher than previous reports for North-American English-learning monolingual infants at matched ages, and comparable to estimates from archival American data, consistent with a resilient language-learning cognitive system for this aspect of vocal development. The strongest association between input and output was with vocalizations by other children, rather than those by adults, which is consistent with research in anthropology but less so with current theoretical trends in developmental psychology. These results invite further research in populations that are under-represented in developmental science.
Abstract: According to Peter Galison, the coordination of different “subcultures” within a scientific field happens through local exchanges within “trading zones.” In his view, the workability of such trading zones is not guaranteed, and science is not necessarily driven towards further integration. In this paper, we develop and apply quantitative methods (using semantic, authorship, and citation data from scientific literature), inspired by Galison’s framework, to the case of the disunity of high-energy physics. We give prominence to supersymmetry, a concept that has given rise to several major but distinct research programs in the field, such as the formulation of a consistent theory of quantum gravity or the search for new particles. We show that “theory” and “phenomenology” in high-energy physics should be regarded as distinct theoretical subcultures, between which supersymmetry has helped sustain scientific “trades.” However, as we demonstrate using a topic model, the phenomenological component of supersymmetry research has lost traction and the ability of supersymmetry to tie these subcultures together is now compromised. Our work supports the view that even fields with an initially strong sentiment of unity may eventually generate diverging research programs and demonstrates the fruitfulness of the notion of trading zones for informing quantitative approaches to scientific pluralism.
Abstract: How do scientists navigate between the need to capitalize on their prior knowledge through specialization, and the urge to adapt to evolving research opportunities? Drawing from diverse perspectives on adaptation, this paper proposes an unsupervised Bayesian approach, motivated by Optimal Transport, to modeling the evolution of scientists' research portfolios in response to transformations in their field. The model relies on 186,162 scientific abstracts and authorship data to evaluate the influence of intellectual, social, and institutional resources on scientists' trajectories within a cohort of 2,195 high-energy physicists between 2000 and 2019. Using Inverse Optimal Transport, the reallocation of research efforts is shown to be shaped by learning costs, thus enhancing the utility of the scientific capital disseminated among scientists. Two dimensions of social capital, namely "diversity" and "power", have opposite associations with the magnitude of change in scientists' research interests: while "diversity" disrupts and expands research interests, "power" is associated with more stable research agendas. Social capital plays a more crucial role in shifts between cognitively distant research areas. More generally, this work suggests new approaches for understanding, measuring and modeling collective adaptation using Optimal Transport.
Abstract: Long-form audio recordings are increasingly used to study individual variation, group differences, and many other topics in theoretical and applied fields of developmental science, particularly for the description of children's language input (typically speech from adults) and children's language output (ranging from babble to sentences). The proprietary LENA software has been available for over a decade, and with it, users have come to rely on derived metrics like adult word count (AWC) and child vocalization counts (CVC), which have also more recently been derived using an open-source alternative, the ACLEW pipeline. Yet, there is relatively little work assessing the reliability of long-form metrics in terms of the stability of individual differences across time. Filling this gap, we analyzed eight spoken-language datasets: four from North American English-learning infants, and one each from British English-, French-, American English-Spanish-, and Quechua-Spanish-learning infants. The audio data were analyzed using two types of processing software: LENA and the ACLEW open-source pipeline. When all corpora were included, we found relatively low to moderate reliability (across multiple recordings, the intraclass correlation coefficient attributed to child identity (Child ICC) was <50% for most metrics). There were few differences between the two pipelines. Exploratory analyses suggested some differences as a function of child age and corpora. These findings suggest that, while reliability is likely sufficient for various group-level analyses, caution is needed when using either LENA or ACLEW tools to study individual variation. We also encourage improvement of extant tools, specifically targeting accurate measurement of individual variation.