Modeling Language Development: How Machine Learning can Enhance Analysis of the Language Environment
Abstract
Language sampling elicits a representative picture of a child’s language and provides methods for assessing functional communication beyond what is offered by standardized tests. Naturalistic sampling reduces time costs and offers an ideal way to assess differences in home language associated with differences in socioeconomic status (SES). Unfortunately, dense naturalistic recordings are difficult to analyze at scale and to mine for meaningful information. This study investigates the use of the Language ENvironment Analysis (LENA) system for sampling home language, combined with technology-assisted transcription and topic modeling. To evaluate the efficacy of transcription, segments were selected according to the amount of meaningful speech they contained, as measured by LENA, and transcribed by Whisper, OpenAI’s automatic speech recognition software. Research assistants trimmed the resulting text files to retain available adult language, separated by utterance. Results suggest that this method of sampling, combined with technology-assisted transcription and automated analysis of traditional language metrics, reproduces expected associations between parental input, SES, and standardized child vocabulary size. Topic models did not identify activity contexts, likely due to the nature of the input. This research presents a validated pipeline that produces dense, representative data and uses modern approaches to reduce traditional time costs.
Rights
http://creativecommons.org/licenses/by-nd/3.0/us/