Filling in the white space: Spatial interpolation with Gaussian processes and social media data.

Giorgi, Salvatore; Eichstaedt, Johannes C; Preotiuc-Pietro, Daniel; Gardner, Jacob R; Schwartz, H Andrew; Ungar, Lyle H

Affiliations

Giorgi S; Department of Computer and Information Science, University of Pennsylvania, United States of America.
Eichstaedt JC; Department of Psychology & Institute for Human-Centered AI, Stanford University, United States of America.
Preotiuc-Pietro D; Bloomberg, United States of America.
Gardner JR; Department of Computer and Information Science, University of Pennsylvania, United States of America.
Schwartz HA; Department of Computer Science, Stony Brook University, United States of America.
Ungar LH; Department of Computer and Information Science, University of Pennsylvania, United States of America.

Curr Res Ecol Soc Psychol ; 5: 2023.

Article in English | MEDLINE | ID: mdl-38125747

ABSTRACT

Full national coverage below the state level is difficult to attain through survey-based data collection. Even the largest survey-based data collections, such as the CDC's Behavioral Risk Factor Surveillance System or the Gallup-Healthways Well-being Index (both with more than 300,000 responses per year), only allow for the estimation of annual averages for about 260 out of roughly 3,000 U.S. counties when a threshold of 300 responses per county is used. A relatively high threshold of 300 responses gives substantially higher convergent validity (higher correlations with health variables) than lower thresholds, but it covers a reduced and biased sample of the population. We present principled methods to interpolate spatial estimates and show that including large-scale geotagged social media data can increase interpolation accuracy. In this work, we focus on Gallup-reported life satisfaction, a widely used measure of subjective well-being. We use Gaussian Processes (GP), a formal Bayesian model, to interpolate life satisfaction, which we optimally combine with estimates from low-count data. We interpolate over several spaces (geographic and socioeconomic) and extend these evaluations to the space created by variables encoding language frequencies of approximately 6 million geotagged Twitter users. We find that Twitter language use can serve as a rough aggregate measure of socioeconomic and cultural similarity, and that it improves upon estimates derived from a wide variety of socioeconomic, demographic, and geographic similarity measures. We show that applying Gaussian Processes to the limited Gallup data allows us to generate estimates for a much larger number of counties while maintaining the same level of convergent validity with external criteria (i.e., N = 1,133 vs. 2,954 counties).
This work suggests that spatial coverage of psychological variables can be reliably extended through Bayesian techniques while maintaining out-of-sample prediction accuracy, and that Twitter language adds important information about cultural similarity over and above traditional socio-demographic and geographic similarity measures. Finally, to facilitate the adoption of these methods, we have also open-sourced an online tool that researchers can freely use to interpolate their data across geographies.
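
The general idea of interpolating county-level estimates with a Gaussian Process while weighting each observed county by its response count can be illustrated with a minimal sketch. This is not the authors' implementation: the squared-exponential kernel, the noise model (noise variance inversely proportional to the number of responses), the hyperparameter values, and the names `rbf_kernel` and `gp_interpolate` are all assumptions made for illustration. The coordinates could stand for geographic, socioeconomic, or language-derived features.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)


def gp_interpolate(x_obs, y_obs, n_resp, x_new,
                   length_scale=1.0, signal_var=1.0, resp_var=1.0):
    """Posterior mean and variance of a GP at the locations x_new.

    Assumption: each observed county mean carries noise variance
    resp_var / n_resp, so counties with few responses are trusted less
    and are pulled more strongly toward similar counties.
    """
    # Constant prior mean: response-weighted average of the observed means.
    prior_mean = np.average(y_obs, weights=n_resp)

    k_oo = rbf_kernel(x_obs, x_obs, length_scale, signal_var)
    k_no = rbf_kernel(x_new, x_obs, length_scale, signal_var)
    noise = np.diag(resp_var / n_resp)

    # Cholesky factorisation for a numerically stable solve.
    chol = np.linalg.cholesky(k_oo + noise)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_obs - prior_mean))

    mean = prior_mean + k_no @ alpha
    v = np.linalg.solve(chol, k_no.T)
    var = signal_var - (v**2).sum(axis=0)
    return mean, var


# Synthetic stand-in for county data (not real survey values).
rng = np.random.default_rng(0)
x_obs = rng.uniform(0, 10, size=(40, 2))        # county coordinates
n_resp = rng.integers(5, 500, size=40)          # responses per county
latent = np.sin(x_obs[:, 0]) + 0.5 * np.cos(x_obs[:, 1])
y_obs = latent + rng.normal(0, 1 / np.sqrt(n_resp))
x_new = rng.uniform(0, 10, size=(5, 2))         # counties without data

mean_new, var_new = gp_interpolate(x_obs, y_obs, n_resp, x_new, length_scale=2.0)
print(mean_new, var_new)
```

Evaluating the same function at an observed county returns a value between that county's own mean and the prediction from its neighbours, which is one way to read "combining" interpolated and low-count estimates. In practice the hyperparameters would be fitted to the data rather than fixed by hand.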

Keywords

Gaussian processes; Geographical psychology; Interpolation; Social media; Twitter

Full text: 1 Collection: 01-international Database: MEDLINE Language: English Journal: Curr Res Ecol Soc Psychol Year: 2023 Document type: Article Affiliation country: United States Country of publication: Netherlands
