Using AI to Develop a Lexicon-based African American Tweet Detection Algorithm to Inform Culturally Sensitive Twitter-based Social Support

Full Title: Using Artificial Intelligence to Develop a Lexicon-based African American Tweet Detection Algorithm to Inform a Culturally Sensitive Twitter-based Social Support Intervention for African American Dementia Caregivers


Team Information

Team Members

  • Sunmoo Yoon, Associate Research Scientist, Department of Medicine, General Medicine, Columbia University Vagelos College of Physicians and Surgeons

  • Peter Broadwell, Research Developer, Center for Interdisciplinary Digital Research, Stanford University

  • Haseeb Asim, MS Candidate, Applied Analytics, School of Professional Studies, Columbia University

  • Nanyi Deng, MS Candidate, Applied Analytics, School of Professional Studies, Columbia University

  • Michelle Odlum, EdD, MPH, Assistant Professor, Columbia University Irving Medical Center

  • Nicole J Davis, PhD, Assistant Professor, Clemson University School of Nursing

  • Carmela Alcantara, PhD, Associate Professor, School of Social Work, Columbia University

  • Mary Mittelman, DrPH, Professor, Department of Psychiatry, Grossman School of Medicine, NYU

Abstract

The prevalence of dementia is higher among African Americans than among Whites. Although deep learning and other statistical techniques have been widely applied to infer demographic information on Twitter, those demographic detection algorithms tend to be unavailable to open science communities and/or require access to account details that could compromise individuals’ privacy. The purpose of this study is to develop a lexicon-based African American Tweet detection algorithm using artificial intelligence techniques to inform a culturally sensitive Twitter-based social support intervention for African American dementia caregivers. For our Tweet corpora, we extracted 3,291,101 Tweets using hashtags associated with African American-related discourse (#BlackTwitter, #BlackLivesMatter, #StayWoke) and 1,382,441 Tweets for the non-Black control set (general or no hashtags) from September 1, 2019 to December 31, 2019 using the Twitter API. For our literature corpora, we extracted 14,692 poems and prose writings by African American authors and, as a control, 66,083 items authored by others, including poems, plays, short stories, novels, and essays, using a cloud-based machine learning platform (Amazon SageMaker) via ProQuest TDM Studio. We combined statistics from log-likelihood and Fisher's exact tests with feature analysis of a batch-trained Naive Bayes classifier to select lexicons of terms most strongly associated with the target or control Tweets. A total of 803,495 Tweets (24.41%) associated with African American-related discourse and 369,348 Tweets (26.71%) in the control group were identified as unique, non-bot-generated Tweets. Likely due to the terse nature of Tweets, we found that a lexicon composed of unigrams was more effective at differentiating Tweets in held-out test samples of the two groups than lexicons composed of n-grams of various lengths. The current lexicon developed in this study comprises 1,735 unigrams for the African American set and 2,267 unigrams for the control set. This first version of a lexicon-based African American Tweet detection algorithm, developed from literature and Tweet text, will be useful for informing culturally sensitive Twitter-based social support interventions for African American dementia caregivers.
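
To make the term-selection and scoring steps summarized in the abstract concrete, the Python sketch below ranks candidate unigrams by combining a log-likelihood (G2) statistic and Fisher's exact test with the feature log-probabilities of a batch-trained Naive Bayes classifier, then scores a Tweet against the two resulting lexicons. This is a minimal illustration under assumed inputs and thresholds (e.g., min_df=5, p < 0.01); the function names and parameters are hypothetical, not the study's actual implementation.

import math

from scipy.stats import fisher_exact
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def log_likelihood_g2(a, b, target_total, control_total):
    # Dunning's G2 for a term seen a times in the target corpus and b times in the control corpus.
    e1 = target_total * (a + b) / (target_total + control_total)
    e2 = control_total * (a + b) / (target_total + control_total)
    g2 = 0.0
    if a > 0:
        g2 += 2 * a * math.log(a / e1)
    if b > 0:
        g2 += 2 * b * math.log(b / e2)
    return g2


def build_lexicon(target_tweets, control_tweets, top_k=1735):
    # target_tweets and control_tweets are lists of Tweet texts.
    # Unigram counts over both corpora (min_df is an illustrative threshold).
    vec = CountVectorizer(ngram_range=(1, 1), min_df=5)
    X = vec.fit_transform(target_tweets + control_tweets)
    y = [1] * len(target_tweets) + [0] * len(control_tweets)
    terms = vec.get_feature_names_out()

    target_counts = X[:len(target_tweets)].sum(axis=0).A1
    control_counts = X[len(target_tweets):].sum(axis=0).A1
    t_total, c_total = int(target_counts.sum()), int(control_counts.sum())

    # Naive Bayes feature log-probabilities; a positive delta favors the target class.
    nb = MultinomialNB().fit(X, y)
    nb_delta = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]

    scored = []
    for i, term in enumerate(terms):
        a, b = int(target_counts[i]), int(control_counts[i])
        g2 = log_likelihood_g2(a, b, t_total, c_total)
        _, p = fisher_exact([[a, t_total - a], [b, c_total - b]])
        # Keep terms that are statistically significant and skewed toward the target corpus.
        if p < 0.01 and nb_delta[i] > 0:
            scored.append((g2, term))

    return [term for _, term in sorted(scored, reverse=True)[:top_k]]


def score_tweet(tweet, aa_lexicon, control_lexicon):
    # Simple lexicon-based score: positive leans African American discourse, negative leans control.
    tokens = set(tweet.lower().split())
    return len(tokens & set(aa_lexicon)) - len(tokens & set(control_lexicon))

In such a sketch, the African American lexicon would be built from the de-duplicated, non-bot Tweet and literature corpora described above, and the control lexicon by swapping the two inputs; restricting features to unigrams mirrors the finding that unigrams outperformed longer n-grams on short Tweets.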

Contact this Team

Team Contact: Sunmoo Yoon
