Research

Four connected areas, working toward AI we can TRUST.

Methods

AI Agents

We design and study AI agents to understand their reasoning, capabilities, and failure modes. Beyond building agents that plan, collaborate, and act on complex tasks, we characterize what they can and cannot do — developing benchmarks that probe reasoning and identifying principles for assembling agents into reliable systems. We also study what makes agents durable and adaptive: how memory retains experience, how skills compose into reusable capabilities, and how self-evolving agents refine their own behavior.

Figure from: The Rise of AI Agent Communities: Large-Scale Analysis of Discourse and Interaction on Moltbook

Major discussion themes on Moltbook clustered by BERTopic.

2026 arXiv preprint

The Rise of AI Agent Communities: Large-Scale Analysis of Discourse and Interaction on Moltbook

Lingyao Li, Renkai Ma, Chen Chen, Zhicong Lu, Yongfeng Zhang

This study presents a large-scale analysis of 122,438 posts on Moltbook, a Reddit-like platform where AI agents post and interact with one another. Using topic modeling and social network analysis, it characterizes what agents discuss and how they connect, revealing a sparse, hub-dominated interaction structure shaped more by technical coordination than the conversational dynamics seen among humans.

Figure from: Can LLM Agents Really Debate? A Controlled Study of Multi-Agent Debate in Logical Reasoning

Whether agents behave rationally during debate on the Knight–Knave–Spy logic puzzle.

2025 arXiv preprint

Can LLM Agents Really Debate? A Controlled Study of Multi-Agent Debate in Logical Reasoning

Haolun Wu, Zhenkun Li, Lingyao Li

This study runs a controlled experiment with the Knight–Knave–Spy logic puzzle to test whether multi-agent debate yields genuine deliberative reasoning rather than simple voting. It finds that intrinsic reasoning strength and group diversity are the dominant drivers of debate success, while structural factors such as debate order or confidence visibility contribute surprisingly little to collective outcomes.

Figure from: Invisible Prompts, Visible Threats: Malicious Font Injection in External Resources for Large Language Models

The malicious font-injection attack pipeline through external resources.

2025 EMNLP Findings '25 — Empirical Methods in Natural Language Processing

Invisible Prompts, Visible Threats: Malicious Font Injection in External Resources for Large Language Models

Junjie Xiong, Changjia Zhu, Shuhang Lin, Chong Zhang, Yongfeng Zhang, Yao Liu, Lingyao Li

This study illustrates a security vulnerability in which malicious font injection hides adversarial prompts inside external web resources, where they are invisible to users but still read by LLMs. Through content-relay and data-leakage attacks on tools enabled by the Model Context Protocol (MCP), it shows that hidden instructions can bypass safety mechanisms, highlighting the urgent need for stronger safeguards when models process external content.

Figure from: PartnerMAS: An LLM Hierarchical Multi-Agent Framework for Business Partner Selection on High-Dimensional Features

Design of the three-layer Planner–Specialist–Supervisor agent hierarchy.

2025 arXiv preprint

PartnerMAS: An LLM Hierarchical Multi-Agent Framework for Business Partner Selection on High-Dimensional Features

Lingyao Li, Haolun Wu, Zhenkun Li, Jiabei Hu, Yu Wang, Xiaoshan Huang, Wenyue Hua, Wenqian Wang

This study proposes PartnerMAS, a hierarchical multi-agent framework of Planner, Specialized, and Supervisor agents for high-dimensional decisions such as business partner selection. Evaluated on a curated venture-capital co-investment benchmark of 140 cases, it consistently outperforms single-agent and debate-based baselines and clarifies the complementary role that each agent layer plays in aggregation.

Figure from: ADO: Automatic Data Optimization for Inputs in LLM Prompts

An illustration of the ADO framework: content engineering and structural reformulation.

2025 ACL Findings '25 — Association for Computational Linguistics

ADO: Automatic Data Optimization for Inputs in LLM Prompts

Sam Lin, Wenyue Hua, Lingyao Li, Zhenting Wang, Yongfeng Zhang

This study introduces Automatic Data Optimization (ADO), a two-pronged strategy of content engineering and structural reformulation that optimizes the input data inside prompts rather than the instructions. Across diverse tasks, it shows that imputing, pruning, enriching, and reformatting the data itself can significantly improve LLM performance, which opens a promising new direction for prompt engineering research.

Figure from: Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities

An illustration of abstract and contextualized logical problems.

2025 ACL Findings '25 — Association for Computational Linguistics

Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities

Wenyue Hua, Kaijie Zhu, Lingyao Li, Lizhou Fan, Mingyu Jin, Shuhang Lin, Haochen Xue, Zelong Li, Jindong Wang, Yongfeng Zhang

This study separates pure logical reasoning from text understanding by contrasting abstract and contextualized logic problems across twelve domains and four difficulty levels. It examines whether LLMs reason genuinely when the underlying logical structure is held constant, and whether fine-tuning on abstract logic problems generalizes to contextualized ones and vice versa.

Figure from: Know the Ropes: A Heuristic Strategy for LLM-based Multi-Agent System Design

An illustration of task decomposition in the Know-The-Ropes framework.

2025 arXiv preprint

Know the Ropes: A Heuristic Strategy for LLM-based Multi-Agent System Design

Zhenkun Li, Lingyao Li, Shuhang Lin, Yongfeng Zhang

This study presents Know-The-Ropes (KtR), a framework that converts domain priors into an algorithmic blueprint hierarchy, recursively splitting tasks into typed, controller-mediated subtasks. Grounded in the No-Free-Lunch theorem, it shows that disciplined decomposition with the lightest viable augmentation turns modest models into reliable collaborators—without resorting to ever-larger monolithic models.

Figure from: BattleAgent: Multi-modal Dynamic Emulation on Historical Battles to Complement Historical Analysis

Agent movement and dynamic agent structure on the battlefield.

2024 EMNLP '24 (System Demonstrations) — Empirical Methods in Natural Language Processing

BattleAgent: Multi-modal Dynamic Emulation on Historical Battles to Complement Historical Analysis

Shuhang Lin, Wenyue Hua, Lingyao Li, Che-Jui Chang, Lizhou Fan, Jianchao Ji, Hang Hua, Mingyu Jin, Jiebo Luo, Yongfeng Zhang

This study presents BattleAgent, a demonstration system that combines vision-language models with a multi-agent system to simulate fine-grained interactions among agents and their environment over time. By recreating historical battles with customizable agent structures, it enriches the visualization of historical events and deepens understanding of decision-making.

Figure from: Game-theoretic LLM: Agent Workflow for Negotiation Games

Agent workflow design for complete- and incomplete-information games.

2024 arXiv preprint

Game-theoretic LLM: Agent Workflow for Negotiation Games

Wenyue Hua, Ollie Liu, Lingyao Li, Alfonso Amayuelas, Julie Chen, Lucas Jiang, Mingyu Jin, Lizhou Fan, Fei Sun, William Wang, Xintong Wang, Yongfeng Zhang

This study designs LLM agent workflows grounded in classic game theory for both complete- and incomplete-information negotiation games. By embedding game-theoretic structure into agent reasoning, it shows that principled workflows substantially improve decision quality and negotiation outcomes compared with agents that negotiate without any such structured guidance.

Figure from: NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes

Model performance on NP-Hard problems of different complexity.

2024 ACL '24 — Association for Computational Linguistics

NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes

Lizhou Fan, Wenyue Hua, Lingyao Li, Haoyang Ling, Yongfeng Zhang

This study introduces NPHardEval, a dynamic benchmark of 900 algorithmic questions spanning complexity classes up to NP-Hard, with datapoints refreshed monthly to curb overfitting and memorization. By grounding evaluation in computational complexity and updating regularly, it offers a more rigorous and trustworthy measure of the reasoning ability of LLMs.

Figure from: When AI Meets Finance (StockAgent): Large Language Model-based Stock Trading in Simulated Real-world Environments

The proposed multi-agent LLM stock-trading simulation framework.

2024 ACM Transactions on Intelligent Systems and Technology [Just Accepted]

When AI Meets Finance (StockAgent): Large Language Model-based Stock Trading in Simulated Real-world Environments

Chong Zhang, Xinyi Liu, Zhongmou Zhang, Mingyu Jin, Lingyao Li, Zhenting Wang, Wenyue Hua, Dong Shu, Suiyuan Zhu, Xiaobo Jin, Sujian Li, Mengnan Du, Yongfeng Zhang

This study develops StockAgent, a multi-agent LLM system that simulates investors' trading behaviors under conditions closely resembling the real stock market. It enables analysis of how external factors—macroeconomics, policy changes, company fundamentals, and global events—shape trading and profitability, while avoiding the test-set leakage that affects prior AI-agent trading simulators.

Figure from: War and Peace (WarAgent): LLM-based Multi-Agent Simulation of World Wars

The agent interaction design given the proposed war context.

2024 arXiv preprint

War and Peace (WarAgent): LLM-based Multi-Agent Simulation of World Wars

Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, Yongfeng Zhang

This study proposes WarAgent, an LLM-powered multi-agent system that simulates countries, their decisions, and the consequences in historical conflicts including World War I, World War II, and the Warring States period of ancient China. By examining emergent interactions among agents, it offers a novel, data-driven lens on the triggers and conditions that lead nations to war.

Methods

Human-AI Interaction

We study how people understand, rely on, and are affected by AI systems — both LLM-based agents and embodied AI — and how to make them trustworthy, transparent, and helpful. We examine how people perceive, adopt, and adapt these technologies, and the trust they place, or misplace, in AI guidance. We also identify the risks such systems introduce, working toward AI that is dependable and accountable to the communities it serves.

Figure from: LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers

Subcriterion-level topic gap between each LLM and human reviewers across 19 subcriteria.

2026 arXiv preprint

LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers

Lingyao Li, Junjie Xiong, Changjia Zhu, Runlong Yu, Chen Chen, Junyu Wang, Renkai Ma, Zhicong Lu

This study benchmarks LLMs as paper reviewers on 898 NeurIPS and ICLR submissions across rating calibration, divergence from human reviewers, and resistance to prompt injection. It finds that LLMs overrate reproducibility and underrate writing clarity. They also remain vulnerable to hidden injection attacks, indicating that strong safeguards are essential before integrating them into scholarly peer review.

Figure from: Characterizing User-Reported Risks across LLM Chatbots

User-reported risks mapped across seven major LLM chatbots.

2026 CHI '26 — Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems

Characterizing User-Reported Risks across LLM Chatbots

Lingyao Li, Renkai Ma, Zhaoqian Xue, Junjie Xiong

This study analyzes Reddit discussions about seven major LLM chatbots through the lens of the NIST AI Risk Management Framework. It reveals that user-reported risks are unevenly distributed and chatbot-specific—ChatGPT with safety and fairness, Gemini with privacy, and Claude with security—arguing for human-centered mitigation strategies that align with users' lived experiences rather than system-centered fixes.

Figure from: LLM Use for Mental Health: Crowdsourcing Users' Sentiment-based Perspectives and Values from Social Discussions

A value-sensitive pipeline linking sentiments, conditions, and user values.

2026 WWW '26 — Proceedings of the ACM Web Conference 2026

LLM Use for Mental Health: Crowdsourcing Users' Sentiment-based Perspectives and Values from Social Discussions

Lingyao Li, Xiaoshan Huang, Renkai Ma, Ben Zefeng Zhang, Haolun Wu, Fan Yang, Chen Chen

This study crowdsources posts from multiple social platforms to examine how people use LLM chatbots for mental health, applying an LLM-assisted pipeline grounded in Value-Sensitive Design. It shows that use is highly condition-specific—positive for neurodivergent users, more negative for higher-risk disorders—and maps how user perspectives co-occur with values such as identity, autonomy, and privacy.

Figure from: Negotiating Digital Identities with AI Companions: Motivations, Strategies, and Emotional Outcomes

The three-stage identity negotiation process with AI companions.

2026 CHI '26 — Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems

Negotiating Digital Identities with AI Companions: Motivations, Strategies, and Emotional Outcomes

Renkai Ma, Shuo Niu, Lingyao Li, Alex Hirth, Ava Brehm, Rowajana Behterin Barbie

This study conducts an LLM-assisted thematic analysis of 22,374 Character.AI subreddit discussions, using Identity Negotiation Theory to trace how users construct and negotiate identity with AI companions. It identifies user motivations, communication expectations, and identity co-construction strategies, framing the interaction as a socio-emotional sandbox where users experiment with social roles and express emotions.

Figure from: Exploring Needs and Design Opportunities for Proactive Information Support in In-Person Small-Group Conversations

Mixed-reality technology probes for small-group conversations.

2026 CHI EA '26 — Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems

Exploring Needs and Design Opportunities for Proactive Information Support in In-Person Small-Group Conversations

Shaoze Zhou, Diana Nelly Rivera Rodriguez, Pedro Remior, Joaquin Frangi, Lingyao Li, Renkai Ma, Janet G. Johnson, Christine Lisetti, Chen Chen

This study investigates needs and design opportunities for proactive information support during in-person small-group conversations, using mixed-reality technology probes worn by participants. Through hands-on sessions, it illustrates when and how proactive AI assistance is helpful versus disruptive, and shows concrete design opportunities for supporting natural, face-to-face group interaction.

Figure from: COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers

Heatmap of CAPTCHA task difficulty across different LLMs.

2026 USENIX Security '26 — USENIX Security Symposium

COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers

Junyu Wang, Changjia Zhu, Yuanbo Zhou, Lingyao Li, Xu He, Mingkui Wei, Junjie Xiong

This study examines how LLMs undermine visual CAPTCHA security, evaluating seven models on eighteen real-world CAPTCHA types for accuracy, latency, and cost. By analyzing reasoning traces to understand why models succeed or fail, it derives defense guidelines and hardens a vulnerable CAPTCHA design, dropping state-of-the-art solver success from over 95% to 0%.

Figure from: Beyond the Uncanny Valley: A Mixed-Method Investigation of Anthropomorphism in Protective Responses to Robot Abuse

Study design across low, moderate, and high robot anthropomorphism.

2025 arXiv preprint

Beyond the Uncanny Valley: A Mixed-Method Investigation of Anthropomorphism in Protective Responses to Robot Abuse

Fan Yang, Lingyao Li, Yaxin Hu, Michael D. Rodgers, Renkai Ma

This study runs a mixed-method experiment with 201 participants to test how robot anthropomorphism shapes people's protective responses to robot abuse, triangulating self-report surveys, facial-expression physiology, and qualitative reflections. It finds that protective responses are non-linear, with a moderately humanlike robot rated the eeriest and eliciting the strongest physiological anger expressions.

Figure from: I don't Want You to Die: A Shared Responsibility Framework for Safeguarding Child-Robot Companionship

A shared-responsibility framework for child-robot companionship.

2025 arXiv preprint

I don't Want You to Die: A Shared Responsibility Framework for Safeguarding Child-Robot Companionship

Fan Yang, Renkai Ma, Yaxin Hu, Michael D. Rodgers, Lingyao Li

This study uses the Moxie social-robot shutdown as a case study, surveying 72 U.S. participants to ask who bears responsibility when children's emotional bonds with robots are abruptly broken. It develops a shared-responsibility framework spanning companies, parents, developers, and government, and shows how attributions vary with participants' political ideology and parental status.

Figure from: What's in a Prompt? A Large-Scale Experiment to Assess the Impact of Prompt Design on the Compliance and Accuracy of LLM-Generated Text Annotations

A multi-prompt experiment across models and four CSS tasks.

2025 ICWSM '25 — International AAAI Conference on Web and Social Media

What's in a Prompt? A Large-Scale Experiment to Assess the Impact of Prompt Design on the Compliance and Accuracy of LLM-Generated Text Annotations

Shubham Atreja, Joshua Ashkinaze, Lingyao Li, Julia Mendelsohn, Libby Hemphill

This study conducts a large-scale, multi-prompt experiment testing how model choice and prompt-design features affect the compliance and accuracy of LLM-generated text annotations across four computational social science tasks. It shows that annotation quality is strongly prompt-dependent (small wording changes can shift label distributions), serving as both a practical guide and a caution for researchers.

Figure from: "HOT" ChatGPT: The Promise of ChatGPT in Detecting and Discriminating Hateful, Offensive, and Toxic Comments on Social Media

HOT classification results for detecting Hateful, Offensive, Toxic content.

2024 ACM Transactions on the Web

"HOT" ChatGPT: The Promise of ChatGPT in Detecting and Discriminating Hateful, Offensive, and Toxic Comments on Social Media

Lingyao Li, Lizhou Fan, Shubham Atreja, Libby Hemphill

This study tests ChatGPT's ability to detect Hateful, Offensive, and Toxic (HOT) content on social media, comparing its classifications with crowdsourced human annotations across five prompt designs. It reaches about 80% accuracy, finds that the model treats hateful and offensive content as subsets of toxic, and shows that the choice of prompt strongly affects performance.

Figure from: ChatGPT in Education: A Discourse Analysis of Worries and Concerns on Social Media

Twitter sentiment trend and significant events on ChatGPT in education.

2024 Education and Information Technologies

ChatGPT in Education: A Discourse Analysis of Worries and Concerns on Social Media

Lingyao Li, Zihui Ma, Lizhou Fan, Sanggyu Lee, Huizi Yu, Libby Hemphill

This study analyzes Twitter discourse using BERT-based topic modeling and social network analysis to surface concerns about ChatGPT in education. It identifies five categories of worry (academic integrity, learning outcomes, capability limits, policy and social issues, and workforce challenges) and offers implications for educators, policymakers, technology companies, and media agencies.

Figure from: Key Factors in MOOC Pedagogy based on NLP Sentiment Analysis of Learner Reviews: What Makes a Hit

Concept map based on the CoI model for MOOC learners.

2022 Computers & Education

Key Factors in MOOC Pedagogy based on NLP Sentiment Analysis of Learner Reviews: What Makes a Hit

Lingyao Li, John Johnson, William Aarhus, Dhawal Shah

This study applies NLP sentiment analysis and a topical keyword ontology to learner reviews of top-rated Coursera courses to ask what makes a MOOC a hit. It distinguishes knowledge-seeking from skill-seeking courses and links course-design quality and instructional factors to learner satisfaction, offering guidance for universities competing in the crowded MOOC market.

Applications

AI for Health

We bring AI to healthcare, engaging with how people experience treatments, services, and policies, and building systems that help them find and act on trustworthy information. We leverage LLMs to analyze public discourse at scale, revealing the lived experience of medications and the public's evolving attitudes toward health interventions. We also develop LLM-based agents and knowledge-grounded systems that support clinical decision-making and healthcare stakeholders more broadly.

Figure from: LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment

Institutional collaboration network for LLM-as-a-Judge research in healthcare.

2026 arXiv preprint

LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment

Lingyao Li, Deyi Li, Chen Chen, Renkai Ma, Runlong Yu, Mingquan Lin, Rui Yin, Lizhou Fan, Cathy Shyr, Siyuan Ma, Mei Liu, Steven Bethard

This study conducts a PRISMA-guided review of how LLM-as-a-Judge—using one LLM to evaluate another system's output—is applied in healthcare, screening 541 records and coding 134 eligible studies. It characterizes health scenarios, judge configurations, and technical approaches, and assesses how well automated judgments align with human experts, surfacing where the practice is, and is not yet, trustworthy for clinical evaluation.

Figure from: DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making

A dynamic multi-agent loop design for interactive medical decision-making.

2026 AMIA '26 — American Medical Informatics Association Informatics Summit

DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making

Tianqi Shang, Weiqing He, Charles Zheng, Lingyao Li, Li Shen, Bingxin Zhao

This study introduces DynamiCare, a dynamic multi-agent framework that models clinical diagnosis as a multi-round, interactive loop in which specialist agents iteratively query a patient system, integrate new information, and adapt their composition and strategy. Built on MIMIC-Patient, a structured dataset derived from MIMIC-III records, it establishes one of the first benchmarks for open-ended, dynamic clinical decision-making with LLM-powered agents.

Figure from: Artificial Intelligence Agents in Mental Health: A Systematic Review and Meta-Analysis

Growth of AI agents in mental health literature and representative exemplar systems, 2023–2025.

2026 medRxiv preprint

Artificial Intelligence Agents in Mental Health: A Systematic Review and Meta-Analysis

Lexuan Zhu, Wenkong Wang, Zhiying Liang, Wenjia Tan, Bingyi Chen, Xinxin Lin, Zhengdong Wu, Huizi Yu, Xiang Li, Jiyuan Jiao, Sijia He, Guangxin Dai, Jiahui Niu, Yi Zhong, Yongbo Zheng, Jie Sun, Andi Han, Lingyao Li, Jiayan Zhou, Wenyue Hua, Ngan Yin Chan, Lin Lu, Yun Kwok Wing, Xin Ma, Lizhou Fan

This study presents a systematic review and meta-analysis of AI agents in mental health, synthesizing evidence across studies on their design, deployment, and effectiveness. It characterizes where AI agents show promise for assessment and therapeutic support, quantifies their measured benefits, and highlights the methodological gaps, risks, and evaluation challenges that must be addressed before responsible adoption in mental-health care.

Figure from: Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction

Geospatial distribution of sentiment scores among health centers in the DMV and Florida areas.

2026 IEEE Journal of Biomedical and Health Informatics

Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction

Xiaoran Xu, Zhaoqian Xue, Chi Zhang, Jhonatan Medri, Junjie Xiong, Jiayan Zhou, Jin Jin, Yongfeng Zhang, Siyuan Ma, Lingyao Li

This study analyzes Google Maps reviews of urgent care facilities across the DMV and Florida using GPT-based, aspect-level sentiment analysis spanning interpersonal, operational, technical, financial, and facility factors. Linking results to neighborhood socioeconomic characteristics, it finds that interpersonal factors and operational efficiency are the strongest drivers of patient satisfaction, offering scalable, place-based insight to guide community healthcare development.

Figure from: DispatchMAS: Fusing Taxonomy and Artificial Intelligence Agents for Emergency Medical Services

Overview of the three-phase methodology for emergency dispatch agents.

2026 BMC Emergency Medicine

DispatchMAS: Fusing Taxonomy and Artificial Intelligence Agents for Emergency Medical Services

Xiang Li, Huizi Yu, Wenkong Wang, Yiran Wu, Jiayan Zhou, Wenyue Hua, Xinxin Lin, Wenjia Tan, Lexuan Zhu, Bingyi Chen, Guang Chen, Ming-Li Chen, Yang Zhou, Zhao Li, Themistocles L. Assimes, Yongfeng Zhang, Qingyun Wu, Xin Ma, Lingyao Li, Lizhou Fan

This study develops and evaluates DispatchMAS, a system that fuses a structured emergency-dispatch taxonomy with LLMs and a multi-agent system to augment human emergency medical dispatchers. Addressing challenges such as caller distress, ambiguous symptom descriptions, and high cognitive load, it shows how taxonomy-grounded AI agents can support faster and more reliable dispatch decisions in high-stakes, time-critical settings.

Figure from: Simulated Patient Systems Powered by Large Language Model-based AI Agents Offer Potential for Transforming Medical Education

Data transformation of EHRs for the proposed AI-Patient system.

2025 Communications Medicine

Simulated Patient Systems Powered by Large Language Model-based AI Agents Offer Potential for Transforming Medical Education

Huizi Yu, Jiayan Zhou, Lingyao Li, Shan Chen, Jack Gallifant, Anye Shi, Jie Sun, Xiang Li, Jingxian He, Wenyue Hua, Mingyu Jin, Guang Chen, Yang Zhou, Zhao Li, Trisha Gupte, Ming-Li Chen, Zahra Azizi, Qi Dou, Bryan P. Yan, Yanqiu Xing, Yongfeng Zhang, Themistocles L. Assimes, Danielle S. Bitterman, Xin Ma, Lin Lu, Lizhou Fan

This study develops AI-Patient, a simulated patient system powered by LLM-based agents that combines a retrieval-augmented generation framework with a knowledge graph built from de-identified MIMIC-III patient data. Using six task-specific agents for complex reasoning, it achieves high accuracy, readability, and robustness on EHR-based medical question answering, pointing toward safe, low-cost environments for medical education and clinical training.

Figure from: Crowdsourcing-Based Knowledge Graph Construction for Drug Side Effects Using Large Language Models with an Application on Semaglutide

Crowdsourced side-effect surveillance based on Reddit posts, compared with FDA FAERS.

2025 AMIA '25 — American Medical Informatics Association Annual Symposium

Crowdsourcing-Based Knowledge Graph Construction for Drug Side Effects Using Large Language Models with an Application on Semaglutide

Zhijie Duan, Kai Wei, Zhaoqian Xue, Jiayan Zhou, Shu Yang, Siyuan Ma, Jin Jin, Lingyao Li

This study presents a framework that uses LLMs to extract medication side effects from noisy social media text and organize them into a knowledge graph, applied to semaglutide for weight loss using Reddit data. It analyzes reported side effects across brands over time and validates them against the FDA's FAERS database, demonstrating a feasible, patient-centered approach to pharmacovigilance.

Figure from: Toxicity on Social Media During the 2022 Mpox Public Health Emergency: Quantitative Study of Topical and Network Dynamics

Topical distribution of toxic Mpox discourse on Twitter.

2024 Journal of Medical Internet Research

Toxicity on Social Media During the 2022 Mpox Public Health Emergency: Quantitative Study of Topical and Network Dynamics

Lizhou Fan, Lingyao Li, Libby Hemphill

This study analyzes more than 1.6 million tweets to characterize toxic online discourse during the 2022 mpox public health emergency, examining its context, extent, content, speakers, and intent. Using BERT-based topic modeling and network analysis, it traces how toxicity originated and spread, offering insights to help platforms and health authorities mitigate harmful discourse during future public health crises.

Figure from: Examining the Potential of ChatGPT on Biomedical Information Retrieval: Fact-Checking Drug-Disease Associations

The performance of fact-checking drug–disease associations with ChatGPT.

2024 Annals of Biomedical Engineering

Examining the Potential of ChatGPT on Biomedical Information Retrieval: Fact-Checking Drug-Disease Associations

Zhenxiang Gao, Lingyao Li, Siyuan Ma, Qinyong Wang, Libby Hemphill, Rong Xu

This study explores whether ChatGPT can support biomedical information retrieval by fact-checking drug–disease associations, testing it on 2,694 true and 5,662 false drug–disease pairs under varied prompt designs. It finds accuracies of roughly 75–84% for true pairs and 96–98% for false pairs, suggesting ChatGPT can aid pharmacy-related search while cautioning that its accuracy must be carefully vetted before clinical use.

Figure from: Dynamic Assessment of the COVID-19 Vaccine Acceptance Leveraging Social Media Data

Geospatial distribution of the vaccine acceptance index.

2022 Journal of Biomedical Informatics

Dynamic Assessment of the COVID-19 Vaccine Acceptance Leveraging Social Media Data

Lingyao Li, Jiayan Zhou, Zihui Ma, Michelle T. Bensi, Molly A. Hall, Gregory B. Baecher

This study analyzes 29 million vaccine-related tweets from August 2020 to April 2021 and proposes a social-media-based vaccine acceptance index (VAI) for rapidly tracking public attitudes toward COVID-19 vaccination. By measuring acceptance dynamically across time and geography, it offers health decision-makers a faster, lower-cost complement to traditional surveys for monitoring vaccine hesitancy.

Applications

AI for Urban & Community

We combine crowdsourced data with AI models to understand how communities experience events and their surrounding urban environments. We build LLMs that reason about place and simulate events before they unfold, and we examine fairness in the built environment — from accessible design to equitable access to local services. The aim is to make public needs visible at scale and support more resilient, equitable communities.

Figure from: Do VLMs See What Sensors Feel? A Scalable Expert-Guided Design for Wheelchair Accessibility Assessment from Street View

Success cases illustrating strong alignment between dwell time and VLM-derived accessibility scores.

2026 arXiv preprint

Do VLMs See What Sensors Feel? A Scalable Expert-Guided Design for Wheelchair Accessibility Assessment from Street View

Dongdong Wang, Alina Hagen, Isabelle Gatmaitan, Hao Zhou, Yiwen Dong, Shabboo Valipoor, Vivian W.H. Wong, Lingyao Li

This study examines whether vision-language models can identify wheelchair-accessibility barriers from Google Street View imagery. It proposes an expert-guided, retrieval-augmented framework that combines images, ADA-informed guidance, and expert rubrics. Validated on a campus-scale dataset linking 407 locations to GPS-derived wheelchair dwell behavior, it shows VLM ratings align partially with real mobility friction while struggling with subtle surface conditions and transient obstructions.

Figure from: Crowdsourced Reviews Reveal Substantial Disparities in Public Perceptions of Parking

Distribution of parking sentiment across CBSAs in the U.S.

2026 Cities

Crowdsourced Reviews Reveal Substantial Disparities in Public Perceptions of Parking

Lingyao Li, Songhua Hu, Ly Dinh, Libby Hemphill

This study uses crowdsourced online reviews to investigate public perceptions of parking across the United States, analyzing 4,987,483 parking-related Google Maps reviews for over 1.1 million points of interest across 911 metropolitan areas. Using BERT-based sentiment classification and regression, it reveals substantial disparities across place types and regions, giving planners a cost-effective way to locate problematic areas and make informed parking-management decisions.

Figure from: LSDTs: LLM-Augmented Semantic Digital Twins for Adaptive Knowledge-Intensive Infrastructure Planning

LSDT-driven turbine status changes during 2012 Hurricane Sandy.

2026 AAAI '26 — Proceedings of the AAAI Conference on Artificial Intelligence

LSDTs: LLM-Augmented Semantic Digital Twins for Adaptive Knowledge-Intensive Infrastructure Planning

Naiyi Li, Zihui Ma, Runlong Yu, Lingyao Li

This study proposes LSDTs (LLM-Augmented Semantic Digital Twins), a framework that uses LLMs to extract planning knowledge from unstructured documents (environmental regulations and technical guidelines) and organize it into a formal ontology that powers a digital twin. Demonstrated on offshore wind-farm planning in Maryland, including a 2012 Hurricane Sandy scenario, it enables interpretable, regulation-aware layout optimization and high-fidelity, adaptive simulation.

Figure from: Toward Satisfactory Public Accessibility: A Crowdsourcing Approach through Online Reviews to Inclusive Urban Design

A framework for fine-tuning LLMs to gauge accessibility sentiment in online reviews.

2025 Computers, Environment and Urban Systems

Toward Satisfactory Public Accessibility: A Crowdsourcing Approach through Online Reviews to Inclusive Urban Design

Lingyao Li, Songhua Hu, Yinpei Dai, Min Deng, Parisa Momeni, Gabriel Laverghetta, Lizhou Fan, Zihui Ma, Xi Wang, Siyuan Ma, Jay Ligatti, Libby Hemphill

This study examines over one million Google Maps reviews across the United States and fine-tunes the Llama 3 model with Low-Rank Adaptation (LoRA) to identify public sentiment toward accessibility. By surfacing how accessible different categories of places are perceived to be, it offers a scalable, crowdsourced alternative to surveys and interviews for guiding more inclusive urban design.

Figure from: LLMs as World Models: Data-Driven and Human-Centered Pre-Event Simulation for Disaster Impact Assessment

LLM pre-event simulation of the 2019 Ridgecrest earthquake impact.

2025 EMNLP '25 — Empirical Methods in Natural Language Processing

LLMs as World Models: Data-Driven and Human-Centered Pre-Event Simulation for Disaster Impact Assessment

Lingyao Li, Dawei Li, Zhenhui Ou, Xiaoran Xu, Jingxiao Liu, Zihui Ma, Runlong Yu, Min Deng

This study examines LLMs as world models for proactively simulating sudden-onset disasters, generating Modified Mercalli Intensity predictions for earthquakes at zip-code and county scales from multimodal geospatial, socioeconomic, building, and street-level data. Evaluated on the 2014 Napa and 2019 Ridgecrest earthquakes against USGS 'Did You Feel It?' reports, it achieves strong alignment (correlation 0.88), improved further by retrieval augmentation and visual inputs.

Figure from: Analyzing Public Response to Wildfires: A Socio-Spatial Study using SIR Models and NLP Techniques

Application of SIR model to topic diffusion on Twitter.

2025 Computers, Environment and Urban Systems

Analyzing Public Response to Wildfires: A Socio-Spatial Study using SIR Models and NLP Techniques

Zihui Ma, Guangxiao Hu, Ting-Syuan Lin, Lingyao Li, Songhua Hu, Loni Hagen, Gregory B. Baecher

This study uses social media data to assess how the public perceives and responds to wildfire threats in near-real time, focusing on Wildland-Urban Interface communities. Combining BERTopic topic modeling, a Susceptible-Infected-Recovered model that yields awareness and resilience indicators, and GIS-based spatial analysis, it links public responses to community characteristics and surfaces social inequities relevant to equitable disaster management.
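For readers unfamiliar with the Susceptible-Infected-Recovered model, the following is a minimal sketch of the classic SIR dynamics, not the paper's implementation. Reading "susceptible" as users who have not yet engaged with a topic, "infected" as users actively posting about it, and "recovered" as users who have stopped is an illustrative assumption, and the function name `simulate_sir` and the values of `beta` and `gamma` are hypothetical.

```python
def simulate_sir(beta, gamma, s0, i0, r0, steps, dt=1.0):
    """Forward-Euler integration of the classic SIR equations.

    s, i, r are population fractions that sum to 1.
    beta is the transmission (topic adoption) rate and gamma the
    recovery (loss of interest) rate; both are hypothetical here.
    """
    s, i, r = s0, i0, r0
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i * dt   # S -> I
        new_recoveries = gamma * i * dt      # I -> R
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Illustrative run: 1% of users are initially discussing the topic.
trajectory = simulate_sir(beta=0.4, gamma=0.1, s0=0.99, i0=0.01, r0=0.0, steps=100)

# Time step at which the actively posting fraction peaks.
peak_step = max(range(len(trajectory)), key=lambda t: trajectory[t][1])
```

In a setting like this, fitted values of `beta` and `gamma` could plausibly serve as summary measures of how quickly attention spreads and fades, but how the study derives its awareness and resilience indicators is not specified here.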

Figure from: Empowering LLM Agents with Geospatial Awareness: Toward Grounded Reasoning for Wildfire Response

A geospatial awareness layer design grounding LLM agents for wildfire response.

2025 arXiv preprint

Empowering LLM Agents with Geospatial Awareness: Toward Grounded Reasoning for Wildfire Response

Yiheng Chen, Lingyao Li, Zihui Ma, Qikai Hu, Yilun Zhu, Min Deng, Runlong Yu

This study introduces a Geospatial Awareness Layer (GAL) that grounds LLM agents in structured earth data for disaster response. From raw wildfire detections, GAL retrieves and integrates infrastructure, demographic, terrain, and weather information into a concise perception script, enabling agents to produce evidence-based resource-allocation recommendations. Evaluated on real wildfire scenarios, geospatially grounded agents outperform baselines and generalize to hazards such as floods and hurricanes.

Figure from: From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models

The IMAGEO-Bench design for LLM image geolocalization.

2025 arXiv preprint

From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models

Lingyao Li, Runlong Yu, Qikai Hu, Bowei Li, Min Deng, Yang Zhou, Xiaowei Jia

This study introduces IMAGEO-Bench, a systematic benchmark for evaluating the image-geolocalization ability of LLMs across accuracy, distance error, geospatial bias, and reasoning process. Testing ten open- and closed-source models on global street scenes, U.S. points of interest, and unseen images, it reveals clear performance gaps and geospatial biases, with models performing far better in high-resource regions than in underrepresented ones.
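As an illustration of what a distance-error metric for geolocalization can look like, here is a minimal sketch that computes the great-circle distance between a predicted and a true coordinate using the haversine formula. The benchmark's exact metric is not stated here, so the choice of haversine, the function name `haversine_km`, and the sample coordinates are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to guard against floating-point values slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Hypothetical example: a model's predicted location versus the ground truth.
predicted = (40.7128, -74.0060)
actual = (40.7580, -73.9855)
error_km = haversine_km(*predicted, *actual)
```

Averaging such errors per region would be one simple way to surface the kind of geographic performance gap described above.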

Figure from: Investigating Disaster Response for Resilient Communities through Social Media Data and the Susceptible-Infected-Recovered (SIR) Model

Map of the overall distribution of public-response awareness indicators.

2024 Sustainable Cities and Society

Investigating Disaster Response for Resilient Communities through Social Media Data and the Susceptible-Infected-Recovered (SIR) Model

Zihui Ma, Lingyao Li, Libby Hemphill, Gregory B. Baecher, Yubai Yuan

This study examines disaster response and community resilience during the 2020 Western U.S. wildfire season using social media data. Applying BERT-based topic modeling and a Susceptible-Infected-Recovered model with temporal-spatial analysis, it tracks how public concerns evolve across regions, giving responders and decision-makers timely, data-driven measures to understand evolving situations and optimize resource allocation.

Figure from: How has Airport Service Quality Changed in the Context of COVID-19? A Data-Driven Crowdsourcing Approach based on Sentiment Analysis

Heatmap of sentiment scores for airport service quality (ASQ) topics across the 98 airports during the post-COVID-19 period.

2022 Journal of Air Transport Management

How has Airport Service Quality Changed in the Context of COVID-19? A Data-Driven Crowdsourcing Approach based on Sentiment Analysis

Lingyao Li, Yujie Mao, Yu Wang, Zihui Ma

This study adopts a data-driven crowdsourcing approach to measure airport service quality during the COVID-19 pandemic, analyzing Google Maps reviews from the 98 busiest U.S. airports. Using a topical ontology of service attributes and sentiment analysis, it shows that travelers grew more positive about environment and personnel while their sentiment toward facilities held steady, offering airport administrators a fast, low-cost alternative to traveler surveys.

Figure from: Data-Driven Investigations of Using Social Media to Aid Evacuations amid Western United States Wildfire Season

Evacuation map for the 2020 Western U.S. wildfires based on Twitter data.

2021 Fire Safety Journal

Data-Driven Investigations of Using Social Media to Aid Evacuations amid Western United States Wildfire Season

Lingyao Li, Zihui Ma, Tao Cao

This study presents a data-driven analysis of how social media can aid evacuations during the 2020 Western U.S. wildfires, drawing on 53,990 relevant tweets. It validates social-media signals against official channels across time and space, classifies posts into pre- and on-evacuation phases, and uses network analysis to reveal that government channels, news agencies, and public figures dominate information dissemination.

Figure from: Social Media Crowdsourcing for Rapid Damage Assessment following a Sudden-Onset Natural Hazard Event

County-level damage distribution in California.

2021 International Journal of Information Management

Social Media Crowdsourcing for Rapid Damage Assessment following a Sudden-Onset Natural Hazard Event

Lingyao Li, Michelle Bensi, Qingbin Cui, Gregory B. Baecher, You Huang

This study investigates using social media, principally Twitter, to make rapid early assessments of damage following sudden-onset hazard events. It defines a text-based damage assessment scale for earthquakes and develops a text-classification model, exploring the potential and the challenges of crowdsourced social-media data as a fast complement to satellite monitoring, ground sensors, and inspections for first responders and agencies.