Pragmatic Reasoning in Language Models

Recent advances in large language models (LLMs) have renewed the question of what it means to understand language, and whether pragmatic competence can emerge purely from exposure to statistical patterns. While prior benchmarks have tested models’ handling of implicature or discourse coherence, it remains unclear whether they track fine-grained information-structural constraints in the way humans do. This project takes anaphoric reference as a case study, focusing on definites and demonstratives to evaluate whether models display human-like sensitivity to discourse context.

In collaboration with Jennifer Hu (Johns Hopkins University) and Kathryn Davidson, I evaluated 19 models across English, German, Turkish, and Mandarin. The results show that only larger models replicate human behavior: they prefer demonstratives in contexts supporting contrastive inference, while smaller models capture only partial pragmatic effects. These findings suggest that sensitivity to discourse-level constraints is not trivially learnable from surface statistics, but emerges at scale. Building on this foundation, ongoing work explores how and when such sensitivity arises—probing production behavior, developmental trajectories across training checkpoints, and the internal representations that encode pragmatic distinctions. Together, these directions aim to advance our understanding of pragmatic reasoning in humans and machines alike.
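To make the kind of evaluation described above concrete, here is a minimal sketch (not the project’s actual pipeline) of one common way to probe such preferences: compare the log-probability a model assigns to a definite versus a demonstrative continuation of the same context, in a context that licenses a contrastive reading versus one that does not. The model choice (gpt2), the toy minimal pair, and the scoring function are all illustrative assumptions.

```python
# Illustrative sketch only: scoring definite vs. demonstrative continuations
# by summed token log-probability under an off-the-shelf causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # hypothetical choice; the study itself covered 19 models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Summed log P(continuation tokens | context) under the model."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                        # (1, T, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)      # position i predicts token i+1
    targets = full_ids[0, 1:]
    # Score only the continuation tokens (those after the context prefix).
    return sum(log_probs[i, targets[i]].item()
               for i in range(ctx_len - 1, full_ids.shape[1] - 1))

# Toy minimal pair: a context supporting a contrastive reading vs. a neutral one.
contrastive = ("Two cakes were on the table. One looked fresh and the other "
               "looked stale. I took")
neutral = "A cake was on the table. It looked fresh. I took"

for label, ctx in [("contrastive", contrastive), ("neutral", neutral)]:
    the_lp = continuation_logprob(ctx, " the cake.")
    that_lp = continuation_logprob(ctx, " that cake.")
    print(f"{label:12s}  log P(' the cake.') = {the_lp:.2f}   "
          f"log P(' that cake.') = {that_lp:.2f}")
```

Under the hypothesis sketched above, a pragmatically sensitive model should assign relatively higher probability to the demonstrative continuation in the contrastive context than in the neutral one.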

Selected works: