Language models can be sampled multiple times to access the distribution underlying their responses, but existing methods cannot efficiently synthesize rich epistemic signals across different long-form responses. We introduce Consensus Graphs (ConGrs), a flexible DAG-based data structure that represents shared information, as well as semantic variation, in a set of sampled LM responses to the same prompt. We construct ConGrs using a lightweight lexical sequence alignment algorithm from bioinformatics, supplemented by the targeted usage of a secondary LM judge. Further, we design task-dependent decoding methods to synthesize a single, final response from our ConGr data structure. Our experiments show that synthesizing responses from ConGrs improves factual precision on two biography generation tasks by up to 31% over an average response and reduces reliance on LM judges by more than 80% compared to other methods. We also use ConGrs for three refusal-based tasks requiring abstention on unanswerable queries and find that abstention rate is increased by up to 56%. We apply our approach to the MATH and AIME reasoning tasks and find an improvement over self-verification and majority vote baselines by up to 6 points of accuracy. We show that ConGrs provide a flexible method for capturing variation in LM responses and using the epistemic signals provided by response variation to synthesize more effective responses.
@article{ghosh2025samplealignsynthesizegraphbased,title={Sample, Align, Synthesize: Graph-Based Response Synthesis with ConGrs},author={Ghosh, Sayan and Warraich, Shahzaib Saqib and Tarsadiya, Dhruv and Yauney, Gregory and Swayamdipta, Swabha},journal={arXiv preprint arXiv:2510.03527},year={2025},abbr={arXiv},url={https://arxiv.org/abs/2510.03527},pdf={https://arxiv.org/pdf/2510.03527.pdf},}
CoRL
Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection
Evaluating learned robot control policies to determine their physical task-level capabilities costs experimenter time and effort. The growing number of policies and tasks exacerbates this issue. It is impractical to test every policy on every task multiple times; each trial requires a manual environment reset, and each task change involves re-arranging objects or even changing robots. Naively selecting a random subset of tasks and policies to evaluate is a high-cost solution with unreliable, incomplete results. In this work, we formulate robot evaluation as an active testing problem. We propose to model the distribution of robot performance across all tasks and policies as we sequentially execute experiments. Tasks often share similarities that can reveal potential relationships in policy behavior, and we show that natural language is a useful prior in modeling these relationships between tasks. We then leverage this formulation to reduce the experimenter effort by using a cost-aware expected information gain heuristic to efficiently select informative trials. Our framework accommodates both continuous and discrete performance outcomes. We conduct experiments on existing evaluation data from real robots and simulations. By prioritizing informative trials, our framework reduces the cost of calculating evaluation metrics for robot policies across many tasks.
@article{anwar2025efficient,title={Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection},author={Anwar, Abrar and Gupta, Rohan and Merchant, Zain and Ghosh, Sayan and Neiswanger, Willie and Thomason, Jesse},journal={arXiv preprint arXiv:2502.09829},url={https://arxiv.org/abs/2502.09829},abbr={CoRL},pdf={https://arxiv.org/pdf/2502.09829.pdf},year={2025},}
2024
EMNLP
Compare without Despair: Reliable Preference Evaluation with Generation Separability
Sayan Ghosh, Tejas Srinivasan, and
Swabha Swayamdipta
In Findings of the Association for Computational Linguistics: EMNLP 2024 Nov 2024
Human evaluation of generated language through pairwise preference judgments is pervasive. However, under common scenarios, such as when generations from a model pair are very similar, or when stochastic decoding results in large variations in generations, preference ratings become inconsistent. We address these challenges by introducing a meta-evaluation measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation. For a candidate test instance, separability samples multiple generations from a pair of models, and measures how distinguishable the two sets of generations are. Our experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters. Further, the distribution of separability allows insights into which test benchmarks are more valuable for comparing models. Finally, we incorporate separability into ELO ratings, accounting for how suitable each test instance might be for reliably ranking LLMs. Overall, separability has implications for consistent, efficient and robust preference evaluation of LLMs with both human- and auto-raters.
@inproceedings{ghosh-etal-2024-compare,title={Compare without Despair: Reliable Preference Evaluation with Generation Separability},author={Ghosh, Sayan and Srinivasan, Tejas and Swayamdipta, Swabha},editor={Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung},booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},month=nov,year={2024},abbr={EMNLP},address={Miami, Florida, USA},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2024.findings-emnlp.747},pdf={https://aclanthology.org/2024.findings-emnlp.747.pdf},pages={12787--12805},}
2023
ICWSM
Bridging Nations: Quantifying the Role of Multilinguals in Communication on Social Media
Julia Mendelsohn,
Sayan Ghosh, David Jurgens, and
Ceren Budak
In Proceedings of the International AAAI Conference on Web and Social Media Nov 2023
Social media enables the rapid spread of many kinds of information, from pop culture memes to social movements. However, little is known about how information crosses linguistic boundaries. We apply causal inference techniques on the European Twitter network to quantify the structural role and communication influence of multilingual users in cross-lingual information exchange. Overall, multilinguals play an essential role; posting in multiple languages increases betweenness centrality by 13%, and having a multilingual network neighbor increases monolinguals’ odds of sharing domains and hashtags from another language 16-fold and 4-fold, respectively. We further show that multilinguals have a greater impact on diffusing information that is less accessible to their monolingual compatriots, such as information from far-away countries and content about regional politics, nascent social movements, and job opportunities. By highlighting information exchange across borders, this work sheds light on a crucial component of how information and ideas spread around the world.
@inproceedings{mendelsohn2023bridging,title={Bridging Nations: Quantifying the Role of Multilinguals in Communication on Social Media},author={Mendelsohn, Julia and Ghosh, Sayan and Jurgens, David and Budak, Ceren},booktitle={Proceedings of the International AAAI Conference on Web and Social Media},volume={17},pdf={https://ojs.aaai.org/index.php/ICWSM/article/view/22174},pages={626--637},publisher={International AAAI Conference on Web and Social Media},year={2023},abbr={ICWSM},}
2022
ACL
Learning to Mediate Disparities Towards Pragmatic Communication
Yuwei Bao,
Sayan Ghosh, and
Joyce Chai
In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) May 2022
Human communication is a collaborative process. Speakers, on top of conveying their own intent, adjust the content and language expressions by taking the listeners into account, including their knowledge background, personalities, and physical capabilities. Towards building AI agents with similar abilities in language communication, we propose a novel rational reasoning framework, Pragmatic Rational Speaker (PRS), where the speaker attempts to learn the speaker-listener disparity and adjust the speech accordingly, by adding a lightweight disparity adjustment layer into working memory on top of the speaker’s long-term memory system. By fixing the long-term memory, the PRS only needs to update its working memory to learn and adapt to different types of listeners. To validate our framework, we create a dataset that simulates different types of speaker-listener disparities in the context of referential games. Our empirical results demonstrate that the PRS is able to shift its output towards the language that listeners are able to understand, significantly improve the collaborative task outcome, and learn the disparity more efficiently than joint training.
@inproceedings{bao-etal-2022-learning,title={Learning to Mediate Disparities Towards Pragmatic Communication},abbr={ACL},author={Bao, Yuwei and Ghosh, Sayan and Chai, Joyce},booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},month=may,year={2022},address={Dublin, Ireland},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2022.acl-long.202},doi={10.18653/v1/2022.acl-long.202},pages={2829--2842},pdf={https://aclanthology.org/2022.acl-long.202.pdf},}
2021
W-NUT
Detecting Cross-Geographic Biases in Toxicity Modeling on Social Media
Sayan Ghosh, Dylan Baker, David Jurgens, and
Vinodkumar Prabhakaran
In Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021) Nov 2021
Online social media platforms increasingly rely on Natural Language Processing (NLP) techniques to detect abusive content at scale in order to mitigate the harms it causes to their users. However, these techniques suffer from various sampling and association biases present in training data, often resulting in sub-par performance on content relevant to marginalized groups, potentially furthering disproportionate harms towards them. Studies on such biases so far have focused on only a handful of axes of disparities and subgroups that have annotations/lexicons available. Consequently, biases concerning non-Western contexts are largely ignored in the literature. In this paper, we introduce a weakly supervised method to robustly detect lexical biases in broader geo-cultural contexts. Through a case study on a publicly available toxicity detection model, we demonstrate that our method identifies salient groups of cross-geographic errors, and, in a follow-up, demonstrate that these groupings reflect human judgments of offensive and inoffensive language in those geographic contexts. We also conduct an analysis of a model trained on a dataset with ground-truth labels to better understand these biases, and present preliminary mitigation experiments.
@inproceedings{ghosh-etal-2021-detecting,title={Detecting Cross-Geographic Biases in Toxicity Modeling on Social Media},abbr={W-NUT},author={Ghosh, Sayan and Baker, Dylan and Jurgens, David and Prabhakaran, Vinodkumar},booktitle={Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021)},month=nov,year={2021},address={Online},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2021.wnut-1.35},pages={313--328},pdf={https://aclanthology.org/2021.wnut-1.35.pdf},}
2019
ACL
Wetin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior in Nigerian Online Discussions
Innocent Ndubuisi-Obi*, Sayan Ghosh*, and
David Jurgens
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics Jul 2019
Multilingual individuals code switch between languages as a part of a complex communication process. However, most computational studies have examined only one or a handful of contextual factors predictive of switching. Here, we examine Naija-English code switching in a rich contextual environment to understand the social and topical factors eliciting a switch. We introduce a new corpus of 330K articles and accompanying 389K comments labeled for code switching behavior. In modeling whether a comment will switch, we show that topic-driven variation, tribal affiliation, emotional valence, and audience design all play complementary roles in behavior.
@inproceedings{ndubuisi-obi-etal-2019-wetin,title={Wetin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior in Nigerian Online Discussions},author={Ndubuisi-Obi*, Innocent and Ghosh*, Sayan and Jurgens, David},booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},abbr={ACL},month=jul,year={2019},address={Florence, Italy},publisher={Association for Computational Linguistics},url={https://aclanthology.org/P19-1625},doi={10.18653/v1/P19-1625},pages={6204--6214},pdf={https://aclanthology.org/P19-1625.pdf},}