Helpful or Harmful? Benchmarking Large Language Models as Therapy Tools Across Empathy, Specificity, and Safety

Hannah Zhao; Yifan Zhang

doi:10.69987/JACS.2024.40708

Authors

Hannah Zhao, Applied Analytics, Columbia University, NY, USA
Yifan Zhang, Department of Counseling and Clinical Psychology, Teachers College, Columbia University

Keywords:

large language models, mental health, empathy, response specificity, safety, suicide risk classification, counseling AI, retrieval benchmarking

Abstract

Recent LLM-based mental-health assistants promise around-the-clock support, but they combine two opposing properties: they can sound empathic at low cost, and they can deliver unsafe or generic advice when risk is high. This paper presents a fully reproducible benchmark that decomposes therapy-tool behavior into three measurable abilities: empathy, response specificity, and safety boundary enforcement. We ran the full evaluation on EmpatheticDialogues, CounselChat, and a four-class mental-health text classification corpus. From the raw EmpatheticDialogues turns we derived 40,254 train, 5,738 validation, and 5,259 test listener-response examples; CounselChat provided 1,839/173/117 train/validation/test counselor answers; the safety corpus supplied 49,612 train posts and 992 class-balanced test posts. We evaluated four empathy retrievers, four counseling-answer retrievers, five risk classifiers, and three end-to-end therapy agents. Dialogue-aware TF-IDF retrieval achieved the best reference overlap on EmpatheticDialogues (BLEU-2 3.33, ROUGE-L 0.1022), improving ROUGE-L by 17.9% over an emotion-matched random baseline and by 5.2% over prompt-only retrieval. On CounselChat, global question retrieval achieved the best reference overlap (BLEU-2 11.69, ROUGE-L 0.1461), while topic-plus-upvote reranking produced the highest specificity score (4.92). For safety, a linear hinge-loss classifier reached 83.67% accuracy and 0.8329 macro-F1 on the four-class task. Used as a gate inside a hybrid therapy agent, this classifier raised referral coverage on suicidal posts from 0% to 95.56%, but it also raised the false-referral rate on non-suicidal posts from 0.27% to 6.85%. These findings indicate that LLM-based assistants are helpful as bounded first-line support tools and harmful when empathy, specificity, and crisis escalation are not jointly optimized.
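The two gate metrics quoted in the abstract, referral coverage on suicidal posts and the false-referral rate on non-suicidal posts, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `referral_rates`, the label strings, and the toy data are assumptions chosen for illustration, and the abstract does not state how the paper computes these rates beyond their names.

```python
from typing import Sequence, Tuple


def referral_rates(
    labels: Sequence[str],
    referred: Sequence[bool],
    crisis_label: str = "suicidal",  # assumed label name, not taken from the paper
) -> Tuple[float, float]:
    """Return (coverage, false_referral_rate) for a referral gate.

    coverage: fraction of crisis-labelled posts that were referred.
    false_referral_rate: fraction of all other posts that were referred.
    Either value is 0.0 when its group is empty.
    """
    if len(labels) != len(referred):
        raise ValueError("labels and referred must have the same length")

    # Split the referral decisions by whether the gold label is the crisis class.
    crisis = [r for y, r in zip(labels, referred) if y == crisis_label]
    other = [r for y, r in zip(labels, referred) if y != crisis_label]

    coverage = sum(crisis) / len(crisis) if crisis else 0.0
    false_rate = sum(other) / len(other) if other else 0.0
    return coverage, false_rate


# Toy data (hypothetical): two crisis posts, four non-crisis posts.
labels = ["suicidal", "suicidal", "depression", "anxiety", "normal", "normal"]
referred = [True, False, False, True, False, False]

coverage, false_rate = referral_rates(labels, referred)
print(f"coverage={coverage:.2%}, false referrals={false_rate:.2%}")
```

Reporting both rates side by side makes the trade-off visible: a gate that refers more aggressively raises coverage, but it also raises the share of non-crisis users who are sent to a referral they did not need.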