Prompt engineering and diagnostic accuracy of multimodal large language models in thyroid fine‑needle aspiration cytology

Title	Prompt engineering and diagnostic accuracy of multimodal large language models in thyroid fine‑needle aspiration cytology
Authors	Bibhas Saha Dalal¹,*, Kaushik Mukhopadhyay², Dwaipayan Roy³, Souvik Bhattacharya¹, Indranil Chakrabarti¹ & Santosh Kumar Mondal¹
Affiliation	¹Department of Pathology, All India Institute of Medical Sciences (AIIMS), Kalyani, West Bengal, India; ²Department of Pharmacology, All India Institute of Medical Sciences (AIIMS), Kalyani, West Bengal, India; ³Department of Computational and Data Sciences, Indian Institute of Science Education and Research (IISER), Kolkata, West Bengal, India; *Corresponding author
Email	Bibhas Saha Dalal - E-mail: bibhas.patho@aiimskalyani.edu.in; Kaushik Mukhopadhyay - E-mail: kaushik.pharm@aiimskalyani.edu.in; Dwaipayan Roy - E-mail: dwaipayan.roy@iiserkol.ac.in; Souvik Bhattacharya - E-mail: souvik.patho_pgt23@aiimskalyani.edu.in; Indranil Chakrabarti - E-mail: indranil.patho@aiimskalyani.edu.in; Santosh Kumar Mondal - E-mail: santosh.path@aiimskalyani.edu.in
Article Type	Research Article
Date	Received June 1, 2025; Revised June 30, 2025; Accepted June 30, 2025; Published June 30, 2025
Abstract	The role of large language models (LLMs) in fine-needle aspiration cytology (FNAC) image analysis remains uncertain. We evaluated two LLMs, ChatGPT-4o (OpenAI) and Claude 3.5 Sonnet (Anthropic), on 63 thyroid FNAC cases, each represented by eight microscopic images (Pap and MGG stains, 10× and 40×), using generic and structured prompts. Structured prompts improved Bethesda concordance and near-match rates, but inter-rater agreement remained poor (κ ≤ 0.09). Specificity reached 100% with structured prompts, but sensitivity dropped to ≤11.8% and misclassification persisted. LLMs show potential, but domain-specific training and validation are necessary before clinical use.
Keywords	Fine-needle aspiration cytology, large language models, thyroid nodule, artificial intelligence, prompt engineering, diagnostic accuracy
Citation	Dalal et al. Bioinformation 21(6): 1317-1323 (2025)
Edited by	P Kangueane
ISSN	0973-2063
Publisher	Biomedical Informatics
License	This is an Open Access article which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. This is distributed under the terms of the Creative Commons Attribution License.