Automated extraction of functional biomarkers of verbal and ambulatory ability from multi-institutional clinical notes using large language models

Table 5 BGR dataset GPT performance

Extraction Task	Ground Truth Labels	Experiment	GPT-3.5			GPT-4t			GPT-4o
Extraction Task	Ground Truth Labels	Experiment	Avg Precision	Avg Recall	Weighted Avg F1	Avg Precision	Avg Recall	Weighted Avg F1	Avg Precision	Avg Recall	Weighted Avg F1
Verbal Ability	RNAP (CARS-2 or Telehealth Screener)	MCP-All	.80	.89	.88	.88	.84	.91	.87	.88	.92
		BCP-All	.78	.83	.87	.84	.84	.91	.82	.84	.89
		MCP-1.5Y	.71	.76	.82	.80	.79	.86	.82	.84	.89
Ambulatory Ability	RNAP (GMFCS or M-CHAT-R)	MCP-All	.60	.69	.75	.89	.91	.95	.75	.87	.89
		BCP-All	.52	.54	.45	.67	.81	.81	.63	.76	.74
		MCP-1.5Y	.63	.74	.78	.84	.86	.94	.74	.83	.88

The precision, recall, and weighted-average F1 scores are reported for three GPT versions (GPT-3.5, GPT-4t, and GPT-4o) and for the verbal and ambulatory extraction tasks. Results are also provided for each experiment: the multi-class prediction experiment (MCP-All), the binary class prediction experiment (BCP-All), and the multi-class prediction experiment on notes written within 1.5 years of patient enrollment in the BGR (MCP-1.5Y). The top scoring GPT model for each experiment in the table is bolded, and the top scoring experiment and GPT model combination for each extraction task is underlined.
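In several rows the weighted-average F1 is higher than both the average precision and the average recall. One possible reading, which the table does not confirm, is that "Avg" denotes an unweighted (macro) mean over classes while the F1 is weighted by class support. The sketch below illustrates that reading on toy data; the function names, the labels `class_a` and `class_b`, and the predictions are hypothetical and are not taken from the study.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # support of each class in the ground truth
    pred_count = Counter(y_pred)   # how often each class was predicted
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    scores = {}
    for label in labels:
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, true_count[label])
    return scores


def summarize(y_true, y_pred):
    """Macro precision, macro recall, and support-weighted F1 (assumed averaging)."""
    scores = per_class_scores(y_true, y_pred)
    n_labels = len(scores)
    total = sum(s[3] for s in scores.values())
    macro_precision = sum(s[0] for s in scores.values()) / n_labels
    macro_recall = sum(s[1] for s in scores.values()) / n_labels
    weighted_f1 = sum(s[2] * s[3] for s in scores.values()) / total
    return macro_precision, macro_recall, weighted_f1


# Hypothetical imbalanced toy data: 8 notes of class_a, 2 of class_b.
y_true = ["class_a"] * 8 + ["class_b"] * 2
y_pred = ["class_a"] * 7 + ["class_b"] + ["class_b", "class_a"]

macro_p, macro_r, weighted_f1 = summarize(y_true, y_pred)
print(f"macro precision={macro_p:.4f} macro recall={macro_r:.4f} "
      f"weighted F1={weighted_f1:.4f}")
```

Under these assumptions the majority class dominates the weighted F1 while the minority class pulls the macro averages down, so the F1 can exceed both averages, as it does in parts of the table.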

ISSN: 1866-1955