| Extraction Task | Ground Truth Labels | Experiment | GPT-3.5 Avg Precision | GPT-3.5 Avg Recall | GPT-3.5 Weighted Avg F1 | GPT-4t Avg Precision | GPT-4t Avg Recall | GPT-4t Weighted Avg F1 | GPT-4o Avg Precision | GPT-4o Avg Recall | GPT-4o Weighted Avg F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Verbal Ability | VSS | MCP-1.5Y | .69 | .63 | .51 | .72 | .71 | .66 | .74 | .73 | .68 |
| Verbal Ability | CFCS | MCP-1.5Y | .69 | .63 | .52 | .68 | .67 | .63 | .72 | .71 | .66 |
| Ambulatory Ability | GMFCS | MCP-1.5Y | .73 | .75 | .69 | .86 | .90 | .88 | .88 | .91 | .90 |
- The precision, recall, and weighted-average F1 scores are reported across three GPT versions (GPT-3.5, GPT-4t, and GPT-4o) for the verbal and ambulatory ability extraction tasks. All extractions were performed using multi-class prediction on notes written within 1.5 years of patient enrollment in the registry. For the verbal ability extraction task, results were evaluated using ground truth labels from both the VSS and CFCS assessments. The top-scoring GPT model for each ground truth label set is bolded, and the top-scoring label and GPT model combination for each extraction task is underlined.
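The paper does not specify how these metrics were computed, but the weighted-average scores reported above can be reproduced with a standard multi-class evaluation routine. The sketch below is a minimal illustration using scikit-learn, assuming weighted averaging for all three metrics and hypothetical label lists (the actual GMFCS/CFCS/VSS levels and predictions are not shown in the source).

```python
# Illustrative sketch only: the paper does not state its evaluation code or
# averaging choices; weighted averaging and the example labels are assumptions.
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical ground truth (e.g., GMFCS levels I-V) and GPT-extracted predictions
y_true = ["I", "II", "III", "V", "II", "IV"]
y_pred = ["I", "II", "II", "V", "II", "III"]

# Weighted averaging weights each class by its support, accounting for
# imbalance across ability levels in a multi-class prediction setting.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"Avg Precision: {precision:.2f}  Avg Recall: {recall:.2f}  Weighted Avg F1: {f1:.2f}")
```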