
Table 6 CP dataset GPT performance

From: Automated extraction of functional biomarkers of verbal and ambulatory ability from multi-institutional clinical notes using large language models

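| Extraction Task | Ground Truth Labels | Experiment | GPT-3.5 Avg Precision | GPT-3.5 Avg Recall | GPT-3.5 Weighted Avg F1 | GPT-4t Avg Precision | GPT-4t Avg Recall | GPT-4t Weighted Avg F1 | GPT-4o Avg Precision | GPT-4o Avg Recall | GPT-4o Weighted Avg F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Verbal Ability | VSS | MCP-1.5Y | .69 | .63 | .51 | .72 | .71 | .66 | .74 | .73 | .68 |
| Verbal Ability | CFCS | MCP-1.5Y | .69 | .63 | .52 | .68 | .67 | .63 | .72 | .71 | .66 |
| Ambulatory Ability | GMFCS | MCP-1.5Y | .73 | .75 | .69 | .86 | 0.9 | .88 | .88 | .91 | .90 |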

  1. Precision, recall, and weighted-average F1 scores are reported for three GPT versions (GPT-3.5, GPT-4t, and GPT-4o) on the verbal and ambulatory extraction tasks. All extractions were performed using multi-class prediction on notes written within 1.5 years of patient enrollment in the registry. For the verbal ability extraction, results were evaluated against ground truth labels from both the VSS and CFCS assessments. The top-scoring GPT model ground truth label set is bolded, and the top-scoring label and GPT model combination for each extraction task is underlined.
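
As a rough illustration of how the three column types in this table can be computed for a multi-class prediction, the sketch below derives per-class precision, recall, and F1, then averages them. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the source does not say how "Avg Precision" and "Avg Recall" were averaged, so an unweighted (macro) mean is assumed here, while "Weighted Avg F1" is taken to mean the support-weighted mean of per-class F1. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # support: how often each label is the truth
    pred_count = Counter(y_pred)   # how often each label was predicted
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    scores = {}
    for label in labels:
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / true_count[label] if true_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, true_count[label])
    return scores


def summarize(y_true, y_pred):
    """Return (macro precision, macro recall, support-weighted F1).

    Assumption: macro averaging for precision and recall; the source table
    only labels these columns "Avg".
    """
    scores = per_class_scores(y_true, y_pred)
    n_classes = len(scores)
    total_support = sum(s[3] for s in scores.values())
    macro_precision = sum(s[0] for s in scores.values()) / n_classes
    macro_recall = sum(s[1] for s in scores.values()) / n_classes
    weighted_f1 = sum(s[2] * s[3] for s in scores.values()) / total_support
    return macro_precision, macro_recall, weighted_f1


if __name__ == "__main__":
    # Hypothetical toy labels, not data from the study.
    y_true = ["I", "I", "II", "II", "II", "III"]
    y_pred = ["I", "II", "II", "II", "III", "III"]
    precision, recall, f1 = summarize(y_true, y_pred)
    print(f"avg precision={precision:.2f} avg recall={recall:.2f} weighted F1={f1:.2f}")
```

Weighting F1 by class support means frequent classes dominate the score, which is one reason a weighted F1 can sit below both averaged precision and averaged recall when a model performs poorly on a common class.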