Table 5 BGR dataset GPT performance

From: Automated extraction of functional biomarkers of verbal and ambulatory ability from multi-institutional clinical notes using large language models

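| Extraction Task | Ground Truth Labels | Experiment | GPT-3.5 Avg Precision | GPT-3.5 Avg Recall | GPT-3.5 Weighted Avg F1 | GPT-4t Avg Precision | GPT-4t Avg Recall | GPT-4t Weighted Avg F1 | GPT-4o Avg Precision | GPT-4o Avg Recall | GPT-4o Weighted Avg F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Verbal Ability | RNAP (CARS-2 or Telehealth Screener) | MCP-All | .80 | .89 | .88 | .88 | .84 | .91 | .87 | .88 | .92 |
| Verbal Ability | RNAP (CARS-2 or Telehealth Screener) | BCP-All | .78 | .83 | .87 | .84 | .84 | .91 | .82 | .84 | .89 |
| Verbal Ability | RNAP (CARS-2 or Telehealth Screener) | MCP-1.5Y | .71 | .76 | .82 | .80 | .79 | .86 | .82 | .84 | .89 |
| Ambulatory Ability | RNAP (GMFCS or M-CHAT-R) | MCP-All | .60 | .69 | .75 | .89 | .91 | .95 | .75 | .87 | .89 |
| Ambulatory Ability | RNAP (GMFCS or M-CHAT-R) | BCP-All | .52 | .54 | .45 | .67 | .81 | .81 | .63 | .76 | .74 |
| Ambulatory Ability | RNAP (GMFCS or M-CHAT-R) | MCP-1.5Y | .63 | .74 | .78 | .84 | .86 | .94 | .74 | .83 | .88 |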

  1. Precision, recall, and weighted-average F1 scores are reported for three GPT versions (GPT-3.5, GPT-4t, and GPT-4o) on the verbal and ambulatory extraction tasks. Results are given for each experiment: the multi-class prediction experiment (MCP-All), the binary-class prediction experiment (BCP-All), and the multi-class prediction experiment restricted to notes written within 1.5 years of patient enrollment in the BGR (MCP-1.5Y). The top-scoring GPT model for each experiment is bolded, and the top-scoring combination of experiment and GPT model for each extraction task is underlined.
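
For readers unfamiliar with the three metrics in the table, the following is a minimal sketch of how such scores are commonly computed from per-class counts. It assumes that "Avg Precision" and "Avg Recall" are unweighted (macro) averages over classes and that "Weighted Avg F1" weights each class's F1 by its number of ground-truth examples; the source does not state the exact averaging scheme. The function names, the class labels, and the toy data are hypothetical and are not taken from the study.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Support is the number of ground-truth examples of this class.
        scores[label] = (precision, recall, f1, tp + fn)
    return scores


def summarize(y_true, y_pred):
    """Macro-average precision and recall, and support-weighted F1.

    Assumption: this averaging scheme is illustrative only; the table's
    source does not specify how its averages were computed.
    """
    scores = per_class_scores(y_true, y_pred)
    n_classes = len(scores)
    total_support = sum(s[3] for s in scores.values())
    return {
        "avg_precision": sum(s[0] for s in scores.values()) / n_classes,
        "avg_recall": sum(s[1] for s in scores.values()) / n_classes,
        "weighted_f1": sum(s[2] * s[3] for s in scores.values()) / total_support,
    }


# Hypothetical toy labels, not data from the study.
y_true = ["verbal", "verbal", "verbal", "nonverbal"]
y_pred = ["verbal", "verbal", "nonverbal", "nonverbal"]
print(summarize(y_true, y_pred))
```

Because the weighted F1 is driven by the larger classes while macro-averaged precision and recall treat all classes equally, the weighted F1 in a row need not lie between that row's precision and recall.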