Email categorization performance across 50 synthetic email threads
Models tested: 8
Best accuracy: 96% (Gemini 3.1 Flash Lite Preview)
Avg accuracy: 91% across 8 models
Best value ≥ 90%: $0.0023 (Deepseek V4 Flash)
Percentage of correctly classified threads, sorted by accuracy
Accuracy vs total cost for 50 threads. Top-left = highest value.
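The page does not say how the "best value" figure is derived. One plausible reading is the lowest total cost among models at or above a 90% accuracy floor. The sketch below follows that reading. The `Result` type, the `best_value` function, and the three sample rows are hypothetical placeholders for illustration, not benchmark data.

```python
from typing import List, NamedTuple, Optional


class Result(NamedTuple):
    model: str
    accuracy: float    # fraction of threads classified correctly, 0..1
    total_cost: float  # USD for the whole run


def best_value(results: List[Result], min_accuracy: float = 0.90) -> Optional[Result]:
    """Return the cheapest result whose accuracy meets the floor, or None.

    Assumption: "value" means lowest total cost above the accuracy floor.
    Ties on cost are broken in favour of the higher accuracy.
    """
    eligible = [r for r in results if r.accuracy >= min_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda r: (r.total_cost, -r.accuracy))


# Placeholder figures for illustration only; not taken from the benchmark.
runs = [
    Result("model-a", 0.96, 0.0100),
    Result("model-b", 0.92, 0.0020),
    Result("model-c", 0.88, 0.0005),  # cheapest, but below the 90% floor
]

print(best_value(runs).model)  # model-b
```

On an accuracy-vs-cost scatter plot this corresponds to the leftmost point that sits above the 90% line.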
F1 score per email type (0–100). Green = strong, red = weak.
| Model | Needs Action | Waiting On | Review | FYI | Done | Newsletter | Ignore | Sent |
|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Flash Lite Preview | 94 | 100 | 91 | 91 | 100 | 92 | 100 | 100 |
| Deepseek V4 Flash | 94 | 100 | 91 | 91 | 100 | 92 | 100 | 100 |
| GPT 5.4 | 89 | 100 | 91 | 91 | 100 | 100 | 100 | 100 |
| Grok 4.20 | 100 | 100 | 83 | 83 | 100 | 100 | 100 | 100 |
| Gemma4 26B | 94 | 86 | 83 | 91 | 100 | 92 | 100 | 92 |
| GPT 5.4 Nano | 94 | 94 | 80 | 91 | 100 | 80 | 89 | 91 |
| GPT OSS 20B | 80 | 100 | 77 | 91 | 75 | 92 | 57 | 100 |
| Nemotron 3 Nano 4B | 70 | 86 | 83 | 67 | 91 | 91 | 0 | 100 |
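
For readers who want to reproduce scores like those in the table, here is a minimal sketch of per-category F1 on a 0 to 100 scale, assuming one gold label and one predicted label per thread. The function name `f1_per_label` and the four sample threads are hypothetical; the benchmark's own scoring code is not shown on the page.

```python
from collections import Counter
from typing import Dict, Sequence


def f1_per_label(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, int]:
    """Per-label F1 scaled to 0..100, using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = round(100 * 2 * tp[label] / denom) if denom else 0
    return scores


# Toy labels for illustration only; not taken from the benchmark.
truth = ["Needs Action", "FYI", "Ignore", "Done"]
preds = ["Needs Action", "FYI", "FYI", "Done"]

scores = f1_per_label(truth, preds)
print(scores["FYI"], scores["Ignore"])  # 67 0
```

A category that a model never predicts correctly scores 0 under this formula, which is one way a cell such as the 0 under Ignore can arise.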