InboxBench

Email categorization performance across 50 synthetic email threads

Models tested: 8

Best accuracy: 96% (Gemini 3.1 Flash Lite Preview)

Avg accuracy: 91% across 8 models

Best value at ≥ 90% accuracy: $0.0023 for 50 threads (Deepseek V4 Flash)

Accuracy by Model

Percentage of correctly classified threads, sorted by accuracy

Legend: Local (Ollama), Cloud (OpenRouter)
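As a minimal sketch of the accuracy metric described above (the percentage of correctly classified threads), the following assumes each thread has exactly one expected category and one predicted category. The function name and the sample labels are hypothetical and are not taken from the benchmark itself.

```python
def accuracy_percent(predicted, expected):
    """Percentage of threads whose predicted category equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("need at least one thread")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return 100.0 * correct / len(expected)


# Hypothetical run: 50 threads, 48 of them classified correctly.
expected = ["FYI"] * 50
predicted = ["FYI"] * 48 + ["Review"] * 2
print(accuracy_percent(predicted, expected))  # 96.0
```

With 50 threads, each misclassified thread costs 2 percentage points, so a score of 96% corresponds to 48 correct threads.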

Accuracy vs total cost for 50 threads. Top-left = highest value (high accuracy at low cost).


Model	Accuracy	Total cost (50 threads)
Gemini 3.1 Flash Lite Preview	96%	$0.0069
Deepseek V4 Flash	96%	$0.0023
GPT 5.4	96%	$0.0656
Grok 4.20	96%	$0.0524
Gemma4 26B	92%	local
GPT 5.4 Nano	90%	$0.0053
GPT OSS 20B	86%	local
Nemotron 3 Nano 4B	78%	local
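The summary figures at the top of the page can be reproduced from the per-model results listed above. The sketch below makes two assumptions that the page does not state: that "best value" means the lowest total cost among models with at least 90% accuracy, and that locally run models are excluded because no cost is reported for them. The `RESULTS` structure and variable names are illustrative only.

```python
# Accuracy (%) and total cost (USD for 50 threads) as listed above.
# cost=None marks models run locally, for which no cost is reported.
RESULTS = [
    ("Gemini 3.1 Flash Lite Preview", 96, 0.0069),
    ("Deepseek V4 Flash", 96, 0.0023),
    ("GPT 5.4", 96, 0.0656),
    ("Grok 4.20", 96, 0.0524),
    ("Gemma4 26B", 92, None),
    ("GPT 5.4 Nano", 90, 0.0053),
    ("GPT OSS 20B", 86, None),
    ("Nemotron 3 Nano 4B", 78, None),
]

best_accuracy = max(acc for _, acc, _ in RESULTS)
avg_accuracy = sum(acc for _, acc, _ in RESULTS) / len(RESULTS)

# Assumption: "best value" = cheapest priced model with accuracy >= 90%.
eligible = [row for row in RESULTS if row[2] is not None and row[1] >= 90]
best_value = min(eligible, key=lambda row: row[2])

# Cost of a single thread for the best-value model (total cost / 50 threads).
cost_per_thread = best_value[2] / 50

print(best_accuracy)        # 96
print(round(avg_accuracy))  # 91
print(best_value[0])        # Deepseek V4 Flash
```

Under these assumptions the output matches the headline cards: 96% best accuracy, 91% average across 8 models, and Deepseek V4 Flash as the best-value model.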

F1 score per email type (0–100). Green = strong, red = weak.

Model	Needs Action	Waiting On	Review	FYI	Done	Newsletter	Ignore	Sent
Gemini 3.1 Flash Lite Preview	94	100	91	91	100	92	100	100
Deepseek V4 Flash	94	100	91	91	100	92	100	100
GPT 5.4	89	100	91	91	100	100	100	100
Grok 4.20	100	100	83	83	100	100	100	100
Gemma4 26B	94	86	83	91	100	92	100	92
GPT 5.4 Nano	94	94	80	91	100	80	89	91
GPT OSS 20B	80	100	77	91	75	92	57	100
Nemotron 3 Nano 4B	70	86	83	67	91	91	0	100
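For readers unfamiliar with the metric, a per-category F1 score on the 0-100 scale used in the table can be sketched as follows. This uses the standard definition F1 = 2·TP / (2·TP + FP + FN); whether the benchmark computes it exactly this way is an assumption, and the function name and sample labels are hypothetical.

```python
def f1_per_category(predicted, expected):
    """Return {category: F1 on a 0-100 scale} for every category in either list."""
    scores = {}
    for cat in set(predicted) | set(expected):
        pairs = list(zip(predicted, expected))
        tp = sum(p == cat and e == cat for p, e in pairs)  # correctly predicted as cat
        fp = sum(p == cat and e != cat for p, e in pairs)  # wrongly predicted as cat
        fn = sum(p != cat and e == cat for p, e in pairs)  # cat threads that were missed
        denom = 2 * tp + fp + fn
        scores[cat] = 100.0 * 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical labels for six threads.
expected = ["Needs Action", "Needs Action", "FYI", "FYI", "Ignore", "Ignore"]
predicted = ["Needs Action", "FYI", "FYI", "FYI", "FYI", "Needs Action"]

scores = f1_per_category(predicted, expected)
print(round(scores["Needs Action"]))  # 50
print(round(scores["FYI"]))           # 67
print(round(scores["Ignore"]))        # 0
```

The last line illustrates how a score of 0 arises: if a model never correctly assigns a category to any thread, there are no true positives and the F1 for that category is 0, regardless of how it performs elsewhere.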