InboxBench

Email categorization performance across 50 synthetic email threads

Models tested: 8

Best accuracy: 96% (Gemini 3.1 Flash Lite Preview; Deepseek V4 Flash, GPT 5.4, and Grok 4.20 also reach 96%)

Avg accuracy: 91% across 8 models

Best value ≥ 90%: $0.0023 total for 50 threads (Deepseek V4 Flash)
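
The summary figures can be re-derived from the per-model numbers listed under Performance vs Cost. The sketch below is illustrative only: the names `RESULTS` and `summarize` are hypothetical, and it assumes that "best value" means the lowest total cost among models at or above the accuracy threshold, with local models skipped because the page reports no dollar cost for them.

```python
# (accuracy %, total USD cost for 50 threads); None = local model, no dollar cost reported
RESULTS = {
    "Gemini 3.1 Flash Lite Preview": (96, 0.0069),
    "Deepseek V4 Flash": (96, 0.0023),
    "GPT 5.4": (96, 0.0656),
    "Grok 4.20": (96, 0.0524),
    "Gemma4 26B": (92, None),
    "GPT 5.4 Nano": (90, 0.0053),
    "GPT OSS 20B": (86, None),
    "Nemotron 3 Nano 4B": (78, None),
}


def summarize(results, threshold=90):
    """Recompute the headline stats from per-model (accuracy, cost) pairs."""
    accuracies = [acc for acc, _ in results.values()]
    best = max(accuracies)
    # Assumption: only models with a reported dollar cost compete for "best value".
    priced = {
        model: cost
        for model, (acc, cost) in results.items()
        if cost is not None and acc >= threshold
    }
    cheapest = min(priced, key=priced.get)
    return {
        "models_tested": len(results),
        "best_accuracy": best,
        "leaders": [m for m, (acc, _) in results.items() if acc == best],
        "avg_accuracy": sum(accuracies) / len(accuracies),
        "best_value": (cheapest, priced[cheapest]),
    }


summary = summarize(RESULTS)
print(summary["models_tested"], summary["best_accuracy"], summary["best_value"])
print(f"{summary['avg_accuracy']:.2f}")  # 91.25, shown on the page as 91%
```

Under these assumptions the result matches the cards: four models tie at 96%, the mean is 91.25%, and Deepseek V4 Flash is the cheapest priced model at or above 90%.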

Accuracy by Model

Percentage of correctly classified threads, sorted by accuracy

Legend: Local (Ollama), Cloud (OpenRouter)
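
Because the benchmark has 50 threads, each thread is worth 2 percentage points of accuracy. A minimal sketch of converting a reported percentage back to a thread count follows; `TOTAL_THREADS` and `correct_threads` are illustrative names, not part of InboxBench.

```python
TOTAL_THREADS = 50  # benchmark size stated in the page header


def correct_threads(accuracy_pct, total=TOTAL_THREADS):
    """Convert a reported accuracy percentage to a count of correctly classified threads."""
    return round(accuracy_pct * total / 100)


# Accuracy values as reported on the page.
for accuracy in (96, 92, 90, 86, 78):
    hits = correct_threads(accuracy)
    print(f"{accuracy}% -> {hits} correct, {TOTAL_THREADS - hits} misclassified")
```

Read this way, the gap between a 96% model and a 92% model is two threads, so small differences near the top of the ranking may not be meaningful.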

Performance vs Cost

Accuracy vs. total cost for 50 threads. Top-left (high accuracy, low cost) = highest value.

Legend: Local (Ollama), Cloud (OpenRouter)
Gemini 3.1 Flash Lite Preview: 96% · $0.0069
Deepseek V4 Flash: 96% · $0.0023
GPT 5.4: 96% · $0.0656
Grok 4.20: 96% · $0.0524
Gemma4 26B: 92% · local
GPT 5.4 Nano: 90% · $0.0053
GPT OSS 20B: 86% · local
Nemotron 3 Nano 4B: 78% · local
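
One way to make "top-left" precise is a Pareto check: a model is worth considering if no other model is at least as accurate and at least as cheap while being strictly better on one of the two. The sketch below applies that idea to the cloud models only, since the local models have no dollar cost listed. The `Result` type and function names are hypothetical, and this is not necessarily how the dashboard ranks value.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    accuracy: int  # percent of the 50 threads classified correctly
    cost: float    # total USD for 50 threads


# Cloud (OpenRouter) entries as listed on the page.
CLOUD_RESULTS = [
    Result("Gemini 3.1 Flash Lite Preview", 96, 0.0069),
    Result("Deepseek V4 Flash", 96, 0.0023),
    Result("GPT 5.4", 96, 0.0656),
    Result("Grok 4.20", 96, 0.0524),
    Result("GPT 5.4 Nano", 90, 0.0053),
]


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(results):
    """Keep only the results that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


frontier = pareto_frontier(CLOUD_RESULTS)
print([r.model for r in frontier])  # ['Deepseek V4 Flash']
```

With these numbers a single model remains on the frontier, because Deepseek V4 Flash ties for the highest accuracy and has the lowest listed cost.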

Scores by Category

F1 score per email type (0–100). Green = strong, red = weak.

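| Model | Needs Action | Waiting On | Review | FYI | Done | Newsletter | Ignore | Sent |
|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Flash Lite Preview | 94 | 100 | 91 | 91 | 100 | 92 | 100 | 100 |
| Deepseek V4 Flash | 94 | 100 | 91 | 91 | 100 | 92 | 100 | 100 |
| GPT 5.4 | 89 | 100 | 91 | 91 | 100 | 100 | 100 | 100 |
| Grok 4.20 | 100 | 100 | 83 | 83 | 100 | 100 | 100 | 100 |
| Gemma4 26B | 94 | 86 | 83 | 91 | 100 | 92 | 100 | 92 |
| GPT 5.4 Nano | 94 | 94 | 80 | 91 | 100 | 80 | 89 | 91 |
| GPT OSS 20B | 80 | 100 | 77 | 91 | 75 | 92 | 57 | 100 |
| Nemotron 3 Nano 4B | 70 | 86 | 83 | 67 | 91 | 91 | 0 | 100 |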
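
For readers unfamiliar with the metric, per-category F1 is the harmonic mean of precision and recall for one label, here scaled to 0–100. The sketch below shows one common way to compute it. The function name and the toy labels are made up for illustration and are not InboxBench data, and the benchmark's own scoring code may differ (for example in rounding).

```python
from collections import Counter


def f1_per_category(y_true, y_pred):
    """Return {label: F1 scaled to 0-100} for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean of precision and recall.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = round(100 * 2 * tp[label] / denom) if denom else 0
    return scores


# Hypothetical toy example, not taken from the benchmark.
y_true = ["Needs Action", "Needs Action", "FYI", "Ignore", "Ignore", "Sent"]
y_pred = ["Needs Action", "FYI", "FYI", "FYI", "Newsletter", "Sent"]
scores = f1_per_category(y_true, y_pred)
print(scores)
```

In this toy run "Ignore" scores 0 because the classifier never predicts it correctly, which illustrates how a category can reach 0 even when the model's overall accuracy is moderate.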