Weather Report

What models we're running today, how they're configured, and what role each one plays in the factory.

Why do we publish The Weather Report?

The Weather Report started out as a casual internal summary of how each provider and model was performing on our most important use cases. We update it frequently and have found it essential to our process.

As of February 23rd, 2026

No changes to defaults. For anyone evaluating Gemini 3.1: gemini-3.1-pro-preview-customtools may significantly outperform gemini-3.1-pro-preview, depending on your harness. We've switched to gpt-realtime-1.5 for our internal use cases but aren't officially defaulting to it yet. We're very happy with Sonnet 4.6; it may overtake Opus for some of our everyday use cases.

| Use | Updated | Models (by preference) | Parameters | Notes |
|---|---|---|---|---|
| CS/Math Hard Problems | Feb 6 | gpt-5.3-codex | default | |
| Image comprehension | Feb 6 | gemini-3-flash-preview | default | |
| Frontend Aesthetics | Feb 6 | opus-4.6 | default | |
| Frontend Architecture | Feb 6 | gpt-5.3-codex | default | |
| Architectural Critique | Feb 6 | gpt-5.2 | extra high | |
| Sprint Planning | Feb 13 | consensus(opus-4.6, gpt-5.2) | high / extra high | |
| DevOps Tasks | Feb 6 | opus-4.6 | default | |
| QA Orchestration | Feb 6 | opus-4.6 | default | |
| Security review | Feb 6 | gpt-5.3-codex | high | |
| Bulk classification | Feb 6 | Any | default | Go up cost and strength as needed |
| Bulk MapReduce | Feb 6 | Any | default | Go up cost and strength as needed |
| UX Ideation | Feb 13 | gemini-3-pro-image-preview | default | Nano Banana Pro |
| Agentic dialogues | Feb 13 | gemini-3-flash-preview | default | General message handling loops with user interaction and limited tool calling |
| Voice (interactive) | Feb 23 | gpt-realtime-1.5 | default | Internal use; not yet an official default |

The consensus operator refers to an LLM merge of the points from independent plans.
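
For readers wondering what a consensus operator might look like in code, here is a minimal sketch. It is not our actual implementation. It assumes a model can be treated as a plain function from prompt to text. The names `consensus`, `Model`, the stub functions, and the merge prompt wording are all hypothetical and chosen for illustration. The stubs stand in for real model calls so the example runs without any provider SDK.

```python
from typing import Callable, Sequence

# Hypothetical type: a "model" is any callable that maps a prompt to text.
Model = Callable[[str], str]


def consensus(planners: Sequence[Model], merger: Model, task: str) -> str:
    """Ask each planner for a plan independently, then have a merger combine them."""
    if not planners:
        raise ValueError("consensus needs at least one planner")
    # Each planner sees only the task, never another planner's output.
    plans = [planner(f"Write a plan for: {task}") for planner in planners]
    numbered = "\n\n".join(
        f"Plan {i}:\n{text}" for i, text in enumerate(plans, start=1)
    )
    # Assumed merge prompt; the real wording is not specified in this report.
    merge_prompt = (
        "Merge the points from these independent plans into one plan.\n\n"
        f"Task: {task}\n\n{numbered}"
    )
    return merger(merge_prompt)


# Stubs standing in for real model calls (assumption: no network access here).
def stub_planner_a(prompt: str) -> str:
    return "1. Split the epic into tickets\n2. Estimate each ticket"


def stub_planner_b(prompt: str) -> str:
    return "1. Estimate each ticket\n2. Identify risks"


def stub_merger(prompt: str) -> str:
    # A real merger would be an LLM call; this stub only de-duplicates
    # the numbered points it finds in the prompt, preserving first-seen order.
    seen: list[str] = []
    for line in prompt.splitlines():
        head, sep, point = line.partition(". ")
        if sep and head.isdigit() and point not in seen:
            seen.append(point)
    return "\n".join(f"- {p}" for p in seen)


if __name__ == "__main__":
    print(consensus([stub_planner_a, stub_planner_b], stub_merger, "the next sprint"))
```

In this sketch, an entry such as consensus(opus-4.6, gpt-5.2) would correspond to passing two planner callables, one per model. Which model performs the merge is left open here, since the table does not say.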

Log

February 23rd, 2026

No changes to defaults. For anyone evaluating Gemini 3.1: gemini-3.1-pro-preview-customtools may significantly outperform gemini-3.1-pro-preview, depending on your harness. We've switched to gpt-realtime-1.5 for our internal use cases but aren't officially defaulting to it yet. We're very happy with Sonnet 4.6; it may overtake Opus for some of our everyday use cases.

February 13th, 2026

We're happy with gpt-5.3-codex-spark. gpt-5.3-codex remains our preferred default implementation model, with critiques and suggestions from Opus. Modified: Sprint Planning. Added: UX Ideation, Agentic dialogues, Voice (interactive).

February 6th, 2026

New models this week. We're very happy with gpt-5.3-codex. No problems with Opus 4.6 so far.