Sabiá 4 Thinking
The reasoning model in the Sabiá family: frontier-level quality in Portuguese at the lowest cost in its class, with major gains over Sabiá 4 in tool use, legal tasks and response quality.
Sabiá 4 Thinking is the reasoning model in the Sabiá family. It reaches frontier-level quality in Portuguese and Brazilian contexts at the lowest cost among the evaluated models, and it improves substantially over Sabiá 4, mainly in tool use, legal tasks and response quality.
Running the full benchmark suite on Sabiá 4 Thinking costs less than half as much as on GPT-5.4 and about a third as much as on Opus 4.8 (R$206 versus R$449 and R$590).
Benchmark evaluation
We evaluated Sabiá 4 Thinking against the leading frontier models (Gemini 3.1 Pro, GPT-5.4 and Opus 4.8) across three areas: function calling and agents, legal, and general tasks. On the overall average it sits about two points behind the top (90.8% vs 92.4%–92.8%). In the legal domain it posts the highest category score, level with GPT-5.4 at 86.7%.
| Category | Sabiá 4 Thinking | Gemini 3.1 Pro (medium) | GPT-5.4 (medium) | Opus 4.8 (medium) |
|---|---|---|---|---|
| Total cost to run · R$ | R$206 | R$281 | R$449 | R$590 |
| Function calling / Agents · Pix, Ticket, MARCA | 94% | 94.9% | 97.1% | 95.1% |
| Legal · OAB (judge), drafting, extraction | 86.7% | 86.1% | 86.7% | 86.4% |
| General · BLUEX, ENAMED, POSCOMP, PoETa v2, Sotaques | 91.3% | 94.6% | 93.8% | 94.7% |
| Overall average | 90.8% | 92.4% | 92.8% | 92.5% |
The table below breaks each category into the benchmarks that compose it. In bold, the best in each row.
Simple averages per category (accuracy % per benchmark). Costs in R$ — GPT, Opus and Gemini values converted at R$5.14/US$ (rate of June 19).
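The overall average in the summary table is not the plain mean of the three category rows. The reported figures are consistent with a simple mean over all eleven benchmarks, which weights each category by its number of benchmarks (3, 3 and 5). The sketch below checks that reading using the rounded category averages from the table. It is an inference from the published numbers, not a description of the evaluation code, and all names are illustrative.

```python
# Number of benchmarks per category, as listed in the summary table.
CATEGORY_SIZES = {"agents": 3, "legal": 3, "general": 5}

# Rounded category averages (accuracy %) copied from the summary table.
CATEGORY_AVERAGES = {
    "Sabiá 4 Thinking": {"agents": 94.0, "legal": 86.7, "general": 91.3},
    "Gemini 3.1 Pro": {"agents": 94.9, "legal": 86.1, "general": 94.6},
    "GPT-5.4": {"agents": 97.1, "legal": 86.7, "general": 93.8},
    "Opus 4.8": {"agents": 95.1, "legal": 86.4, "general": 94.7},
}


def overall_average(category_avgs):
    """Mean over all benchmarks, rebuilt from per-category means."""
    total = sum(category_avgs[cat] * n for cat, n in CATEGORY_SIZES.items())
    return total / sum(CATEGORY_SIZES.values())


for model, avgs in CATEGORY_AVERAGES.items():
    print(f"{model}: {overall_average(avgs):.1f}%")
```

Because the inputs are already rounded to one decimal, the reconstruction can only be expected to agree with the published row to within rounding.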
What improved over Sabiá 4
Sabiá 4 Thinking thinks before answering, and that extra step pays off mainly in three areas.
Agents is where the gain is largest. The model chains tool calls more consistently: it picks the right tool, keeps track of what it has already done, and makes fewer mistakes on multi-step tasks — like operating a system, searching the web, or completing a purchase.
In legal tasks, it judges pieces and questions with more rigor and drafts better-reasoned documents.
And in conversation, its answers tend to be more complete and better organized.
In the example below, Sabiá 4 Thinking is given a workstation quote, looks up each part’s real power draw on the web, and spots that the power supply is undersized:
Can you review this quote?
Thought and searched the web
Sources: techpowerup.com, nvidia.com, amd.com
| Component | Power draw (peak) |
|---|---|
| 2× NVIDIA RTX 5090 | 2 × 575 W = 1,150 W |
| AMD Ryzen 9 9950X | ~230 W |
| RAM, motherboard, SSDs and fans | ~135 W |
| Estimated peak total | ≈ 1,515 W |
| Quoted power supply | 1,200 W |
The 1,200 W power supply is undersized. The two RTX 5090s alone draw 1,150 W — and NVIDIA recommends 1,000 W for a single card. With the CPU and the rest, the peak reaches ≈ 1,515 W. You’d want a power supply of at least 1,600 W — ideally 2,000 W for headroom.
Illustrative example. Sabiá 4 Thinking reads the PDF, looks up each part’s official power draw and adds it up — catching a sizing problem that can go unnoticed.
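The arithmetic behind the example can be reproduced in a few lines. The wattages are the ones shown in the table above. Rounding the peak up to the next 100 W to pick a minimum supply size is an assumption made for this sketch, not a rule stated in the example.

```python
import math

# Peak power draw per component, in watts, as listed in the example table.
components_w = {
    "2x NVIDIA RTX 5090": 2 * 575,
    "AMD Ryzen 9 9950X": 230,
    "RAM, motherboard, SSDs and fans": 135,
}
QUOTED_PSU_W = 1200

peak_w = sum(components_w.values())
shortfall_w = peak_w - QUOTED_PSU_W


def next_psu_size(peak, step_w=100):
    """Smallest multiple of step_w that covers the peak (assumed sizing rule)."""
    return math.ceil(peak / step_w) * step_w


print(f"Estimated peak: {peak_w} W")                 # 1515 W
print(f"Shortfall vs quoted PSU: {shortfall_w} W")   # 315 W
print(f"Minimum PSU size: {next_psu_size(peak_w)} W")  # 1600 W
```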
About the benchmarks
We evaluated Sabiá 4 Thinking on eleven benchmarks, grouped into three areas: function calling and agents, legal, and general tasks. Here’s what each one measures.
Pix-Bench
Internal · PT · single-turn · pass@1
Pix-Bench evaluates how well models help with everyday banking tasks, like paying a bill or sending a Pix to someone. Acting as the assistant of a banking platform, the model has to interpret the user’s request, figure out the right contact or account on its own, and call the functions needed to carry out the action. Identification or parameter errors make the task fail, so the benchmark measures end-to-end accuracy, not just intent.
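To make "end-to-end accuracy" concrete, here is a minimal sketch of strict scoring, in which a task passes only if every function call has the right name and exactly the right arguments. The Pix-Bench harness is not described here, so the function name `send_pix`, the argument fields, and the order-sensitive comparison are all hypothetical choices for illustration.

```python
def call_matches(expected, actual):
    """A call is correct only if both the function name and all arguments match."""
    return (
        expected["name"] == actual["name"]
        and expected["arguments"] == actual["arguments"]
    )


def task_passes(expected_calls, actual_calls):
    """Strict check: same number of calls, each one matching in order (assumption)."""
    return len(expected_calls) == len(actual_calls) and all(
        call_matches(e, a) for e, a in zip(expected_calls, actual_calls)
    )


def pass_at_1(results):
    """Fraction of tasks solved on the first and only attempt."""
    return sum(results) / len(results)


# Hypothetical task: the user asks to send R$50 to a saved contact.
EXPECTED = [{"name": "send_pix", "arguments": {"key": "ana@example.com", "amount": 50.0}}]
ACTUAL_OK = [{"name": "send_pix", "arguments": {"key": "ana@example.com", "amount": 50.0}}]
ACTUAL_BAD = [{"name": "send_pix", "arguments": {"key": "ana@example.com", "amount": 5.0}}]

results = [task_passes(EXPECTED, ACTUAL_OK), task_passes(EXPECTED, ACTUAL_BAD)]
print(pass_at_1(results))  # 0.5
```

Under this kind of scoring, the second attempt fails even though the intent (send a Pix to the right person) was recognized, because one parameter is wrong.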
Ticket-Bench
Ticket-Bench evaluates how well models operate a ticket-buying platform. The environment provides information about the user and tools to search events, choose seats and complete the purchase; the model has to chain these calls correctly across a multi-turn conversation. It’s a multilingual benchmark, and here we report the Portuguese run, measuring the success rate on completing the purchase (pass@1).
MARCA (MAritaca Research Checklist evAluation)
MARCA evaluates how well models find information by browsing the web, focusing on questions that require breadth-first search — that is, collecting and synthesizing information from multiple sources to produce a report listing several entities. Each question comes with a checklist, used by a judge model (GPT-4.1) to measure the completeness and correctness of the answer.
OAB (judge)
Internal
Here the model acts as a judge: it scores pieces and questions from the second phase of the OAB (Brazilian Bar) exam, and the metric is the agreement between the model’s score and the human examiner’s. It measures how well the model masters legal grading criteria — not whether it writes well itself, but whether it judges as an experienced examiner would.
Legal drafting
Internal
Here the model drafts legal documents: initial petitions, defenses and rulings. The quality of each piece is judged by an LLM (GPT-5.4) comparing the answer to a reference, evaluating structure, reasoning and fit to what was requested. This is the benchmark where Sabiá 4 Thinking leads among the evaluated models.
Case extraction
Internal
Measures the ability to extract structured fields from real court cases — parties, claims, amounts, dates and other relevant metadata. Evaluation is done by rubric, comparing the extracted fields to a reference annotation. It’s a core task for automating the triage and organization of legal archives.
BLUEX
BLUEX gathers questions from the USP and UNICAMP university entrance exams, covering all subject areas of the Brazilian high school curriculum. The questions are multiple-choice, with the model’s answer checked by a judge model (Sabiazinho 4). Because it draws on recent Brazilian exams, it is a test of general knowledge anchored in our curriculum.
ENAMED
PROPOR 2026 paper · huggingface.co/datasets/recogna-nlp/enamed-2025
ENAMED is based on the National Medical Training Assessment Exam (INEP), given to graduating medical students. The questions are multiple-choice and scored by exact match, testing medical knowledge in Portuguese, a domain where errors are costly and precision matters.
POSCOMP
POSCOMP is the National Exam for Admission to Graduate Programs in Computing, organized by the Brazilian Computer Society (SBC). It covers computing fundamentals, mathematics and technology, in multiple-choice questions scored by exact match. It requires technical and quantitative reasoning, not just memorization.
PoETa v2
PoETa v2 is a broad Portuguese evaluation suite, with 44 tasks — 12 native to Portuguese and 32 translated from English — covering classification, reading comprehension and reasoning. By bringing together such diverse tasks, it measures the model’s robustness in aggregate; the score is reported on the NPM metric, with ten runs per task.
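NPM (Normalized Preferred Metric), as we understand its definition from the PoETa papers, rescales each task's preferred metric so that a random baseline maps to 0 and the maximum score maps to 100, then averages across tasks. The sketch below follows that definition; the two tasks and their scores are made up for illustration and are not PoETa v2 data.

```python
def npm(task_scores):
    """Normalized Preferred Metric over a list of tasks.

    Each entry is (raw_score, random_score, high_score) for one task, where
    random_score is the expected score of a random baseline and high_score is
    the maximum achievable score on that task's preferred metric.
    """
    normalized = [
        (raw - random_score) / (high - random_score)
        for raw, random_score, high in task_scores
    ]
    return 100 * sum(normalized) / len(normalized)


# Hypothetical example: a binary task at 75% accuracy and a 4-way
# multiple-choice task where the model is no better than chance.
example = [(0.75, 0.50, 1.0), (0.25, 0.25, 1.0)]
print(npm(example))  # 25.0
```

The normalization keeps easy-to-guess tasks from inflating the aggregate: 75% accuracy on a binary task contributes only half of the available credit.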
Sotaques Digitais
Sotaques Digitais (Digital Accents) evaluates the understanding of everyday Portuguese: irony, idioms and regionalisms, in the kind of text that circulates on social media and WhatsApp. There are 90 open-generation scenarios, with the answer rated by a judge model (GPT-5.4) on a 1-to-5 scale. It’s the benchmark closest to how Brazilians actually write.
Cost
We compare the cost of running the whole suite, not the per-token list price, because the number of reasoning tokens generated per task also matters. Even so, running the full suite on Sabiá 4 Thinking costs less than half as much as on GPT-5.4 and about a third as much as on Opus 4.8. Frontier-level quality in Portuguese doesn’t have to come with frontier-level cost.
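The ratios quoted above follow directly from the total-cost row of the summary table. A quick check, with illustrative variable names:

```python
# Total cost to run the full suite, in R$, from the summary table.
COSTS_BRL = {
    "Sabiá 4 Thinking": 206,
    "Gemini 3.1 Pro": 281,
    "GPT-5.4": 449,
    "Opus 4.8": 590,
}

baseline = COSTS_BRL["Sabiá 4 Thinking"]

# Cost of Sabiá 4 Thinking as a fraction of each model's cost.
ratios = {model: baseline / cost for model, cost in COSTS_BRL.items()}

for model, ratio in ratios.items():
    print(f"{model}: {ratio:.0%}")
# Sabiá 4 Thinking: 100%
# Gemini 3.1 Pro: 73%
# GPT-5.4: 46%
# Opus 4.8: 35%
```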
Availability
Sabiá 4 Thinking is already available via API. The documentation is at docs.maritaca.ai, and you can chat with the Sabiá family models at chat.maritaca.ai.