
Sabiá 4 Thinking

The reasoning model in the Sabiá family: frontier-level quality in Portuguese at the lowest cost in its class, with major gains over Sabiá 4 in tool use, legal tasks and response quality.

Sabiá 4 Thinking is the reasoning model in the Sabiá family. It reaches frontier-level quality in Portuguese and Brazilian contexts at the lowest cost among the evaluated models. And it improves substantially over Sabiá 4 — mainly in tool use, legal tasks and response quality.

Running the full benchmark suite on Sabiá 4 Thinking costs R$206: less than half the cost on GPT-5.4 (R$449) and about a third of the cost on Opus 4.8 (R$590).

Benchmark evaluation

We evaluated Sabiá 4 Thinking against the leading frontier models (Gemini 3.1 Pro, GPT-5.4 and Opus 4.8) across three areas: function calling and agents, legal, and general tasks. On the overall average it sits about two points behind the top (90.8% vs 92.4%–92.8%). In the legal domain it ties GPT-5.4 for the best category average (86.7%) and has the highest score on legal drafting.

| Category | Sabiá 4 Thinking | Gemini 3.1 Pro (medium) | GPT-5.4 (medium) | Opus 4.8 (medium) |
| --- | --- | --- | --- | --- |
| Total cost to run (R$) | R$206 | R$281 | R$449 | R$590 |
| Function calling / Agents · Pix, Ticket, MARCA | 94% | 94.9% | 97.1% | 95.1% |
| Legal · OAB (judge), drafting, extraction | 86.7% | 86.1% | 86.7% | 86.4% |
| General · BLUEX, ENAMED, POSCOMP, PoETa v2, Sotaques | 91.3% | 94.6% | 93.8% | 94.7% |
| Overall average | 90.8% | 92.4% | 92.8% | 92.5% |

The table below breaks each category into the benchmarks that compose it. In bold, the best in each row.

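| Benchmark | Sabiá 4 Thinking | Gemini 3.1 Pro (medium) | GPT-5.4 (medium) | Opus 4.8 (medium) |
| --- | --- | --- | --- | --- |
| Function calling / Agents | | | | |
| Pix-Bench · internal | **100%** | **100%** | **100%** | 97% |
| Ticket-Bench · public | 98% | **100%** | 98% | 96.7% |
| MARCA · public | 83.9% | 84.8% | **93.2%** | 91.5% |
| Legal | | | | |
| OAB (judge) · internal | 90.1% | 91.1% | **91.6%** | 90.1% |
| Legal drafting · internal | **77.7%** | 75.9% | 72.8% | 74.8% |
| Case extraction · internal | 92.3% | 91.4% | **95.7%** | 94.3% |
| General | | | | |
| BLUEX · public | 93% | **96.8%** | 95.7% | 95.4% |
| ENAMED · public | 94.4% | **98.9%** | 97.8% | 97.8% |
| POSCOMP · public | 90.8% | 94.6% | 94.6% | **96.2%** |
| PoETa v2 · public | 83.7% | 85% | 83.3% | **86.3%** |
| Sotaques Digitais · public | 94.6% | 97.6% | **97.8%** | **97.8%** |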

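Category scores are simple averages of the per-benchmark accuracy (%). The overall average is consistent with a simple mean over all eleven benchmarks, not a mean of the three category averages. Costs are in R$; GPT, Opus and Gemini values were converted at R$5.14/US$ (rate of June 19).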
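The averages in the summary table can be re-derived from the per-benchmark scores. The sketch below is only a cross-check: the dictionary layout and the function names are ours, and rounding to one decimal place is an assumption about how the published figures were produced.

```python
from statistics import mean

# Per-benchmark accuracy (%) copied from the breakdown table, grouped by category.
SCORES = {
    "Sabiá 4 Thinking": {
        "agents": [100, 98, 83.9],
        "legal": [90.1, 77.7, 92.3],
        "general": [93, 94.4, 90.8, 83.7, 94.6],
    },
    "Gemini 3.1 Pro": {
        "agents": [100, 100, 84.8],
        "legal": [91.1, 75.9, 91.4],
        "general": [96.8, 98.9, 94.6, 85, 97.6],
    },
    "GPT-5.4": {
        "agents": [100, 98, 93.2],
        "legal": [91.6, 72.8, 95.7],
        "general": [95.7, 97.8, 94.6, 83.3, 97.8],
    },
    "Opus 4.8": {
        "agents": [97, 96.7, 91.5],
        "legal": [90.1, 74.8, 94.3],
        "general": [95.4, 97.8, 96.2, 86.3, 97.8],
    },
}


def category_averages(by_category):
    """Simple mean of the benchmarks inside each category, rounded to 0.1."""
    return {name: round(mean(vals), 1) for name, vals in by_category.items()}


def overall_average(by_category):
    """Simple mean over every benchmark, ignoring category boundaries."""
    all_scores = [s for vals in by_category.values() for s in vals]
    return round(mean(all_scores), 1)


if __name__ == "__main__":
    for model, by_category in SCORES.items():
        print(model, category_averages(by_category), overall_average(by_category))
```

Averaging over benchmarks rather than over categories gives the five general benchmarks more weight than the three in each of the other areas.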

What improved over Sabiá 4

Sabiá 4 Thinking thinks before answering, and that extra step pays off mainly in three areas.

Agentic tasks are where the gain is largest. The model chains tool calls more consistently: it picks the right tool, keeps track of what it has already done, and makes fewer mistakes on multi-step tasks such as operating a system, searching the web, or completing a purchase.

In legal tasks, it judges pieces and questions with more rigor and drafts better-reasoned documents.

And in conversation, its answers tend to be more complete and better organized.

In the example below, Sabiá 4 Thinking is given a workstation quote, looks up each part’s real power draw on the web, and spots that the power supply is undersized:

Question · quote in PDF
PDF: workstation-quote.pdf

Can you review this quote?

Sabiá 4 Thinking

Thought and searched the web

Sources: techpowerup.com, nvidia.com, amd.com

| Component | Power draw (peak) |
| --- | --- |
| 2× NVIDIA RTX 5090 | 2 × 575 W = 1,150 W |
| AMD Ryzen 9 9950X | ~230 W |
| RAM, motherboard, SSDs and fans | ~135 W |
| Estimated peak total | ≈ 1,515 W |
| Quoted power supply | 1,200 W |

The 1,200 W power supply is undersized. The two RTX 5090s alone draw 1,150 W — and NVIDIA recommends 1,000 W for a single card. With the CPU and the rest, the peak reaches ≈ 1,515 W. You’d want a power supply of at least 1,600 W — ideally 2,000 W for headroom.

Illustrative example. Sabiá 4 Thinking reads the PDF, looks up each part’s official power draw and adds it up — catching a sizing problem that can go unnoticed.
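The arithmetic behind the example is simple enough to check by hand or in a few lines of code. This is a minimal sketch using the figures from the illustrative answer; the names are ours, and the headroom calculation is just one way to express the margin.

```python
# Peak draw per component in watts, copied from the illustrative example.
PEAK_DRAW_W = {
    "2x NVIDIA RTX 5090": 2 * 575,
    "AMD Ryzen 9 9950X": 230,
    "RAM, motherboard, SSDs and fans": 135,
}
QUOTED_PSU_W = 1200


def peak_total(draws):
    """Sum the peak draw of every component."""
    return sum(draws.values())


def is_undersized(psu_w, draws):
    """A power supply is undersized if it cannot cover the summed peak draw."""
    return psu_w < peak_total(draws)


def headroom(psu_w, draws):
    """Spare capacity as a fraction of the peak draw (negative means a shortfall)."""
    total = peak_total(draws)
    return (psu_w - total) / total


if __name__ == "__main__":
    print(peak_total(PEAK_DRAW_W))                    # 1515
    print(is_undersized(QUOTED_PSU_W, PEAK_DRAW_W))   # True
    print(round(headroom(2000, PEAK_DRAW_W), 2))      # 0.32
```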

About the benchmarks

We evaluated Sabiá 4 Thinking on eleven benchmarks, grouped into three areas: function calling and agents, legal, and general tasks. Here’s what each one measures.

Pix-Bench

Internal · PT · single-turn · pass@1

Pix-Bench evaluates how well models help with everyday banking tasks, like paying a bill or sending a Pix to someone. Acting as the assistant of a banking platform, the model has to interpret the user’s request, figure out the right contact or account on its own, and call the functions needed to carry out the action. Identification or parameter errors make the task fail, so the benchmark measures end-to-end accuracy, not just intent.
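To make "end-to-end accuracy" concrete, here is a minimal sketch of a strict grader in which a task passes only if the function name and every argument match. The actual Pix-Bench grader is not described here, so the call format, the `send_pix` function and its parameters are hypothetical.

```python
def call_matches(expected, actual):
    """A call counts only if the function name and all arguments are identical."""
    return (
        expected["name"] == actual["name"]
        and expected["arguments"] == actual["arguments"]
    )


def task_passes(expected_calls, actual_calls):
    """The task fails on any missing, extra, or mismatched call."""
    if len(expected_calls) != len(actual_calls):
        return False
    return all(call_matches(e, a) for e, a in zip(expected_calls, actual_calls))


def pass_at_1(outcomes):
    """Percentage of tasks solved on the single attempt made for each."""
    return 100 * sum(outcomes) / len(outcomes)


# Hypothetical task: the user asks to send R$150 to a saved contact.
expected = [{"name": "send_pix", "arguments": {"recipient_id": "c-042", "amount": 150.0}}]
right = [{"name": "send_pix", "arguments": {"recipient_id": "c-042", "amount": 150.0}}]
wrong_contact = [{"name": "send_pix", "arguments": {"recipient_id": "c-043", "amount": 150.0}}]

if __name__ == "__main__":
    print(task_passes(expected, right))          # True
    print(task_passes(expected, wrong_contact))  # False: right intent, wrong contact
```

The second case shows why intent alone is not enough: the model chose the right function, but resolving the wrong contact still fails the task.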

Ticket-Bench

arxiv.org/abs/2509.14477

Ticket-Bench evaluates how well models operate a ticket-buying platform. The environment provides information about the user and tools to search events, choose seats and complete the purchase; the model has to chain these calls correctly across a multi-turn conversation. It’s a multilingual benchmark, and here we report the Portuguese run, measuring the success rate on completing the purchase (pass@1).

MARCA (MAritaca Research Checklist evAluation)

github.com/maritaca-ai/MARCA

MARCA evaluates how well models find information by browsing the web, focusing on questions that require breadth-first search — that is, collecting and synthesizing information from multiple sources to produce a report listing several entities. Each question comes with a checklist, used by a judge model (GPT-4.1) to measure the completeness and correctness of the answer.

OAB (judge)

Internal

Here the model acts as a judge: it scores pieces and questions from the second phase of the OAB (Brazilian Bar) exam, and the metric is the agreement between the model’s score and the human examiner’s. It measures how well the model masters legal grading criteria — not whether it writes well itself, but whether it judges as an experienced examiner would.

Legal drafting

Internal

Here the model drafts legal documents: initial petitions, defenses and rulings. The quality of each piece is judged by an LLM (GPT-5.4) comparing the answer to a reference, evaluating structure, reasoning and fit to what was requested. This is the benchmark where Sabiá 4 Thinking leads among the evaluated models.

Case extraction

Internal

Measures the ability to extract structured fields from real court cases — parties, claims, amounts, dates and other relevant metadata. Evaluation is done by rubric, comparing the extracted fields to a reference annotation. It’s a core task for automating the triage and organization of legal archives.
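Comparing extracted fields to a reference annotation can be sketched as a field-by-field check. The benchmark's actual rubric is not public in this post, so the field names, the normalization and the scoring rule below are all assumptions made for illustration.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(str(value).lower().split())


def field_accuracy(reference, extracted):
    """Fraction of reference fields that the extraction reproduces after normalization."""
    if not reference:
        raise ValueError("reference annotation must contain at least one field")
    hits = sum(
        1
        for key, ref_value in reference.items()
        if key in extracted and normalize(extracted[key]) == normalize(ref_value)
    )
    return hits / len(reference)


# Hypothetical annotation and model output for a single case.
reference = {
    "plaintiff": "Maria Silva",
    "claim_amount": "R$ 10.000,00",
    "filing_date": "2024-03-01",
}
extracted = {
    "plaintiff": "maria  silva",      # differs only in case and spacing
    "claim_amount": "R$ 10.000,00",
    "filing_date": "2024-03-10",      # wrong date
}

if __name__ == "__main__":
    print(field_accuracy(reference, extracted))
```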

BLUEX

arxiv.org/abs/2307.05410

BLUEX gathers questions from the USP and UNICAMP university entrance exams, covering all subject areas of the Brazilian high school curriculum. The questions are multiple-choice, and the model’s answer is checked by a judge model (Sabiazinho 4). Because it draws on recent Brazilian exams, it tests general knowledge anchored in our curriculum.

ENAMED

PROPOR 2026 paper · huggingface.co/datasets/recogna-nlp/enamed-2025

ENAMED is based on the National Medical Training Assessment Exam (INEP), given to graduating medical students. They are multiple-choice questions scored by exact match, testing medical knowledge in Portuguese — a domain where errors are costly and precision matters.
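Exact-match scoring for multiple-choice questions reduces to comparing the chosen letter with the answer key. A minimal sketch, assuming the model's answer has already been reduced to a single letter (how that letter is extracted from the raw output is not covered here):

```python
def exact_match_accuracy(predictions, answer_key):
    """Percentage of questions where the predicted letter equals the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer key must have the same length")
    if not answer_key:
        raise ValueError("answer key must not be empty")
    correct = sum(
        pred.strip().upper() == gold.strip().upper()
        for pred, gold in zip(predictions, answer_key)
    )
    return 100 * correct / len(answer_key)


if __name__ == "__main__":
    # Three of four answers match; case and stray spaces are ignored.
    print(exact_match_accuracy(["A", "b ", "C", "D"], ["A", "B", "D", "D"]))  # 75.0
```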

POSCOMP

arxiv.org/abs/2511.17808

POSCOMP is the National Exam for Admission to Graduate Programs in Computing, organized by the Brazilian Computer Society (SBC). It covers computing fundamentals, mathematics and technology, in multiple-choice questions scored by exact match. It requires technical and quantitative reasoning, not just memorization.

PoETa v2

arxiv.org/abs/2511.17808

PoETa v2 is a broad Portuguese evaluation suite with 44 tasks (12 native to Portuguese and 32 translated from English) covering classification, reading comprehension and reasoning. By bringing together such diverse tasks, it measures the model’s robustness in aggregate; the score is reported as NPM (Normalized Preferred Metric), with ten runs per task.
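For readers unfamiliar with NPM: as we understand it from the PoETa papers, each task's preferred metric is rescaled so that random guessing maps to 0 and the maximum score maps to 100, and the rescaled values are then averaged. The sketch below follows that reading; treat the exact formula as an assumption and check the paper for the authoritative definition. The example scores are made up.

```python
def npm(tasks):
    """Average of per-task scores rescaled to 0 (random baseline) .. 100 (maximum).

    tasks: iterable of (raw_score, random_score, max_score) tuples, one per task.
    """
    tasks = list(tasks)
    if not tasks:
        raise ValueError("at least one task is required")
    normalized = [
        (raw - random_score) / (max_score - random_score)
        for raw, random_score, max_score in tasks
    ]
    return 100 * sum(normalized) / len(normalized)


if __name__ == "__main__":
    # Made-up example: a binary task (chance = 0.5) and a four-way task (chance = 0.25).
    print(npm([(0.75, 0.5, 1.0), (0.625, 0.25, 1.0)]))  # 50.0
```

The rescaling keeps a task with a high chance baseline, such as binary classification, from inflating the aggregate.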

Sotaques Digitais

ramondomingos.com.br

Sotaques Digitais (Digital Accents) evaluates the understanding of everyday Portuguese: irony, idioms and regionalisms, in the kind of text that circulates on social media and WhatsApp. There are 90 open-generation scenarios, with the answer rated by a judge model (GPT-5.4) on a 1-to-5 scale. It’s the benchmark closest to how Brazilians actually write.

Cost

We compare the cost of running the whole suite rather than the per-token list price, because the number of reasoning tokens generated per task also drives the bill. Even with those reasoning tokens counted, running the full suite on Sabiá 4 Thinking costs R$206: less than half of GPT-5.4 (R$449) and about a third of Opus 4.8 (R$590). Frontier-level quality in Portuguese doesn’t have to come with frontier-level cost.
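The ratios quoted above follow directly from the suite totals in the summary table. A short sketch of the comparison, with names of our own choosing:

```python
# Total cost (R$) to run the full benchmark suite, from the summary table.
SUITE_COST_BRL = {
    "Sabiá 4 Thinking": 206,
    "Gemini 3.1 Pro": 281,
    "GPT-5.4": 449,
    "Opus 4.8": 590,
}


def cost_ratio(model, baseline, costs=SUITE_COST_BRL):
    """Cost of `model` as a fraction of the cost of `baseline`."""
    return costs[model] / costs[baseline]


if __name__ == "__main__":
    for baseline in ("Gemini 3.1 Pro", "GPT-5.4", "Opus 4.8"):
        ratio = cost_ratio("Sabiá 4 Thinking", baseline)
        print(f"vs {baseline}: {ratio:.2f}")
```

These are end-to-end totals, so they already include the effect of how many reasoning tokens each model generates, which a per-token price comparison would miss.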

Availability

Sabiá 4 Thinking is already available via API. The documentation is at docs.maritaca.ai, and you can chat with the Sabiá family models at chat.maritaca.ai.