Sabiá-4 — Maritaca AI

We are launching Sabiá-4 in preview. It is the next-generation model in the Sabiá family, designed with a focus on cost and performance for complex tasks. The generation-4 models represent a significant advance over the previous generation, especially in areas where the older version had limitations.

Figure 1: cost (X axis) vs quality (Y axis) of evaluated models.

Pre-training improvements

We identified key limitations of the previous generation in more demanding scenarios and improved pre-training to cover these areas:

Brazilian legal domain — laws, precedents, decisions and legal writing.
Long context — up to 128K tokens, ideal for analyzing court cases and contracts.
Brazilian knowledge — current events, institutions and national literature.
Agent capabilities — stable function calling and tool orchestration.

Post-training improvements

We also improved post-training, mainly to address gaps in the previous generation:

Instruction following — more consistent, faithful responses to user requests.
Function calling — correct execution in complex environments.
Web search — appropriate use of external tools when available.

Benchmark evaluation

The models were evaluated on multiple benchmarks. Table 1 below shows the quality score of each model on each benchmark, together with the total cost in BRL of running the evaluation.

Benchmark	Sabiá-4	Sabiá-3.1	GPT-4.1	GPT-5.2 no reasoning	GPT-5.2 reasoning	Gemini-3-Pro (reasoning low)	Gemini-3-Pro (reasoning high)	Kimi-k2-thinking	Qwen3-235b-instruct-2507	Deepseek-v3.2
Cost	R$80.49	R$62.15	R$182.49	R$307.12	R$752.41	R$403.31	R$804.07	R$516.52	R$44.36	R$49.22
Brazilian Laws	97.4	77.8	80.8	84.0	86.3	74.9	88.6	59.1	65.9	67.3
OAB Bench	7.49	7.21	7.30	8.07	8.73	9.05	8.90	6.62	6.33	6.40
Magis Bench	5.08	4.97	5.55	6.66	6.99	7.79	7.48	4.49	4.52	4.88
Agentic capabilities	72.2	43.1	73.3	81.1	85.7	90.4	90.1	77.3	67.8	40.5
Brazilian exams	86.6	82.4	86.1	88.0	92.9	93.3	95.0	83.0	82.0	84.0
Multi-IF Portuguese	82.0	80.7	82.7	83.7	87.2	86.0	88.0	86.0	84.4	81.5
BRACEval	53.8	44.6	50.2	59.0	60.2	70.8	68.1	56.9	65.6	60.8

Table 1: Sabiá-4 — quality and cost comparison across frontier models.
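
One way to read Table 1 is to ask which models are not beaten on both cost and quality at once. The sketch below computes that Pareto frontier for a single row, using the "Cost" and "Brazilian Laws" values from Table 1. The choice of row, the `Model` type and the `pareto_frontier` helper are illustrative assumptions, not part of the published evaluation.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost_brl: float  # "Cost" row of Table 1
    score: float     # "Brazilian Laws" row of Table 1


MODELS = [
    Model("Sabiá-4", 80.49, 97.4),
    Model("Sabiá-3.1", 62.15, 77.8),
    Model("GPT-4.1", 182.49, 80.8),
    Model("GPT-5.2 no reasoning", 307.12, 84.0),
    Model("GPT-5.2 reasoning", 752.41, 86.3),
    Model("Gemini-3-Pro (reasoning low)", 403.31, 74.9),
    Model("Gemini-3-Pro (reasoning high)", 804.07, 88.6),
    Model("Kimi-k2-thinking", 516.52, 59.1),
    Model("Qwen3-235b-instruct-2507", 44.36, 65.9),
    Model("Deepseek-v3.2", 49.22, 67.3),
]


def pareto_frontier(models):
    """Return the models that no cheaper-or-equal model matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a cost tie, visit the higher score first.
    for m in sorted(models, key=lambda m: (m.cost_brl, -m.score)):
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


frontier_names = [m.name for m in pareto_frontier(MODELS)]
print(frontier_names)
```

The frontier depends on the row chosen: repeating the exercise with another benchmark, such as Agentic capabilities, selects a different set of models.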

Below we highlight some of the main benchmarks.

OAB-Bench

arxiv.org/abs/2504.21202

OAB-Bench is a benchmark that evaluates language models on complex legal-writing tasks, based on the second phase of Brazil’s Bar exam (OAB). It includes 105 questions from recent editions of the exam, distributed across seven areas of law, with the same complete evaluation guidelines used by human assessors.

Magis-Bench

Magis-Bench evaluates LLMs on highly complex legal tasks, focusing on Brazilian public-service exams for substitute judge positions. It is built from real, recent exam materials, including a written exam and two practical exams (civil and criminal sentences).

Ticket-Bench

arxiv.org/abs/2509.14477

Ticket-Bench evaluates models’ ability to operate a soccer-ticket purchase platform — searching matches, picking seats and completing the purchase.

Pix-Bench

Pix-Bench evaluates models on everyday financial tasks like paying a bill or sending a Pix transfer. Acting as a banking assistant, the model must interpret the user request and execute the correct action.
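
As a rough illustration of what "executing the correct action" could mean, the sketch below compares a tool call emitted by a model with the call a test case expects. The tool name `send_pix`, its arguments and the exact-match rule are hypothetical; the benchmark's real tools and scoring are not described in this post.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str                                      # hypothetical tool name
    arguments: dict = field(default_factory=dict)  # hypothetical arguments


def is_correct_action(predicted: ToolCall, expected: ToolCall) -> bool:
    """Exact match on tool name and arguments (an assumed, strict scoring rule)."""
    return predicted.name == expected.name and predicted.arguments == expected.arguments


# Hypothetical test case: "Send R$50 to maria@example.com via Pix."
expected = ToolCall("send_pix", {"key": "maria@example.com", "amount_brl": 50.0})
right = ToolCall("send_pix", {"key": "maria@example.com", "amount_brl": 50.0})
wrong_amount = ToolCall("send_pix", {"key": "maria@example.com", "amount_brl": 500.0})

print(is_correct_action(right, expected))
print(is_correct_action(wrong_amount, expected))
```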

MARCA (MAritaca Research Checklist evAluation)

MARCA evaluates models on finding information via web navigation, focusing on questions that require breadth-first search across multiple sources. Each question comes with a checklist used to assess answer completeness and correctness.
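
A checklist-based score can be pictured as the fraction of checklist items an answer covers. The sketch below uses a case-insensitive substring match as a stand-in for the judgment step; that matching rule, the `checklist_score` name and the sample question are assumptions made for illustration, not MARCA's actual procedure.

```python
def checklist_score(answer: str, checklist: list[str]) -> float:
    """Fraction of checklist items found in the answer (case-insensitive substring match)."""
    if not checklist:
        raise ValueError("checklist must not be empty")
    text = answer.casefold()
    hits = sum(1 for item in checklist if item.casefold() in text)
    return hits / len(checklist)


# Hypothetical question: "Which states make up Brazil's South region?"
checklist = ["Paraná", "Santa Catarina", "Rio Grande do Sul"]
answer = "The South region includes Paraná and Santa Catarina."

score = checklist_score(answer, checklist)
print(f"{score:.2f}")
```

An incomplete answer like this one gets partial credit, which is how a checklist can capture completeness as well as correctness.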

CLIMB (CheckList-based Inference for Multihop with Browsing)

CLIMB tests models on multi-hop search: questions whose answer requires chaining several lookups and browsing across linked pages until a final answer is reached.

Brazilian Laws

This benchmark evaluates models on Brazilian federal legislation (more than 50,000 acts, including laws, decrees and provisional measures). Questions are multiple-choice and come in two variations: identify the law a passage belongs to, or identify the exact reference.

Multi-IF

arxiv.org/abs/2410.15553

Multi-IF evaluates whether models can follow instructions that accumulate across a multi-turn conversation, with constraints added each turn.
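
The accumulation can be sketched as follows: each turn adds constraints, and a response passes only if it satisfies every constraint introduced so far. The three constraints and the `turn_results` helper are hypothetical examples, and the real benchmark's instruction types and scoring may differ.

```python
from typing import Callable

Constraint = Callable[[str], bool]


def turn_results(responses: list[str], new_constraints: list[list[Constraint]]) -> list[bool]:
    """For each turn, check the response against every constraint introduced so far."""
    active: list[Constraint] = []
    results = []
    for response, added in zip(responses, new_constraints):
        active.extend(added)  # constraints accumulate across turns
        results.append(all(check(response) for check in active))
    return results


# Hypothetical constraints, one added per turn.
new_constraints = [
    [lambda r: len(r.split()) <= 5],  # turn 1: at most five words
    [lambda r: r.isupper()],          # turn 2: also all uppercase
    [lambda r: "keyword" in r],       # turn 3: also mention "keyword"
]
responses = ["short answer here", "SHORT ANSWER HERE", "Short answer with keyword"]

results = turn_results(responses, new_constraints)
print(results)
```

The third response satisfies the newest constraint but drops the uppercase requirement from turn 2, so it fails. That is the kind of regression a multi-turn instruction-following benchmark is meant to catch.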

BRACEval (Brazilian Chat Evaluation)

arxiv.org/abs/2403.09887

BRACEval evaluates chatbots on open-ended, complex multi-turn instructions emphasizing Brazilian context — 150 questions across 13 categories.

Total cost

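The real cost of using a language model depends on more than the token price: latency and the number of tokens consumed per task also matter, and the resulting cost varies by benchmark. The table below reports the cost in BRL of running each benchmark for each model; its Total row matches the Cost row in Table 1.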

Benchmark	Sabiá-4	Sabiá-3.1	GPT-4.1	GPT-5.2 no reasoning	GPT-5.2 reasoning	Gemini-3-Pro (reasoning low)	Gemini-3-Pro (reasoning high)	Kimi-k2-thinking	Qwen3-235b-instruct	Deepseek-v3.2
OAB Bench	R$3.06	R$2.17	R$6.26	R$9.02	R$22.03	R$12.11	R$26.15	R$12.68	R$1.43	R$0.75
Magis Bench	R$2.44	R$1.74	R$4.88	R$9.33	R$23.91	R$14.39	R$24.89	R$8.07	R$0.69	R$0.49
Brazilian Laws	R$7.18	R$7.89	R$13.73	R$11.97	R$37.38	R$31.68	R$86.33	R$33.17	R$1.99	R$2.19
Agentic capabilities	R$35.97	R$25.49	R$102.15	R$181.09	R$467.29	R$178.90	R$224.55	R$304.58	R$27.47	R$38.81
Brazilian exams	R$6.36	R$4.38	R$7.48	R$4.21	R$30.77	R$36.32	R$101.22	R$41.37	R$3.07	R$1.65
Multi-IF Portuguese	R$16.08	R$13.80	R$28.73	R$58.80	R$97.76	R$79.40	R$206.23	R$70.62	R$5.43	R$3.55
BRACEval	R$2.38	R$1.48	R$3.74	R$7.50	R$13.51	R$11.27	R$28.50	R$10.64	R$0.90	R$0.50
Total	R$80.49	R$62.15	R$182.49	R$307.12	R$752.41	R$403.31	R$804.07	R$516.52	R$44.36	R$49.22

Figure 11: Sabiá-4 — costs in BRL to evaluate models on the published benchmarks.
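
To compare these totals at a glance, it can help to express each one as a multiple of a baseline. The sketch below does that with the Total row above, taking Sabiá-4 as the baseline. The `relative_cost` helper and the choice of baseline are illustrative assumptions.

```python
# Values copied from the Total row of the cost table (BRL).
TOTAL_COST_BRL = {
    "Sabiá-4": 80.49,
    "Sabiá-3.1": 62.15,
    "GPT-4.1": 182.49,
    "GPT-5.2 no reasoning": 307.12,
    "GPT-5.2 reasoning": 752.41,
    "Gemini-3-Pro (reasoning low)": 403.31,
    "Gemini-3-Pro (reasoning high)": 804.07,
    "Kimi-k2-thinking": 516.52,
    "Qwen3-235b-instruct": 44.36,
    "Deepseek-v3.2": 49.22,
}


def relative_cost(costs: dict[str, float], baseline: str) -> dict[str, float]:
    """Express every cost as a multiple of the baseline model's cost."""
    base = costs[baseline]
    return {name: cost / base for name, cost in costs.items()}


ratios = relative_cost(TOTAL_COST_BRL, "Sabiá-4")
# Print from cheapest to most expensive relative to the baseline.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:32s} {ratio:5.2f}x")
```

On these totals, evaluating GPT-4.1 costs roughly 2.3 times as much as evaluating Sabiá-4, while the two open-weight models at the right of the table cost less than Sabiá-4.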

Next steps

The launch of generation 4 (Sabiazinho-4 and Sabiá-4) is an important step toward our next model generations. You can find more details on how to use the new model in our documentation: docs.maritaca.ai.