Sabiazinho-4

We're introducing Sabiazinho-4, our new model focused on speed and low cost, with improvements in the legal domain, long context, instruction following, and agent capabilities.

We’re launching Sabiazinho-4 in preview. It is the first model in the next generation of the Sabiá family and was designed with a focus on cost and latency. The model represents a significant advance over Sabiazinho-3, especially in areas where the previous generation had limitations.

| Benchmark | Description | Metric | Sabiazinho 4 | Sabiazinho 3 | gpt-oss 120b | GPT-4.1 mini | GPT-5 mini |
|---|---|---|---|---|---|---|---|
| Cost | Cost to run benchmarks below | BRL spent on tokens via API | R$15.87 | R$9.42 | R$33.24 | R$47.59 | R$102.13 |
| OAB Bench | Legal writing (lawyer), 21 exams | Avg. score (0-10) | 7.02 | 6.01 | 5.99 | 5.50 | 6.37 |
| Magis Bench | Legal writing (judge), 24 exams | Avg. score (0-10) | 4.50 | 3.64 | 3.62 | 3.67 | 4.47 |
| Brazilian Laws | Knowledge of Brazilian legislation | Accuracy (5 options) | 85.0% | 72.9% | 52.3% | 57.0% | 68.2% |
| Agentic capabilities | Tool use across 4 Portuguese envs | Pass³ and success@1 | 55.2% | 14.1% | 60.9% | 59.4% | 85.1% |
| Brazilian exams | 13 exams (ENEM, USP, OAB, etc.) | Accuracy (4-5 options) | 81.5% | 77.9% | 77.0% | 81.0% | 84.6% |
| Multi-IF Portuguese | Instruction following | Strict, avg of 3 turns | 81.4% | 72.2% | 82.0% | 79.6% | 85.8% |
| BRACEval | Portuguese conversational skills | Wins vs GPT-4o | 66.5% | 36.2% | 55.8% | 32.7% | 56.3% |

Table 1: performance and cost comparison across analyzed models.
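The table does not define the Pass³ metric. If it follows the pass^k convention used in some agent benchmarks, where a task counts only when all k independent attempts succeed, it can be estimated from n recorded attempts with c successes as C(c, k) / C(n, k). The sketch below works under that assumption; the function names and the sample data are illustrative and do not come from the benchmark itself.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k attempts drawn from the recorded trials all succeed.

    Assumes the pass^k convention: C(successes, k) / C(trials, k).
    """
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # math.comb returns 0 when fewer than k attempts succeeded.
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks in an environment."""
    scores = [pass_hat_k(num_trials, c, k) for c in successes_per_task]
    return sum(scores) / len(scores)


# Hypothetical data: four tasks, four attempts each, number of successful attempts per task.
example_successes = [4, 3, 2, 0]
example_score = mean_pass_hat_k(example_successes, num_trials=4, k=3)
```

Under this definition a model that succeeds on most attempts but not consistently scores far lower than its single-attempt success rate, which is why the metric is a common proxy for reliability in tool-use settings.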

Pre-training improvements

We identified the key limitations of Sabiazinho-3 in demanding scenarios and improved pre-training in the following areas:

  • Brazilian legal domain — laws, precedents, decisions, legal writing.
  • Long context — up to 128K tokens.
  • Brazilian knowledge — current events, institutions, national literature.
  • Agent capabilities — stable function calling and tool orchestration.

Post-training improvements

We also improved post-training to address gaps in the previous model:

  • Instruction following — more consistent responses.
  • Function calling — model now invokes available functions correctly.
  • Web search — proper use of external tools when needed.

In the figures below, we illustrate behavior differences between Sabiazinho-3 and Sabiazinho-4 when answering a simple question that requires a web search. Sabiazinho-3 didn’t call web_search() correctly; Sabiazinho-4 makes the call as expected.

Old Sabiazinho-3 response

Figure 1: response from the old Sabiazinho-3 model. Note that the model does not call the function it has access to: web_search().

New Sabiazinho-4 response

Figure 2: response from the new Sabiazinho-4 model. The new model makes the call correctly.
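To make the difference between the two figures concrete, here is a minimal sketch of what "making the call correctly" means on the application side. It assumes an OpenAI-style function-calling format for the tool description and for the tool call the model emits; that format, the stubbed `web_search` body, the `run_tool_call` helper, and the example call are all assumptions for illustration, not details taken from this post. No network request is made.

```python
import json

# Hypothetical tool description in an OpenAI-style function-calling schema (assumption).
WEB_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return short result snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}


def web_search(query: str) -> str:
    """Stand-in for a real search backend; returns a canned string."""
    return f"(stub) results for: {query}"


# Registry mapping tool names the model may request to local Python callables.
TOOLS = {"web_search": web_search}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call emitted by the model and build the reply message."""
    name = tool_call["function"]["name"]
    if name not in TOOLS:
        raise KeyError(f"model requested unknown tool: {name}")
    # Arguments arrive as a JSON-encoded string in this format.
    arguments = json.loads(tool_call["function"]["arguments"])
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": TOOLS[name](**arguments),
    }


# Hand-written example of a well-formed call, as a model that uses the tool would emit it.
example_call = {
    "id": "call_1",
    "type": "function",
    "function": {
        "name": "web_search",
        "arguments": json.dumps({"query": "latest ENEM exam date"}),
    },
}
tool_message = run_tool_call(example_call)
```

A model that answers in plain text instead of emitting a structured call like `example_call` never reaches `run_tool_call`, so the application has no search results to feed back.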

Total cost

When calculating a model's real cost, every factor matters: token price, latency, tokens used per task, and benchmark-specific cost.

| Benchmark | sabiazinho-3 | sabiazinho-4 | gpt-oss-120b | gpt-4.1-mini | gpt-5-mini |
|---|---|---|---|---|---|
| OAB Bench | R$0.50 | R$0.81 | R$1.50 | R$1.20 | R$5.90 |
| Magis Bench | R$0.30 | R$0.46 | R$0.71 | R$0.89 | R$3.84 |
| Brazilian Laws | R$1.44 | R$1.97 | R$2.86 | R$2.74 | R$7.36 |
| Agentic capabilities | R$3.88 | R$7.70 | R$20.49 | R$33.75 | R$47.70 |
| Brazilian exams | R$0.62 | R$0.54 | R$1.62 | R$2.16 | R$8.51 |
| Multi-IF Portuguese | R$2.35 | R$3.71 | R$5.11 | R$6.18 | R$25.87 |
| BRACEval | R$0.33 | R$0.68 | R$0.96 | R$0.67 | R$2.95 |
| Total | R$9.42 | R$15.87 | R$33.24 | R$47.59 | R$102.13 |

Figure 12: costs in BRL to evaluate models on the published benchmarks.
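The totals can be reproduced from the per-benchmark figures, and the same numbers give a quick view of how much more each model costs relative to Sabiazinho-4. The sketch below uses only the values published in the cost table; the variable names are illustrative. Because the per-benchmark figures are rounded to centavos, a recomputed sum can differ from the reported total by R$0.01 (this happens for gpt-oss-120b).

```python
# Per-benchmark costs in BRL, in table order: OAB Bench, Magis Bench, Brazilian Laws,
# Agentic capabilities, Brazilian exams, Multi-IF Portuguese, BRACEval.
COSTS_BRL = {
    "sabiazinho-3": [0.50, 0.30, 1.44, 3.88, 0.62, 2.35, 0.33],
    "sabiazinho-4": [0.81, 0.46, 1.97, 7.70, 0.54, 3.71, 0.68],
    "gpt-oss-120b": [1.50, 0.71, 2.86, 20.49, 1.62, 5.11, 0.96],
    "gpt-4.1-mini": [1.20, 0.89, 2.74, 33.75, 2.16, 6.18, 0.67],
    "gpt-5-mini": [5.90, 3.84, 7.36, 47.70, 8.51, 25.87, 2.95],
}

# Recompute each model's total, rounded to centavos.
totals = {model: round(sum(costs), 2) for model, costs in COSTS_BRL.items()}

# Express each total as a multiple of the Sabiazinho-4 total.
baseline = totals["sabiazinho-4"]
relative_cost = {model: total / baseline for model, total in totals.items()}

for model, total in totals.items():
    print(f"{model}: R${total:.2f} ({relative_cost[model]:.2f}x sabiazinho-4)")
```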

Next steps

The launch of Sabiazinho-4 is an important step toward our next model generations. You can find more details on how to use the new model in our documentation: docs.maritaca.ai.