Scaling LLM training at Maritaca AI
How we build each new generation of the Sabiá family: a benchmark-centric development cycle, a scale pyramid that goes from hundreds of small experiments up to the final training run, and the architecture, data and infrastructure choices that sustain this pace.
by Rodrigo Nogueira
The talk above is a video summary of what is described in this post.
Training a language model is not a phase with a beginning, middle and end, but a continuous cycle. Every new generation of the Sabiá family is born from a set of decisions about what the model needs to learn to do better, decisions that come from three sources:
- Customer feedback: pain points identified from real-world use of the model.
- Parity with new models on the market: capabilities that show up in competitors and that we haven’t covered yet.
- Differentiation opportunities: capabilities that no one yet handles well and that can become a real edge for us.
From there, the cycle repeats in three big stages:
Figure 1: the development cycle of our models. Each turn refines the set of capabilities the model needs to deliver.
In the next sections we open each of these boxes.
Step 1: Selecting capabilities
The first box of the cycle is to decide, for the next generation, which capabilities we are going to attack. Here “capability” means the LLM’s ability to solve a class of tasks: writing a legal motion, summarizing a medical record, executing a search workflow, doing analysis of long documents, and so on.
The three sources listed above give us an initial candidate list. From there, the work is prioritization: coverage of the Brazilian market, return for the current customer base, technical feasibility, and competitive differentiation.
Step 2: Building and evaluating benchmarks
For every capability we decide to attack, we build a benchmark before any change to training. At Maritaca, all development is benchmark-centric: we don’t change the model if we can’t measure the capability we are trying to implement.
The reason is simple: unlike traditional software, in machine learning it is hard to know whether a change to the model actually improved its overall behavior. A localized improvement after a prompt tweak, a new tool, or a fine-tuning on a few samples can be genuine for the two or three examples you observed, but that alone doesn’t tell you whether the change broke other capabilities that were working, nor whether the gain holds over a larger set of cases. Without a benchmark broad enough to detect regressions and measure robustness, it is impossible to separate a real improvement from local noise.
Where benchmarks come from
The benchmarks we use come from three main sources. We import public benchmarks into our internal evaluation framework, manually collect exams and selection tests (ENEM, OAB, Revalida, college entrance exams, public-service exams), and we build proprietary benchmarks, often in partnership with domain experts (doctors, lawyers, engineers).
From multiple choice to rubrics
For a long time the standard format was multiple-choice benchmarks: given a question and some options, the model has to pick the correct one. The advantage is being able to measure the knowledge of the model without depending on instruction tuning, the umbrella term for the training stages that teach the model to follow instructions and produce responses in the expected format (in our pipeline, those stages are SFT and RL, described later). For multiple-choice benchmarks a few few-shot examples in the prompt are enough to nudge the model to emit the letter of the correct option, so it is possible to evaluate even models that have only gone through pre-training.
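As an illustration, here is a minimal sketch of how this few-shot multiple-choice scoring can work. The prompt template is illustrative, and `generate` stands in for whatever inference backend is used (an API call or a local model), not our actual harness:

```python
# Minimal sketch of few-shot multiple-choice scoring. The prompt template is
# illustrative and `generate` is a stand-in for any inference backend.

def build_prompt(few_shot: list[dict], question: dict) -> str:
    """Concatenate solved examples so that even a pretrained-only model
    learns, in context, to answer with the letter of the correct option."""
    blocks = []
    for ex in few_shot:
        opts = "\n".join(f"{k}) {v}" for k, v in ex["options"].items())
        blocks.append(f"Question: {ex['question']}\n{opts}\nAnswer: {ex['answer']}")
    opts = "\n".join(f"{k}) {v}" for k, v in question["options"].items())
    blocks.append(f"Question: {question['question']}\n{opts}\nAnswer:")
    return "\n\n".join(blocks)

def accuracy(benchmark: list[dict], few_shot: list[dict], generate) -> float:
    hits = sum(
        generate(build_prompt(few_shot, q), max_tokens=1).strip().upper() == q["answer"]
        for q in benchmark
    )
    return hits / len(benchmark)
```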
The problem is that getting the right answer about, say, Brazilian economics or geography shows the model knows the subject, but it doesn't guarantee, for example, that the model can write a good essay about those same topics.
This is where the second generation of benchmarks comes in, and it increasingly dominates our pipeline: evaluations whose expected output is long-form text: essays, legal opinions, medical reports, answers with tables and charts. For long outputs, pairwise comparisons via LLM-as-a-judge (“given response A and response B, which is better?”) run into serious limits: responses mix strengths and weaknesses across different segments, and even a judge with deep subject knowledge has trouble weighing all those aspects simultaneously to pick a single absolute winner.
That’s why we adopted rubrics. Instead of comparing whole responses, we decompose the evaluation into objective items, each one checking a specific aspect of the answer. For a legal motion, a rubric might include items like “correctly cited article X”, “respected the formal structure of an initial petition”, “supported the request with applicable jurisprudence”. The evaluation then works like this:
- The model generates the response for the input.
- An LLM-as-a-judge receives the response and the rubric, with its various items.
- Item by item, the judge scores whether the response satisfies that criterion.
- The final score is the sum (or the proportion) of satisfied items.
Figure 2: comparison between the two evaluation approaches for long-form text. On the left, pairwise evaluation forces the judge to pick an absolute winner between two whole responses, even when they have strengths and weaknesses mixed across them. On the right, rubric-based evaluation decomposes the judgment into objective items: each aspect is scored separately, regressions can be localized, and comparison across runs is anchored to the same scale. Rubric cost also scales linearly in the number of responses, while pairwise requires N(N−1)/2 comparisons.
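In code, the rubric loop is short. Below is a minimal sketch, assuming a `judge` callable that wraps the judge model (or an ensemble of them); the rubric reuses the legal-motion items above, and the yes/no prompt is illustrative:

```python
# Minimal sketch of rubric-based evaluation with an LLM-as-a-judge.
# `judge` stands in for a call to a strong judge model (or an ensemble).

RUBRIC = [
    "correctly cited article X",
    "respected the formal structure of an initial petition",
    "supported the request with applicable jurisprudence",
]

def rubric_score(response: str, rubric: list[str], judge) -> float:
    satisfied = 0
    for item in rubric:  # item by item, as in Figure 2
        verdict = judge(
            f"Response:\n{response}\n\n"
            f"Criterion: {item}\n"
            "Does the response satisfy this criterion? Answer YES or NO."
        )
        satisfied += verdict.strip().upper().startswith("YES")
    return satisfied / len(rubric)  # proportion of satisfied items
```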
Compared to multiple choice, the rubric approach has two drawbacks: it only works on models that have already gone through instruction tuning, and it tends to be expensive, because the judge is usually a strong model, or a combination of several to reduce variance.
Rubric benchmarks can also be extended with tools, both on the generator’s side and on the judge’s. For example, if the task is to draft a legal document that cites a statute, the generator can use a legislation lookup to ground the text, and the judge can use the same lookup to verify whether the citations are real.
Evaluating before training
Once the benchmark is defined, we evaluate the current version of our model and competitors of comparable size/price. In general, one of two things happens:
- The anecdotal suspicion turns into a systematic problem: the model really does fail on that class of tasks, and the capability needs to be added to one or more training stages.
- The model already handles the task reasonably well: in that case we invest in more data and benchmark variations to raise the ceiling and open competitive headroom.
Benchmarks we publish
Some of our benchmarks remain restricted for confidentiality, licensing or competitive reasons, but whenever it’s possible to publish, we open them up so the community can reproduce and compare results:
- HealthBench-BR and PDCT-QA, Brazilian clinical protocols: github.com/hugoabonizio/clinical-protocols-br
- OAB-Bench, Brazilian bar exam questions: dl.acm.org/doi/10.1145/3769126.3769227
- Magis-Bench, drafting of civil and criminal sentences: github.com/maritaca-ai/magis-bench
- PROSA, rubric-based evaluation of LLMs on real Brazilian-Portuguese user chats: arxiv.org/pdf/2605.01630
- CAPITU, Brazilian-literature comprehension: arxiv.org/abs/2603.22576
- MARCA, multilingual web search evaluated with checklists: arxiv.org/pdf/2604.14448
- Poeta v2, broad Brazilian-Portuguese evaluation: ieeexplore.ieee.org/abstract/document/11303664
- TIEBE, Brazilian events: arxiv.org/abs/2501.07482
- ENEM + vision, image-based questions: arxiv.org/abs/2311.14169
- ENEM, text-only questions: arxiv.org/abs/2303.17003
- BLUEX, college entrance exams: link.springer.com/chapter/10.1007/978-3-031-45368-7_22
Step 3: Data collection and training
With the missing capabilities defined and the benchmarks in place, we move to what consumes the most time and resources: producing the data and running training. This work splits into three stages: pre-training, SFT (supervised fine-tuning) and reinforcement learning. Each one has distinct goals and characteristics.
Figure 3: the three training stages, with their goals and data types.
Pre-training
An important note before going further: what we call pre-training in our pipeline is, in practice, continued pre-training. Instead of starting from random weights, we begin from an open-weights checkpoint (a model whose pre-training from scratch was already done by another group) and continue training on our own data. The reason isn't just economic. Training from scratch at frontier scale costs tens to hundreds of millions of dollars and ties up a large cluster for months, which alone is reason enough not to do it when there are solid public checkpoints to start from. But there's also the technical side: you need to secure a cluster with thousands of GPUs, curate a corpus of tens of trillions of tokens, and master stable training from randomly initialized weights, three things only a handful of groups in the world handle well today.
In this continued pre-training, the model refines the language structure it already has and broadens its world knowledge in the domains that matter most for the generation we are building. The scale of the additional data is typically tens to hundreds of billions of tokens, with high diversity of domains and sources.
Here data quality is heterogeneous by design. We keep a mix of high-, medium-, and even low-quality data, since even an incomplete or partially incorrect document can contain information that doesn’t appear anywhere else in the corpus. What matters is that the mix, in aggregate, covers each capability we want to instill.
To steer that mix, we use classifier models that estimate, for each document, how much it is likely to contribute to a specific capability. These classifiers are on the order of a few hundred million parameters when fine-tuned, or a couple of billion when run in few-shot mode. They are small enough to run over the entire pre-training corpus in reasonable time.
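As a sketch of what that filtering can look like in practice (the model name and the `relevant` label are hypothetical placeholders for a fine-tuned classifier):

```python
# Sketch of scoring a corpus with a small capability classifier. The model
# name and the "relevant" label are hypothetical placeholders.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="maritaca-internal/legal-relevance-classifier",  # hypothetical
)

def select_documents(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep documents the classifier deems likely to contribute to the
    target capability; the threshold steers the final data mix."""
    preds = classifier(docs, truncation=True, batch_size=64)
    return [doc for doc, pred in zip(docs, preds)
            if pred["label"] == "relevant" and pred["score"] >= threshold]
```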
To ensure broad coverage of the capabilities we chose to attack, the web crawl is not generic: we use focused crawl, in which specialized classifiers, starting from a set of seed pages, decide which links to follow and which to ignore based on the probability of leading to pages relevant to the capability being developed.
Figure 4: focused crawl. A small classifier (BERT-like or a small LM, ~100M–1B parameters) selects, from the seed pool, only the pages relevant to the capability being developed. From the selected seeds, we expand the links and reapply the same classifier at each new set of pages: relevant ones (✓) keep expanding, irrelevant ones (✗) stop right there.
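A minimal sketch of the loop in Figure 4, with `fetch`, `extract_links` and `is_relevant` as stand-ins for the page fetcher and the small classifier:

```python
# Sketch of the focused-crawl loop: expand links only from pages the
# classifier deems relevant.
from collections import deque

def focused_crawl(seeds: list[str], fetch, extract_links, is_relevant,
                  max_pages: int = 100_000) -> list[str]:
    frontier, visited, corpus = deque(seeds), set(seeds), []
    while frontier and len(corpus) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None or not is_relevant(page):
            continue  # irrelevant pages (✗) stop the expansion right there
        corpus.append(page)  # relevant pages (✓) enter the corpus...
        for link in extract_links(page):
            if link not in visited:
                visited.add(link)
                frontier.append(link)  # ...and keep expanding
    return corpus
```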
Rewrites: organizing information to make learning easier
A technique that has been gaining ground in our pre-training is rewriting documents before using them for training. The intuition is that the order in which information appears in the text has a direct effect on how much the model can learn from that document.
Language models are causal: they predict the next token left to right, with no ability to go back and correct a prediction. When a document presents a result up front (for example, “the answer is 40 meters”) and only later explains the methodology, the model, when trying to predict the result tokens, doesn’t yet have access to the reasoning. The learning signal in that segment is weak: it can only learn to “guess” the result from the question, without the path that led to it. A document that introduces the problem, then builds the reasoning at increasing levels of detail, and ends in the result gives the model, at every token, rich context for the next prediction. The learning signal is much stronger.
Figure 5: an original document where the answer comes before the reasoning gives the causal model little context to predict it; rewriting the document so the reasoning comes first produces a much stronger learning signal.
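A sketch of what this rewriting pass can look like; the prompt and the length-based sanity check are illustrative, not our production pipeline:

```python
# Sketch of the rewriting pass. `rewriter` stands in for the relatively
# small generator model; the prompt is illustrative.

REWRITE_PROMPT = (
    "Rewrite the document below so the information is easy to learn from "
    "left to right: introduce the problem, build the reasoning step by "
    "step, and only then state the result. Preserve all facts and do not "
    "add new information.\n\nDocument:\n{document}"
)

def rewrite_corpus(documents: list[str], rewriter) -> list[str]:
    rewritten = []
    for doc in documents:
        out = rewriter(REWRITE_PROMPT.format(document=doc))
        # Cheap sanity check: fall back to the original if the rewrite
        # lost too much content.
        rewritten.append(out if len(out) > 0.5 * len(doc) else doc)
    return rewritten
```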
It’s worth noting the cost: rewriting requires running hundreds of billions of tokens through a generator model, and therefore consumes a sizable fraction of the compute allocated to pre-training. The upside is that this compute is spent once, and the rewritten documents are reused across the multiple pre-training runs we do during the development of each generation. Given the volume of documents to process, the rewriter model we use is relatively small: using a larger model would be prohibitively expensive at this scale. It is still an open question in the literature whether rewrites by stronger models, or multi-document aggregation strategies, would yield significant gains over what we already get with a smaller rewriter.
Published research on this line
Some of the papers we have published on our techniques for selecting and curating pre-training data, co-authored with master’s and PhD students affiliated with Maritaca:
- Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora, by Thales Sales Almeida, Rodrigo Nogueira, Helio Pedrini. arxiv.org/abs/2509.08824
- Juru: Legal Brazilian Large Language Model from Reputable Sources, by Roseval Malaquias Junior, Ramon Pires, Roseli A. F. Romero, Rodrigo Nogueira. link.springer.com/chapter/10.1007/978-3-032-15984-7_9
- Curió-Edu 7B: Examining Data Selection Impacts in LLM Continued Pretraining, by Thales Sales Almeida, Rodrigo Nogueira, Hélio Pedrini. arxiv.org/abs/2512.12770
- Comparing Knowledge Injection Methods for LLMs in a Low-Resource Regime, by Hugo Abonizio, Thales Almeida, Roberto Lotufo, Rodrigo Nogueira. arxiv.org/abs/2508.06178
- The Interplay Between Domain Specialization and Model Size, by Roseval Malaquias Junior, Ramon Pires, Thales Sales Almeida, Kenzo Sakiyama, Roseli A. F. Romero, Rodrigo Nogueira. arxiv.org/abs/2501.02068
- Measuring Cross-lingual Transfer in Bytes, by Leandro De Souza, Thales Almeida, Roberto Lotufo, Rodrigo Frassetto Nogueira (NAACL 2024). aclanthology.org/2024.naacl-long.418
- Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining, by Thales Sales Almeida, Rodrigo Nogueira, Hélio Pedrini. arxiv.org/abs/2603.24826
SFT (supervised fine-tuning)
The second stage is supervised fine-tuning. Here the logic flips: the data volume is much smaller, and the dominant criterion becomes the individual quality of each example.
For each capability we want to instill in the model (for example, “do a thorough analysis of multiple documents attached to a conversation”), we build a set of very high-quality responses. These responses come from two main sources:
- Ensembles of strong models: we generate multiple responses with different models, identify the segments where they agree, and keep only the parts considered robust (see the sketch below).
- Specialized human annotation: for domains that require technical judgment (medicine, law, academic writing), we hire professionals who write or revise the responses.
Everything goes through automatic filters and manual review before entering the dataset. The new capability we want in the model is, by the end, concretely represented by this set of examples.
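For the ensemble source above, here is a minimal sketch of an agreement filter; it uses naive sentence splitting and character-level similarity as self-contained stand-ins for the stronger segmentation and semantic matching a real pipeline would use:

```python
# Sketch of the ensemble-agreement filter: keep only sentences that appear,
# near-verbatim, in a majority of the models' responses. difflib is a
# self-contained stand-in for stronger semantic matching.
import difflib

def agreed_segments(responses: list[str], threshold: float = 0.85) -> list[str]:
    split = [r.split(". ") for r in responses]  # naive sentence splitting
    robust = []
    for sent in split[0]:
        votes = 1  # the first model's own vote
        for other in split[1:]:
            best = max((difflib.SequenceMatcher(None, sent, s).ratio()
                        for s in other), default=0.0)
            votes += best >= threshold
        if votes > len(responses) / 2:  # majority of the ensemble agrees
            robust.append(sent)
    return robust
```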
Reinforcement learning
The last stage is reinforcement-learning training, and it has an important advantage: you don’t need to spell out for the model what an ideal answer looks like. Instead, you define a reward function that scores each generated response, and the model learns, over the course of training, to produce responses with higher scores.
It’s worth highlighting one difference with what shows up most often in the LLM-RL literature: a large share of recent work focuses on domains like math and code, where the reward is programmatically verifiable. For code, you just run against a test suite; for math, you compare the final answer to the gold answer. The reward signal is exact, cheap and free of ambiguity, which simplifies training a lot. The tasks we invest in most at Maritaca, however, are precisely the ones that fall outside that regime: legal motions, financial recommendations, study plans, exam-result analyses, academic writing. There is no programmatic verifier that can say “this motion is correct” or “this medical analysis is complete”. So in most of our RL trainings, the reward function is exactly the same mechanism described in Step 2: an LLM-as-a-judge applying the task’s rubric, item by item.
Modern techniques like GRPO push this further: for each input, the model generates several responses and the reward of each one is compared to the group’s mean. The difference between a response’s reward and that mean is what we call advantage. The larger the advantage, the more the weights are updated to favor responses similar to that one.
Figure 6: GRPO in detail. For each input, the model samples several responses, and the LLM-as-judge scores each one using the task’s rubric. A response’s advantage is its score minus the group mean. Responses with positive advantage are reinforced, those with negative advantage are suppressed, and those near zero barely move the weights.
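The advantage computation itself is only a few lines. A minimal sketch of the mean-centering described above (some GRPO variants also normalize by the group's standard deviation):

```python
# Sketch of the GRPO advantage: each sampled response's reward (here, its
# rubric score from the judge) is centered on the group mean.
import numpy as np

def grpo_advantages(rewards: list[float], normalize_std: bool = False) -> np.ndarray:
    r = np.asarray(rewards, dtype=np.float32)
    adv = r - r.mean()  # positive: reinforce; near zero: barely moves weights
    if normalize_std:   # variant used by some GRPO implementations
        adv = adv / (r.std() + 1e-6)
    return adv

# Four sampled responses scored by the rubric judge:
print(grpo_advantages([0.9, 0.6, 0.6, 0.3]))  # ≈ [0.3, 0.0, 0.0, -0.3]
```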
There is a subtle point about what each stage actually does in the model. Pre-training is what introduces broad knowledge, that is, defines the repertoire of outputs the model is capable of generating. SFT aligns that repertoire to the formats and response patterns expected. And RL acts mainly by reinforcing responses the model already manages to produce with some probability, shifting probability mass toward the best ones.
Whether the model is capable of discovering genuinely new capabilities through sampling is still an open question in the literature, but in practice, waiting for discovery via exploration is a slow and unreliable process when pre-training and SFT haven't covered the skill we want to develop. If the model samples ten responses for the same input and they're all alike, none stands out in reward, the advantage is near zero, the weights barely move, and the algorithm moves on to the next example. That's why RL is always applied after pre-training, and in most pipelines also after SFT, although there are notable exceptions, like DeepSeek-R1-Zero, which goes from pre-training straight to RL without SFT, and R1 itself, which alternates SFT and RL cycles. What these pipelines have in common is the requirement of a model already with enough capability to produce useful variation across samples.
Quantization for serving
Once training is done, the model still needs to be prepared for production. To serve users with acceptable latency and cost, it has to fit in inference hardware and run economically. For that reason, after training we quantize the weights: from the precision used in training, which can be BF16 or FP8 with quantization-aware training (QAT), we reduce to 8 or even 4 bits at serving time.
The rule is simple: the more aggressive the precision reduction, the more training work is needed to preserve quality. Reducing to 8 bits is still typically viable with calibration based on activation statistics, and the degradation tends to be small. Reducing to 4 bits is another tier: each weight has only 16 possible values, and that is usually not enough to represent the network’s weight distribution well without compromising important model capabilities. So at 4 bits, modern techniques combine calibration with quantization-aware training, where training itself simulates the quantization effect so the model learns weights that tolerate the precision loss. Quantization stops being a cheap post-training step and becomes more like another training stage. Concretely, in the 4-bit regime:
- The data used matters. Different pipelines combine QAT during training, post-training calibration with a smaller set, or both. There are also formats like NVIDIA’s NVFP4, where pre-training happens directly in 4 bits and skips a separate calibration phase. In all of these paths, the data (whether the training data or the calibration data) needs to reflect the model’s real-world usage distributions: languages, domains, prompt formats, typical lengths. Random text sampling is not enough.
- Precisions are mixed across different parts of the network. Layers more sensitive to numerical loss (attention, embeddings, normalizations) tend to stay in higher precision (8 bits or the original precision). More redundant parts, in particular the experts in MoE models (where routing already dilutes the impact of any individual expert), tolerate lower precision. The specific combinations vary by team and workflow, but the common principle is to allocate precision in proportion to each component’s sensitivity.
- Validation uses the full benchmark suite. Quantization can look harmless on aggregate metrics and still degrade specific capabilities like function calling, math reasoning, and long context. The only way to catch this is to run the entire evaluation, which loops back to Step 2 of the cycle.
Figure 7: mixed precision per transformer block component. Embeddings, normalizations and the MoE router stay at higher precision (BF16) because they're the most sensitive to numerical error. Attention and the output head drop to 8 bits with calibration. The MoE experts, which hold most of the model's parameters, go down to 4 bits: the intrinsic redundancy of expert routing dilutes the impact of the reduction, and quantizing just them already cuts most of the inference memory.
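To make the 16-values intuition concrete, here is a minimal sketch of symmetric round-to-nearest quantization with an absmax scale, the simplest possible scheme; production pipelines use per-channel or per-group scales, activation-based calibration data, and QAT on top of this basic idea:

```python
# Minimal sketch of symmetric round-to-nearest weight quantization with an
# absmax scale. Production pipelines use per-channel/per-group scales,
# activation-based calibration and QAT on top of this basic idea.
import numpy as np

def quantize(w: np.ndarray, bits: int):
    levels = 2 ** (bits - 1) - 1              # 127 for int8, 7 for int4
    scale = np.abs(w).max() / levels
    q = np.clip(np.round(w / scale), -levels - 1, levels)
    return q.astype(np.int8), scale           # int4 codes stored in int8 here

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = (np.random.randn(4096) * 0.02).astype(np.float32)
for bits in (8, 4):
    q, scale = quantize(w, bits)
    err = np.abs(dequantize(q, scale) - w).mean()
    print(f"{bits}-bit mean abs error: {err:.2e}")  # ~18x larger at 4 bits (7 vs 127 levels)
```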
Architecture: dense vs Mixture of Experts
Another important choice at training time is the model architecture. In recent generations, the dominant frontier choice has been Mixture of Experts (MoE), a path we also follow at Maritaca.
Dense models (every parameter is activated for every token) still dominate in some areas, especially in pure reasoning tasks, where the total capacity of the model matters more than its per-token efficiency. Dense models also reach an MFU (Model FLOPs Utilization, an indicator of computational efficiency) of 50–60% in training.
Sparse MoE models, on the other hand, are typically at MFU of 20–40% when only the active parameters are counted. In return, they win an important advantage in production: since only a fraction of the network is activated per token, you can have a much larger total parameter count (and therefore more stored knowledge) while paying an inference cost equivalent to that of a much smaller dense model. For the end user, this means large-model quality at the price of a small model.
But MoE comes with its own complications. The main one is expert load balancing: since each token only routes to a subset of experts (and the routing is decided by the network itself during inference), in practice some experts get overloaded while others sit idle. This reduces effective hardware utilization, even when the nominal fraction of activated parameters per token is small. In training, we mitigate this with auxiliary loss functions that penalize imbalance; in serving, with dynamic token dispatch across machines and batching that’s aware of expert distribution.
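A minimal sketch of one such auxiliary loss, in the style popularized by Switch Transformer (not necessarily the exact form we use): it penalizes the product of each expert's dispatched-token fraction and its mean router probability, and reaches its minimum of 1.0 when routing is perfectly balanced.

```python
# Sketch of a Switch-Transformer-style load-balancing auxiliary loss.
import numpy as np

def load_balancing_loss(router_probs: np.ndarray, top1: np.ndarray) -> float:
    """router_probs: [tokens, experts] softmax outputs of the router.
    top1: [tokens] integer index of the expert each token was routed to."""
    n_tokens, n_experts = router_probs.shape
    f = np.bincount(top1, minlength=n_experts) / n_tokens  # token share per expert
    p = router_probs.mean(axis=0)                          # mean router prob per expert
    return float(n_experts * np.sum(f * p))

probs = np.full((1024, 4), 0.25)         # uniform router over 4 experts
top1 = np.arange(1024) % 4               # perfectly balanced dispatch
print(load_balancing_loss(probs, top1))  # 1.0, the minimum; collapse -> n_experts
```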
Another area where MoE finds a natural fit with modern infrastructure is prefill–decode disaggregation: a technique that splits the two phases of inference, which have radically different compute profiles, across different machines (or groups of GPUs):
- Prefill: processing the user’s prompt at once. It is compute-bound: the bottleneck is the available FLOPs, and the operation is highly parallelizable (all prompt tokens enter at the same time).
- Decode: generating one new token at a time, autoregressively. It is memory-bandwidth-bound: the bottleneck becomes how fast the hardware can read the model weights and the attention cache for each new token.
Instead of running both phases on the same machine (forcing a hardware compromise that serves neither well), they get split: machines with a “prefill” profile are optimized for FLOPs and large batches; “decode” machines are optimized for memory bandwidth and cross-batching of many concurrent users. In MoE, this separation is especially valuable because the all-to-all routing between experts has opposite dynamics in the two phases. In prefill, the all-to-all is throughput-oriented: the batch is homogeneous, expert load distribution is more predictable, and you can amortize latency with large batches and static scheduling. In decode, the all-to-all becomes latency-oriented: each user’s token must be dispatched and a result returned in a short time, and the distribution across experts is less balanced because there are fewer tokens to dilute routing fluctuations. Tuning communication, parallelism and batching for each of these regimes separately is much simpler than trying to accommodate both on the same node.
The practical consequence is simple: every frontier model that needs to be served at scale today goes with MoE. The difference in inference cost per unit of delivered quality is too large to ignore.
Infrastructure: TPUs and GPUs
Our training stack mixes different accelerators across different stages: we use mostly Google TPUs for pre-training and SFT, and GPUs for reinforcement learning (with mixed use in some trainings). Behind that choice is a combination of technical, commercial and operational factors that’s worth detailing.
On the purely technical side, TPUs are notoriously harder to operate. The active community is smaller and the frameworks tend to lag behind architectural novelties. One example is MaxText, the reference repository for TPU training, which still doesn't adequately optimize several recent open-source models with hybrid attention mechanisms. GPUs, with their huge community and frameworks like Megatron-LM, NeMo and DeepSpeed under active development by hundreds of companies, usually get the stack updated within days or weeks of a new model launch. TPUs also tend to deliver proportionally fewer FLOPs per dollar than GPUs at "neoclouds" like Verda, Lambda and CoreWeave.
So why do we prefer TPUs in our flow? The answer is not in the final-version training taken in isolation: if we were running only the big training, a dedicated GPU contract would probably be the best option, and reserving a cluster for the months of training would suffice. The thing is, the final training is just the tip of the iceberg. Most of the effort (and budget, as we’ll show below) sits in hundreds of small-scale experiments and dozens of medium-scale ones run before it. It’s that scale pyramid that flips the equation in TPUs’ favor.
How we scale trainings: from 3B to 1T
Frontier-model trainings are not deterministic exercises. What works in a 7B model rarely works with the same hyperparameters in a 70B model, and what works at 70B doesn’t trivially scale to 1T. So our methodology is a pyramid:
Figure 8: the scale pyramid. Every new generation of Sabiá starts with hundreds of small-scale trainings, refined through ten or so at the medium scale, and ends with one or two runs at the largest model.
- Small models (3 to 30B parameters): we do, on average, ~500 trainings per generation. This is where we test hypotheses: new data mixes, new rewrites, new SFT setups, new RL reward functions, new hyperparameter combinations. Each training is relatively cheap and can be completed in hours or a few days, which makes this volume of experimentation possible. It’s worth noting that, at this scale, a large share of the cost is not in the training itself but in evaluation: since most of our benchmarks ask for long-form generation, running the full battery takes time. On top of that, the LLM-as-judge cost adds up, whether served locally on GPU, via an external API, or via our own Sabiá models (which also consumes capacity that could be serving customers in production).
- Medium models (30 to 300B parameters): once the ideas have been validated, we move to the medium scale and run about a dozen trainings. But why still a dozen? Wouldn’t a single training be enough, now that we know what works at smaller scale? For two reasons. First, several effects only emerge at this scale: the learning rate and batch size combination that worked at 3B–30B, even when scaled proportionally per literature best practices, still requires trial-and-error corrections. Second, the open-weights model we use as the base in the 3B–30B range isn’t always the same one in the 30B–300B range: sometimes there are small differences in architecture, tokenizer, or even a completely different model family. This adds uncertainty and requires more experiments to recalibrate the recipe to the new starting point.
- Final model (~1T parameters): only after all of the above do we run one or two trainings at the size of the largest version. Even then, the big-model training is, in part, a shot in the dark: not every open-weights model is available at every scale. Qwen3.5, for example, ranges from 4B up to 397B, but there is no public 1T+ checkpoint, which leaves us without an external calibration point exactly in the size range we care about most. DeepSeek V4 illustrates the opposite: it was released in 298B and 1.6T parameter versions, but without smaller versions that allow fast iteration; starting directly from 298B to test a new recipe is too expensive to do 500 trainings. In both cases, when scaling to 1T, we lose much of the external reference we’d use to calibrate, and we depend almost entirely on what we learned at the smaller scales we can assemble internally from other open-weights models.
This pattern has a direct practical consequence for the infrastructure discussion: the vast majority of our trainings are at small scale. If the training stack doesn’t allow fast cycles of “spin up cluster, test, shut down, repeat”, those ~500 small trainings become the bottleneck of the entire project. That’s why provisioning flexibility matters much more to us than raw price per FLOP.
Even so, the final training at ~1T scale remains a very-high-cost event. It’s worth doing the math.
What it costs to train a 1T-parameter model on GPU
Consider this exercise: pre-train a MoE model with 1.6 trillion total parameters and ~49 billion active parameters per token, on 500 billion tokens with a sequence length of 8k. That is equivalent, in order of magnitude, to recent open models like DeepSeek-V4-Pro (1.6T total / 49B active).
The rule of thumb for the number of floating-point operations (FLOPs) needed in pre-training is 6 × parameters × tokens. For a MoE, what counts is the number of active parameters per token (not the total), because that’s the fraction of the network participating in each forward/backward:
6 × 4.9 × 10¹⁰ × 5 × 10¹¹ ≈ 1.47 × 10²³ FLOPs
NVIDIA’s B200, in the SXM version, delivers around 2.25 PFLOPs/s in BF16 dense per GPU. In real training, what’s actually used (the Model FLOPs Utilization, MFU) is typically around 40% in well-optimized regimes. So each GPU produces about ~0.9 PFLOPs/s of useful work.
Dividing, we get something like ~45 thousand GPU-hours to finish training.
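Spelled out in code, with the same assumptions:

```python
# The GPU-hour arithmetic above, spelled out with the post's assumptions.
active_params = 49e9                       # MoE active parameters per token
tokens = 500e9
flops_needed = 6 * active_params * tokens  # ≈ 1.47e23 FLOPs

peak_flops = 2.25e15                       # B200 SXM, BF16 dense, per GPU
mfu = 0.40                                 # typical well-optimized training
useful_flops = peak_flops * mfu            # ≈ 0.9e15 FLOPs/s of useful work

gpu_hours = flops_needed / useful_flops / 3600
print(f"{gpu_hours:,.0f} GPU-hours")       # ≈ 45,370
```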
But compute isn’t the only constraint that defines the minimum cluster size. Memory also imposes a limit, and here what matters is the total number of parameters, not the active ones: during training, all 1.6T must remain accessible across the chip mesh, because all experts get updated. In mixed-precision training (BF16 + Adam optimizer in FP32), the per-parameter memory footprint is:
- BF16 weights: 2 bytes
- BF16 gradients: 2 bytes
- Adam optimizer (FP32 momentum + FP32 variance + FP32 master weights): 12 bytes
- Total: 16 bytes per parameter
For 1.6T parameters: 1.6 × 10¹² × 16 bytes ≈ 25.6 TB just for state, not counting activations, communication buffers and gradient accumulation. For activations, with activation checkpointing it’s enough to save the input to each transformer block (the rest is recomputed in the backward pass), which in BF16 is 2 bytes per token per layer, that is:
activations ≈ layers × tokens_in_batch × hidden × 2 bytes
For a model of this scale (~61 layers, hidden ~7k) with sequence length of 8k and a global batch of ~4M tokens (a typical range in frontier pre-training), activations add up to 61 × 4 × 10⁶ × 7168 × 2 ≈ 3.5 TB; with larger batches (8–15M tokens), 7 to 13 TB. Taking ~5–10 TB as a representative range, the state + activations total lands around ~30–35 TB. Each B200 has 192 GB of HBM, so roughly ~135 B200s are needed just for state to fit, or ~150–200 B200s when including activations at sequence length of 8k.
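And the memory side of the math, ignoring communication buffers and gradient accumulation (the 4M-token batch is the low end of the range above):

```python
# The memory arithmetic above: training state plus checkpointed activations.
total_params = 1.6e12                        # all experts are updated
state = total_params * 16                    # 16 bytes/param -> ≈ 25.6 TB

layers, hidden, batch_tokens = 61, 7168, 4e6
activations = layers * batch_tokens * hidden * 2   # BF16 -> ≈ 3.5 TB

hbm = 192e9                                  # per B200
print(f"state only: ~{state / hbm:.0f} GPUs")                   # ≈ 133
print(f"with activations: ~{(state + activations) / hbm:.0f}")  # ≈ 152; larger batches push this up
```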
Combining the two requirements (compute for the desired deadline and memory for the model to fit), the effective minimum number of B200s for each schedule is:
| Deadline | Min. by compute | Min. by memory | Effective min. | Neocloud cost (USD 4.4/h) | Big cloud cost (USD 12/h) |
|---|---|---|---|---|---|
| 30 days | ~63 | ~150 | ~150 | ≈$0.2 M | ≈$0.55 M |
| 14 days | ~135 | ~150 | ~150 | ≈$0.2 M | ≈$0.55 M |
| 7 days | ~270 | ~150 | ~270 | ≈$0.2 M | ≈$0.55 M |
Table 1: GPU-hours and total cost to pre-train a 1.6T-total / 49B-active MoE on 500B tokens, considering ~45 thousand GPU-hours (BF16, 40% MFU on active parameters). For long deadlines, memory is the bottleneck: at 30 days, compute would ask for 63 GPUs, but memory forces 150, and the training ends up finishing in ~13 days instead of 30. Total cost is roughly the same across deadlines because it is a function of GPU-hours, not cluster size.
In daily cash-flow terms, a cluster of 150 B200s running 24/7 in a neocloud costs about ≈$16K per day. At a big provider like Oracle/Azure/GCP/AWS, the same cluster runs about ≈$43K per day.
The math above is the cost of a single training at the largest scale. As we saw in the pyramid, before this final training we run ~500 trainings on small models and ~10 on medium ones. When everything is added up, the cost of the final round shows up as a relatively small fraction of the generation budget:
| Item | # trainings | GPU-hours / training | Total GPU-hours | Neocloud cost (USD 4.4/h) | % of budget |
|---|---|---|---|---|---|
| Training of small models (3 to 30B) | ~500 | ~300 | ~150K | ≈$0.66 M | ~46% |
| Evaluation of small models (LLM-as-judge included) | ~500 | ~50 | ~25K | ≈$0.11 M | ~8% |
| Medium (30 to 300B), with evaluation | ~10 | ~8K | ~80K | ≈$0.36 M | ~25% |
| Larger version (~1T), with evaluation | 1–2 | ~45K | ~67K | ≈$0.30 M | ~21% |
| Generation total | | | ~322K | ≈$1.4 M | 100% |
Table 2: aggregate budget estimate for an entire generation. Small-scale evaluations show up as a separate row because they run ~500 cycles with mostly long-form generation benchmarks and LLM-as-judge; they don’t dominate the budget, but they’re substantial enough to be visible. At medium and large scale, evaluation has a proportionally much smaller cost (few trainings) and is folded into each row. The largest training, despite being the most visible and stressful event (and depending on a larger short-term cluster), accounts for only ~21% of the budget. Most of it goes to the small-scale experimentation cycle: training + evaluation of small models add up to ~54% of the total cost. That is why provisioning flexibility and the per-hour cost of small jobs weigh much more for us than the chip peak of the final training.
Large-scale availability
The numbers above already show the first problem: at big cloud providers, the cost of a single pre-training run easily reaches $0.55M, almost three times the neocloud price. And when it comes to models trained on more tokens (DeepSeek-V4-Pro was trained on 32 trillion, ~64 times more than our example), the numbers grow proportionally. But the problem is not only price. It is availability.
In neoclouds, renting a few hundred GPUs for a month means asking the provider to dedicate a significant fraction of cluster compute to a single customer for a short period. That pushes other customers away: when the base notices that there are no GPUs available for days in a row, trust in the service drops. These providers naturally prefer long contracts (1+ year) with multiple medium-sized customers in parallel. Short, large rentals don’t fit their business model.
Until recently (and as far as we know), no GPU cloud allowed renting a few hundred chips by the hour on demand. Today some options exist, but actual availability is irregular: you may show up when you need capacity and find that, say, only 50 chips are free in the cluster at that moment.
The learning curve of training on 100+ machines
Now suppose you got the cluster, rented for 1 or 2 months. The clock starts ticking. And then comes the second problem: GPU training frameworks aren’t optimized for this scale.
Most of the community using Megatron, NeMo and similar tools operates at much smaller scales: tens of chips, not hundreds. Many problems only show up when you go from 10 to 100 machines. Here are some examples we’ve lived through:
- Automatic downloads: installing Docker or pip packages in parallel on 100 nodes makes the download service see the operation as a DDoS attack and start refusing connections.
- Coordination servers: the node coordinating the workers handles 10 connections without trouble; with 100 simultaneous connections, it becomes a bottleneck and training stalls waiting for everyone to sync.
- Monitoring: each worker sending metrics to a telemetry service? With 100 workers, that service may rate-limit or refuse.
- Rare bugs that become routine: a hardware error with 0.1% chance per GPU/day is negligible at 10 GPUs, but starting from a few hundred, it turns into a failure every few days with high probability.
Another step that consumes a lot of cluster time (and therefore a lot of compute) is tuning the partitioning across the parallelism axes. There are four main axes: FSDP (sharding of weights and optimizer state across chips), tensor parallelism (TP, splitting each layer’s matrices across chips), context parallelism (CP, splitting the long sequence across chips), and, in MoE models, expert parallelism (EP, distributing experts across chips). Each combination of model size, sequence length, and number of available chips requires a different configuration: the optimal TP at 64 GPUs is usually suboptimal at 128, long contexts force introducing CP, which has to be tuned per model size, and MoE adds the experts axis. The difference between a well-tuned configuration and a mediocre one can reach 15 days versus 30 days of training, and the cost of the experiment practically doubles when this tuning is not done well.
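Part of why this tuning is expensive is the sheer size of the search space. Below is a toy sketch that only enumerates the candidate layouts whose axes multiply to the chip count (the per-axis bounds are illustrative); each candidate still has to be profiled on real hardware:

```python
# Toy enumeration of (FSDP, TP, CP, EP) layouts for a given chip count.
from itertools import product

def candidate_layouts(n_chips: int):
    """Yield every power-of-two layout whose axes multiply to n_chips.
    Bounds are illustrative: TP/CP rarely exceed 8, EP rarely exceeds 64."""
    pow2 = lambda top: [2 ** i for i in range(top.bit_length())]
    for tp, cp, ep in product(pow2(8), pow2(8), pow2(64)):
        if n_chips % (tp * cp * ep) == 0:
            yield {"fsdp": n_chips // (tp * cp * ep), "tp": tp, "cp": cp, "ep": ep}

# The optimum at one scale is usually suboptimal at the next:
print(sum(1 for _ in candidate_layouts(64)), "candidate layouts at 64 chips")
print(sum(1 for _ in candidate_layouts(128)), "candidate layouts at 128 chips")
```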
Each of these problems turns into direct time pressure, because while the engineers debug, the rental hours keep counting, and the cluster costs tens of thousands of dollars per day sitting idle.
The commercial advantage of TPUs
Here the TPUs’ advantage shows up: the billing and provisioning model is completely different. On GCP, TPUs are billed by the second of use (with a one-minute minimum), and it is possible to spin up (and shut down) large clusters on demand.
The workflow this enables is distinct: you bring up the cluster with the configuration you want to test, two or three engineers work side by side tuning the run, and if nothing works and the ideas run out, you just shut down the cluster, discuss internally and come back the next day with a new approach.
On GPU, this flow is possible but significantly harder at this scale. Typical contracts span months, with reserved capacity paid up front: the contract clock keeps ticking, and each failed run is time taken away from the trainings that actually matter. On TPU, the cost of a failed attempt is proportional to the time actually used: if something goes wrong and the engineers go to sleep, you just shut down the cluster, and payment simply stops while no one is using it.
On top of that, there’s an infrastructure robustness that’s hard to match on GPU at these sizes:
- Consistent interconnect: TPU pods are designed from the factory for the distributed parallelism these trainings require, with predictable bandwidth and latency between all chips. The best neoclouds have reached comparable levels, and offering well-interconnected on-demand clusters is becoming industry standard. But there’s still variability across providers, and there’s a real risk of landing in a cluster whose interconnect doesn’t meet the load.
- Network and CPU sized for the load: storage, host CPU and networking around the accelerator are designed for the large-scale training regime, without the bottlenecks that show up in less specialized providers.
| Aspect | GPU (B200, neocloud) | TPU (v5/v6) |
|---|---|---|
| FLOPs per dollar | Better | Worse (proportionally) |
| Community / frameworks | Megatron, NeMo, DeepSpeed (active) | MaxText (small, lagging) |
| On-demand large-scale rental | Hard; irregular capacity | Easy; per-second billing |
| Cost of “fail and restart” | High (long contract, reserved capacity) | Low (spin up, test, shut down) |
| Decoding optimization | Excellent | Good |
| Pre-sized interconnect | Varies by provider | Guaranteed in pod |
Table 3: GPU vs TPU for large-scale training. The choice is not universal: it depends on the training stage and the usage profile.
Preemption and checkpoints
Training an LLM at large scale isn’t only about how many chips you have. It is also about how interruptible your pipeline is. On GCP, preemptible TPUs cost half of on-demand ones, and at Maritaca all our pre-trainings run on preemptible TPU by default. The catch is obvious: at any moment GCP can reclaim the machines and hand them to another customer.
For this to be viable without losing days of progress on each interruption, the training framework needs to resume reliably from checkpoints. When the TPU dies, the next run has to pick up exactly the state saved at the last checkpoint and continue from there. That sounds obvious, but it includes much more than the weights: the optimizer state, the dataloader RNG (so it doesn’t revisit or skip batches), the position on the learning-rate schedule, and the exact location in the dataset shuffle.
That leaves an engineering decision: how often to save checkpoints? It is a direct trade-off. Saving too rarely makes each preemption expensive, since the run restarts from the last saved checkpoint and everything after it is lost. Saving too frequently runs into a less obvious bottleneck.
The training frameworks we use are efficient at saving: they off-load the weights and optimizer state from TPU memory to host CPU RAM in seconds, and the TPU goes back to training while the CPU does the rest of the work in the background, transferring the bytes from RAM to remote distributed storage. The TPU → CPU step is fast, given enormous local bandwidth. The CPU → storage step is slow, network-bound. If you trigger a new checkpoint before the previous one has finished its transfer, the new checkpoint’s weights start piling up in RAM on top of the old ones, and on large models that blows up the CPU memory quickly. The practical result is that the minimum frequency between saves is determined by the upload time to storage, not by the off-load time.
Figure 9: the checkpoint save flow, with the width of each band proportional to the actual bandwidth between each pair of components. Weights leave the TPU for the host CPU’s RAM via PCIe (~16 GB/s per TPU on v5p), and from there go to remote distributed storage via the data-center network, whose raw bandwidth is around ~6 GB/s per TPU. In practice the bottleneck tends to be even larger, because the effective upload throughput to a storage bucket depends on the storage backend and rarely reaches the network ceiling. Either way, the second step dominates the save time. For more details about TPUs and these bandwidths, we recommend the Scaling Deep Learning book by the JAX team.
We calibrate the frequency so that the time spent saving (effectively, the time blocked by the CPU → storage transfer when it doesn’t finish before the next save), plus the expected cost of resuming from the last checkpoint when preempted, is minimized. This optimal point depends on the MTBF (mean time between failures) of the TPUs in that region and time window, and on the network bandwidth to storage. It usually lands at one checkpoint every 30–60 minutes.
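A back-of-the-envelope version of that calibration, using the classic Young approximation for the optimal checkpoint interval; the real calculation also accounts for the save overlapping training, as described above:

```python
# Young's approximation: the interval that minimizes checkpoint overhead
# plus expected lost work is sqrt(2 * save_cost * MTBF).
import math

def optimal_interval_minutes(save_cost_min: float, mtbf_hours: float) -> float:
    return math.sqrt(2 * save_cost_min * mtbf_hours * 60)

# E.g., an effective save cost of ~3 minutes (the CPU -> storage upload) and
# an 8-hour MTBF land inside the 30-60 minute range quoted above:
print(f"{optimal_interval_minutes(3, 8):.0f} min")  # ≈ 54
```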
Why not use LoRA to relieve the memory problem?
A natural reaction to these numbers is: why not use LoRA (or other parameter-efficient fine-tuning techniques) to reduce the memory requirement, and therefore the cluster size?
For two reasons.
First: LoRA doesn’t work well in large trainings. In LoRA only a fraction of the weights is updated (a low-rank adaptation matrix replaces the original weight update). This works well when training data is small, say less than ~1 billion tokens. But Maritaca’s pre-trainings operate at the scale of hundreds of billions of tokens, and SFTs at the scale of tens of billions. In this regime, both our internal experience and the literature show that models trained with LoRA absorb less information per token than models trained with full fine-tuning, and that translates into clearly inferior final quality, visible in benchmark results.
Second: LoRA doesn’t save as much compute as intuition suggests. As Thinking Machines shows with the detailed FLOP math, LoRA uses about 2/3 of the operations per step that full fine-tuning uses; that makes each step roughly 30% cheaper, not orders of magnitude cheaper.
The combination of these two factors is what kills the idea: even if the lower per-token absorption weren’t a problem (it is), the ~30% speedup doesn’t justify shrinking the cluster proportionally to the memory savings. In LoRA, the optimizer state vanishes for the base model weights (they are frozen), but the weights themselves still need to sit in memory during the forward pass. Memory drops from ~25 TB to ~3 TB, and the per-memory minimum chip count goes from ~150 B200s to ~20 B200s. But with only 20 GPUs, the training that would take ~13 days in full fine-tuning configuration would take ~10 weeks even with the LoRA speedup. That’s too slow for a competitive iteration cycle.
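The arithmetic, spelled out with the post's numbers:

```python
# Why a memory-sized LoRA cluster is compute-starved: memory shrinks ~8x
# (25.6 TB -> ~3.2 TB of frozen BF16 weights), but compute only drops ~1/3.
full_ft_gpu_hours = 45_000
lora_gpu_hours = full_ft_gpu_hours * 2 / 3   # ~2/3 of the FLOPs per step

full_ft_gpus, lora_gpus = 150, 20            # memory-bound minimum clusters
print(full_ft_gpu_hours / full_ft_gpus / 24) # ≈ 12.5 days of full fine-tuning
print(lora_gpu_hours / lora_gpus / 24)       # ≈ 62.5 days, the ~10 weeks above
```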
The final TPU vs GPU split
For all of the above, at Maritaca, pre-training and SFT are done on TPU. For these stages, what matters is sustained throughput on long trainings, exactly where TPUs shine.
For reinforcement learning, we still use both accelerators depending on the case. GPUs have real advantages here:
- Token generation is where GPUs are most optimized. RL needs to sample many responses from the model at each step (in GRPO, dozens of responses per input), and that inference phase dominates the total training time. Pre-training and SFT don’t have this profile: they only do forward+backward.
- Generation scales without high interconnect. A 1T-parameter model fits comfortably in an 8-B200 machine for inference, and these generator machines can be replicated horizontally without InfiniBand interconnect between them.
- LoRA can be used in RL without notable degradation compared to full fine-tuning. The learning signal per RL example is small (a scalar advantage per sample, not the information density of a pre-training document), and the LoRA reduction in absorbed capacity stops being a limiting factor in this regime. The practical gain is in memory: since only the adaptation matrices have gradient and optimizer state, 1T-parameter models can fit in a single 8-B300 machine for the update step, eliminating the need for cross-machine parallelism in the updater. See Thinking Machines’ analysis for more discussion.
The practical conclusion is a per-stage split. We use TPU for the stages of massive information absorption (pre-training and SFT), and we use GPU, or a combination of the two, for the intensive-generation stage, which is RL.
Hybrid cluster: GPUs shared between development and production
Pre-training and SFT happen on TPU. But RL, pre-training rewrites, batch evaluations and, most importantly, serving the models to customers in production all run on GPU. The natural question is: do you rent separate GPUs for inference and for development?
Our strategy is different. We keep long-duration GPU contracts shared between training/development and production/inference. In time windows where customer use is naturally lower (overnight, weekends), the idle GPUs run rewrites, RL and evaluations. When traffic ramps up, those development processes are preempted: the GPUs go back to serving in production. It’s the same idea as preemptible TPUs, but applied internally: development always yields to the paying customer.
Figure 10: illustrative load curve over 24 hours. Production occupies the business-day peak; at night and on weekends, development (RL, rewrites, evaluations) takes almost the entire cluster. Contracted capacity stays constant; what shifts is the split between the two uses.
The win from this structure is direct: most of the time, GPUs that would be idle are producing training data or running RL. And the same machine becomes inference under load without needing extra provisioning. The cost: the development pipeline has to tolerate exactly the same requirements as preemptible TPUs, that is, frequent checkpoints and reliable resume.
Closing the cycle
Once the model is trained, we go back to Step 1. We reopen the list of capabilities we want to attack in the next generation. Part of it comes from the benchmarks we just ran, identifying where the model still falls short of the state of the art. Another part comes from new functionalities and task classes we decide to tackle: some because the competition just covered them, others because no one yet handles them well and we want to open a differentiation. These two sets together feed the next iteration: more data, new rewrites, an additional SFT, a new reward function. And the cycle starts over.
It’s this discipline of measuring before training and iterating until we surpass the frontier on the benchmark that grounds our philosophy: each new version of Sabiá is the result of hundreds of small turns of this cycle, expanding the set of tasks the model handles well.
But iterating so much over the same benchmarks inevitably produces a natural overfit to the measurement target. The gains we observe may reflect, in part, the model’s specialization to the very items we use to evaluate it, rather than a generalization to new tasks within the same class. Our strategy to fight this is to constantly create new benchmarks, prioritizing diversity within each capability we are evaluating, so that each generation is also put to the test by evaluations that haven’t yet influenced the recipe. It is not a problem you solve once: it requires continuous effort to revise and critique the benchmarks we use, and the willingness to drop (or reformulate) those that stop measuring what matters.
Technical reports for each Sabiá generation
For more details on data, training choices and benchmark results across each generation of the Sabiá family, see the technical reports:
- Sabiá: Portuguese Large Language Models (2023): arxiv.org/abs/2304.07880
- Sabiá-2: A New Generation of Portuguese Large Language Models (2024): arxiv.org/abs/2403.09887
- Sabiá-3 Technical Report (2024): arxiv.org/abs/2410.12049
- Sabiá-4 Technical Report (2026): arxiv.org/abs/2603.10213