Claude Code: Picking the Model, the Effort Level, and the Cache Discipline

Most of the Claude Code threads I see in team chats are not about prompts. They are about cost and speed: which model to leave on, when to bump effort, why a fresh chat suddenly burned through the daily limit. The interesting thing is that the questions are usually the wrong ones. People reach for a stronger model when they should have rewritten the prompt, or they crank up effort on a task that did not need any thinking at all.

In This Article

  1. Pick one model per chat, change effort first
  2. The three models, in one paragraph each
  3. The right-sized starting point
  4. Effort levels are a separate knob
  5. Quality scales logarithmically, cost scales linearly
  6. Prompt caching is the one optimization that always pays
  7. Cost discipline is just the three rules applied at once
  8. A working pattern for the day

This piece is the short version of the answer I keep giving. It covers the three knobs that actually matter in day-to-day use: the model, the effort level, and prompt caching. The fourth thing, cost discipline, falls out of the first three once you understand them.

A 3x3 grid showing Haiku / Sonnet / Opus on the rows and short task / medium task / long agentic task on the columns, with the recommended starting point highlighted in each cell

Pick one model per chat, change effort first

The first rule has nothing to do with the model lineup. It is about how you use a chat.

A chat in Claude Code carries a working context: your files, your CLAUDE.md, the project state, and the running conversation. Switching models mid-chat does not reset that context, but it does change how the model reads it. Opus and Sonnet weigh instructions differently, and Haiku is materially worse at following multi-step plans the other two were happily executing five turns earlier. The chat ends up sounding like it has been handed off between three teammates who all read the same brief and reached different conclusions.

Stick with one model per chat. If the model is not giving you what you want, the right next move is almost never to switch models. Raise effort. Or, more often, fix the prompt. The model swap is a last step, not a first one.

The three models, in one paragraph each

Haiku 4.5 is the fast model. It is built for bulk work where the answer is short and the constraints are tight: rename this variable, write a short bash one-liner, summarize this 200-line file in three sentences. Haiku does not have effort levels. There is nothing to tune. If Haiku gets the answer wrong, the fix is either to give it a smaller, sharper task or to hand the task to Sonnet.

Sonnet 4.6 is the model you should leave on by default. It handles the vast majority of real coding work: multi-file refactors, debugging through a codebase you mostly understand, writing SQL of medium complexity, reviewing a PR. The reason Sonnet is the default is not that it is “smart enough.” It is that on the kind of work where you need a thinking model at all, Sonnet hits the best ratio of quality to cost across independent benchmarks. Opus is better, but it is better in places where Sonnet is already adequate, and the price for that improvement is real.

Opus 4.7 is the heavy model. Anthropic describes it as the model for problems the previous versions could not finish. In practice it shows up in three places: long autonomous tasks where the agent has to plan, execute, and adapt across many tool calls; architectural decisions where the surface area is wide and the trade-offs are subtle; and stubborn debugging where Sonnet has been circling the same wrong answer for three turns. Opus 4.7 makes about a third as many tool-use errors as Opus 4.6, which sounds like a small win until you watch a long agent run finish on the first try instead of failing twice and restarting.

Opus reads instructions more literally than the other two. That is a feature in agent loops, where literal interpretation prevents drift, and a tax in casual chats, where an underspecified instruction will be answered exactly as written instead of with the obvious helpful interpretation. The fix is the same as for any literal reader: be more specific in the prompt, not louder.

The right-sized starting point

The short version:

  • Start every chat on Sonnet 4.6. It is the right answer 70 to 80 percent of the time.
  • If the answer is shallow, raise effort before you reach for Opus.
  • If the answer is still shallow at high effort, switch to Opus 4.7 and reset effort to its default for the new model.
  • If the task is mass routine (lint a folder, normalize 200 file names, regenerate boilerplate), drop to Haiku and accept the trade-off: faster, cheaper, dumber.

The mistake everyone makes once is leaving Opus on for a week because last Tuesday it solved a hard refactor. The chat after the refactor is asking Opus to rename a column. Opus does it. Opus also eats through your rate limit faster than the rest of the week's work combined.
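The escalation order in the list above can be written down as a small decision function. This is only a sketch of the article's rule of thumb: the function name, the string labels, and the `prompt_reviewed` flag are illustrative and are not part of Claude Code.

```python
from typing import Optional, Tuple

# Effort labels follow the article's wording; they are not confirmed CLI tokens.
SONNET_LADDER = ["low", "medium", "high"]


def next_move(model: str, effort: Optional[str],
              prompt_reviewed: bool) -> Tuple[str, Optional[str], str]:
    """Suggest (model, effort, action) after a shallow answer.

    effort=None means "whatever the model's default is".
    """
    if not prompt_reviewed:
        # The article's first move is always the prompt, not the knobs.
        return model, effort, "fix the prompt first"
    if model == "haiku":
        # Haiku has no effort levels, so the only step up is Sonnet.
        return "sonnet", "high", "hand the task to Sonnet"
    if model == "sonnet":
        if effort == "high":
            return "opus", None, "switch to Opus at its default effort"
        if effort in SONNET_LADDER:
            raised = SONNET_LADDER[SONNET_LADDER.index(effort) + 1]
        else:
            raised = "high"
        return "sonnet", raised, "raise effort"
    # Already on Opus: this section's ladder ends here.
    return model, effort, "stop and rework the task"
```

Calling `next_move("sonnet", "medium", True)` suggests staying on Sonnet and raising effort to high; only a second shallow answer at high moves the chat to Opus.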

Effort levels are a separate knob

The model determines how the answer is reasoned. Effort determines how much reasoning the model is allowed to do before it answers. The two knobs interact, but they are not the same knob.

Claude Code exposes five effort levels, set with /effort or with the --effort flag at launch. Haiku ignores them. Sonnet and Opus respect them, though extra-high exists only on Opus and the reasoning budget behind each level differs between the two models.

low is for one-line tasks with one right answer. “Rename this function,” “format this JSON,” “tell me what this regex matches.” There is nothing to think about. Low gets you the answer with the smallest possible latency and the smallest possible token spend.

medium is for small refactors, simple SQL, short explanations of unfamiliar code. The model has to glance at context but not actually plan. Most “explain this snippet” requests live here.

high is the working level for serious coding. Multi-file edits, real debugging, schema design, a hard SQL query, a non-trivial code review. On independent benchmarks, high is where Sonnet’s quality-to-cost ratio peaks, and it is where most of the productive thinking happens. If you are going to leave one effort level on for a normal workday, high is the right choice.

extra-high exists only on Opus 4.7. It is built for the kind of task that needs to hold a long plan in working memory: agentic loops with twenty tool calls, architectural reviews where the conclusion depends on five things upstream, deep root-cause analysis where the first three hypotheses were wrong. Extra-high is rare in chat work and routine in agent work.

max is almost always wrong. Anthropic itself recommends it only when you have measurably hit a ceiling at extra-high and need a last push. On structured tasks like JSON generation or parsing, max can make the output worse, because the model over-thinks and second-guesses correct intermediate steps. The “more thinking is more quality” intuition does not survive contact with the actual benchmarks past extra-high.
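The availability rules in this section (Haiku ignores effort, extra-high is Opus-only) fit in a small lookup. This is a sketch built only from the descriptions above: the level names are the article's labels rather than confirmed CLI values, and raising an error for an unsupported combination is an assumption, since the article does not say how the tool reacts.

```python
from typing import Optional

# Which effort labels each model accepts, as described in this section.
LEVELS = {
    "haiku": (),  # no effort levels at all
    "sonnet": ("low", "medium", "high", "max"),
    "opus": ("low", "medium", "high", "extra-high", "max"),
}


def effective_effort(model: str, requested: str) -> Optional[str]:
    """Return the effort that would apply, or None if the model ignores effort."""
    allowed = LEVELS[model]
    if not allowed:
        return None  # Haiku: the setting has no effect
    if requested not in allowed:
        # Assumed behavior for this sketch only.
        raise ValueError(f"{requested!r} is not available on {model}")
    return requested
```

A check like this is mostly useful in scripts that launch many sessions, where a silent mismatch between model and effort is easy to miss.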

A vertical ladder showing low / medium / high / extra-high / max with the typical use case for each level, the cost multiplier vs medium, and a marker showing the practical cap (extra-high for most agent work, max almost never)

Quality scales logarithmically, cost scales linearly

The reason effort is not a free dial is the curve. Doubling the effort budget does not double the quality. It moves the answer a small amount along a diminishing-returns curve while doubling the tokens and roughly doubling the latency. On a real workday those latency hits are not theoretical. A max-effort response on a routine task takes the better part of a minute. Run that ten times in an hour and the rate-limit math gets ugly fast.
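A toy model makes the shape of that curve concrete. The logarithmic form is taken from the section heading; the baseline of 50, the gain of 5 points per doubling, and the 1,000-token reference budget are invented for illustration and are not measurements.

```python
import math


def toy_quality(budget_tokens: float, base: float = 50.0,
                gain_per_doubling: float = 5.0,
                ref_budget: float = 1000.0) -> float:
    """Hypothetical quality score that grows with log2 of the reasoning budget."""
    return base + gain_per_doubling * math.log2(budget_tokens / ref_budget)


def gain_per_1k_tokens(lo: float, hi: float) -> float:
    """Quality gained per extra 1,000 tokens when moving from lo to hi."""
    return (toy_quality(hi) - toy_quality(lo)) / ((hi - lo) / 1000.0)


for lo, hi in [(1000, 2000), (2000, 4000), (4000, 8000)]:
    print(f"{lo} -> {hi} tokens: "
          f"+{toy_quality(hi) - toy_quality(lo):.1f} quality, "
          f"{gain_per_1k_tokens(lo, hi):.2f} per extra 1k tokens")
```

Under these assumptions every doubling buys the same fixed gain, so the return on each additional thousand tokens halves at every step while the bill keeps doubling.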

The other piece of the curve is the 80-percent rule, which I keep on a sticky note: when the model is not giving you what you want, 80 percent of the time the problem is not that it did not think enough. The problem is that it did not see the right file, did not have the schema in front of it, did not know about the existing helper, did not understand the convention you forgot to write down. The fix for those is to fix the prompt, not to raise effort.

Effort is the right answer when the prompt is good and the task is genuinely hard. It is the wrong answer when the prompt is bad and the task only looks hard because the model is improvising.

Prompt caching is the one optimization that always pays

The third knob is not really a knob. It is a feature that runs in the background, and the only thing you can do about it is to not break it.

Prompt caching is the mechanism that prevents you from paying full freight for the same context twice. When Claude Code sends a request, the system looks at the start of the conversation: your CLAUDE.md, the files you have read, the running history. If those bytes match the bytes from a recent request, the model skips the re-encode and pulls from cache instead. The savings are real: cached input is cheaper and faster than fresh input, often by a factor of ten on the cost line.
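The arithmetic behind that saving is easy to sketch. The model below is deliberately simplified: it assumes a fixed prefix re-sent on every turn, uses the factor-of-ten discount mentioned above, prices tokens in abstract units, and ignores conversation growth and any surcharge for writing to the cache.

```python
def input_cost(prefix_tokens: int, turns: int, price_per_token: float = 1.0,
               cached_discount: float = 0.1, cached: bool = True) -> float:
    """Total input cost of sending the same prefix on every turn.

    The first turn always pays full price; later turns pay the discounted
    rate only when the cache is warm. All prices are hypothetical units.
    """
    first_turn = prefix_tokens * price_per_token
    later_rate = price_per_token * (cached_discount if cached else 1.0)
    return first_turn + (turns - 1) * prefix_tokens * later_rate


warm = input_cost(50_000, turns=20, cached=True)
cold = input_cost(50_000, turns=20, cached=False)
print(f"warm cache: {warm:,.0f} units, no cache: {cold:,.0f} units")
```

With a 50,000-token prefix over 20 turns, the warm chat pays roughly 145,000 units against 1,000,000 for the uncached one, which is why the rest of this section is about not breaking the cache.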

Cache entries live for one hour from the last interaction. The clock resets every time you send a new message. So a chat where you send something every twenty to forty minutes keeps the cache warm indefinitely. A chat where you walk away for an hour and a half loses the cache, and the next message pays the full re-encode price for the entire context.

This produces a counterintuitive habit. The cheapest way to use a long chat is to use it steadily. The most expensive way is to bounce between five chats, leaving long gaps in each. Pick one chat for the work in front of you, keep it warm with a message every half hour or so, and the running cost on that chat stays small for hours.
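The sliding expiry can be simulated in a few lines. The one-hour lifetime is the figure quoted above; treating a gap of exactly 60 minutes as expired is an assumption of this sketch, as is the helper name.

```python
from typing import List


def cache_hits(send_times_min: List[int], ttl_min: int = 60) -> List[bool]:
    """For each message, report whether it arrives while the cache is warm.

    The timer restarts on every message, so only the gap since the
    previous message matters. The first message is always a miss.
    """
    hits = []
    last = None
    for t in send_times_min:
        hits.append(last is not None and (t - last) < ttl_min)
        last = t
    return hits


steady = list(range(0, 241, 25))  # one message every 25 minutes for four hours
gappy = [0, 90]                   # one 90-minute break

print(sum(cache_hits(steady)), "of", len(steady), "steady messages hit the cache")
print(sum(cache_hits(gappy)), "of", len(gappy), "gappy messages hit the cache")
```

The steady chat misses only on its first message; the chat with the 90-minute break pays the full re-encode twice.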

A timeline showing two chats over four hours. The top chat sends a message every 25 minutes and stays cached the entire window. The bottom chat sends, gaps for 90 minutes, sends again, and pays the full re-encode on the second message. Annotated cost markers on each

A few things break the cache that look like they would not. Editing CLAUDE.md mid-session resets the cache on the next message, because the file is part of the cached prefix. Manually re-reading a file you had already read sometimes pushes the conversation past the cache boundary if the second read returns different content. Switching projects mid-chat (and therefore loading a new CLAUDE.md) always reloads. None of these are reasons to avoid the action; they are reasons to avoid the action right before you ask Claude to write a 400-line script.
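Why an edit near the top of the context is so expensive follows from prefix matching: only the unchanged leading part of a request can be reused. The sketch below is a simplified model of that idea, not the actual implementation; the block layout, file names, and token counts are made up.

```python
from typing import List, Tuple

Block = Tuple[str, str, int]  # (name, content, token count), all hypothetical


def cached_prefix_tokens(previous: List[Block], current: List[Block]) -> int:
    """Count tokens in the leading run of blocks that is identical in both requests."""
    total = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # everything from the first difference onward is re-encoded
        total += new[2]
    return total


before = [
    ("CLAUDE.md", "conventions v1", 2_000),
    ("schema.sql", "create table ...", 8_000),
    ("history", "turns 1-12", 30_000),
]

# Normal next turn: the old context is untouched and a new message is appended.
appended = before + [("message", "turn 13", 500)]

# CLAUDE.md edited mid-session: the very first block no longer matches.
edited = [("CLAUDE.md", "conventions v2", 2_000)] + before[1:]

print(cached_prefix_tokens(before, appended))  # whole old context is reusable
print(cached_prefix_tokens(before, edited))    # nothing is reusable
```

In this model a change to the first block invalidates everything after it, while a change late in the conversation leaves most of the prefix reusable.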

Cost discipline is just the three rules applied at once

Once you understand the three knobs, the cost question stops being a separate topic. It is a consequence.

  • The model is right-sized to the task. Sonnet by default, Opus when stuck, Haiku for bulk.
  • Effort is at high for normal work, raised only when the prompt is already good and the task is genuinely hard, dropped to medium or low for routine tasks.
  • The chat stays warm during work hours, and you do not bounce between five chats unless you have to.

A team that follows those three rules uses Claude Code at maybe a quarter of the rate-limit pressure of a team that does not. The work that gets done is the same. The team that follows the rules has more headroom on the limit, faster response times because effort is not maxed out by default, and a more legible chat history because the model is not switching mid-thread.

The team that does not follow the rules tends to do three things repeatedly. First, they leave Opus and max effort on after they survived one hard task. Second, they open new chats reflexively because “this one is for a different feature,” even though the codebase context is the same and the new chat will pay the full re-encode cost. Third, they react to a shallow answer by raising effort, then by switching to Opus, then by raising effort again, before they look at whether the prompt actually contains the information the model needs to answer.

A working pattern for the day

  • Default to Sonnet 4.6 at high effort. Set it in your settings, do not re-pick it every chat.
  • Open one chat per active piece of work. Keep it open. Send something every twenty to forty minutes to keep the cache warm.
  • When the answer is shallow, your first move is to re-read what you sent. The fix is in the prompt 80 percent of the time.
  • When the prompt is good and the task is still hard, raise effort. If extra-high (Opus only) does not get there, the problem is structural, not a matter of budget. Stop and rewrite.
  • For bulk work that is not really thinking, drop to Haiku and accept the trade-off. Haiku is a different tool, not a worse Sonnet.
  • For long autonomous agent runs, start the chat on Opus 4.7 at extra-high and leave it. The cost saved by switching mid-run is smaller than the cost of the run going off the rails.

That is the whole framework. The model, the effort, the cache, and a small amount of discipline about which knob to turn when the work gets harder. The rate-limit math takes care of itself.

About the author

Nick Valiotti is the founder of Valiotti Data. 15+ years building analytics infrastructure for SaaS, marketplaces, and consumer subscription. 50+ production deployments across BigQuery, Snowflake, dbt, Metabase, and modern BI stacks. Author of two books on data strategy.
