We work as engineers at Bitboundaire, and lately we've been using Claude Code more and more in our daily workflow. It's fast, it understands context, and most of the time it produces code that actually works.

But "most of the time" isn't good enough when you're shipping to production.

So we decided to find out: How consistent is Claude Code, really? If we give it the exact same task 20 times in a row, will it produce the same result? What happens when our codebase has messy, misleading code lying around? Can we trust it enough to skip manual review?

We ran 100 executions across five different scenarios. Here's what we learned.

Why We Decided to Test This

Here's the thing about AI code generation that nobody talks about enough: it's not just about whether the tool can write good code—it's about whether it does so consistently.

Think about it. If Claude Code writes a perfect endpoint 90% of the time but introduces a subtle bug the other 10%, you've got a problem. Especially if you can't predict when that 10% will hit. You'd spend as much time reviewing AI code as you would writing it yourself.

We wanted real numbers. Not vibes, not "it feels reliable," but actual data we could use to decide how much to trust this tool in our workflow.

The Setup: A Real Codebase

The Tool

For this evaluation, we used Claude Code—Anthropic's official CLI tool for AI-assisted development. It runs directly in your terminal and has full access to your codebase, allowing it to read files, understand context, and make edits.

  • Claude Code version: 1.0.17 (February 2026)
  • Model: Claude Opus 4.5 (claude-opus-4-5-20251101)
  • Context: Full codebase access with automatic file reading

Claude Opus 4.5 is Anthropic's most capable model as of this writing, optimized for complex reasoning and code generation tasks. We chose this model specifically because we wanted to test the ceiling of what's possible—if the best model struggles with certain scenarios, lighter models likely will too.

The Codebase

We didn't want to test this on a trivial "hello world" app. Real codebases are messy. They have legacy code, inconsistent patterns, and that one file someone wrote at 2 AM that nobody wants to touch.

The test codebase:

  • FastAPI application
  • MVC architecture
  • SQL database with SQLAlchemy
  • Full test suite (unit + integration)
  • Clean, well-structured starting point

The task:

  • Create a new API endpoint
  • Modify an existing endpoint's behavior
  • Follow specifications from a markdown reference file
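
To make the task concrete, here's a rough sketch of the kind of logic the endpoint was expected to produce. The names (`format_record`, the `is_stale` flag, the 30-day threshold) are our illustration, not the actual spec, which lived in the markdown reference file:

```python
from datetime import datetime, timezone

def format_record(record: dict, stale_after_days: int = 30) -> dict:
    """Apply conditional flags and formatting to a raw database row.

    Hypothetical example of the task's shape, not the real spec.
    """
    created = record["created_at"]
    age_days = (datetime.now(timezone.utc) - created).days
    return {
        "id": record["id"],
        "name": record["name"].strip().title(),   # formatting
        "is_stale": age_days > stale_after_days,  # conditional flag
    }
```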

The rules:

  • The exact same prompt every time: create a new endpoint that fetches database details and applies conditional flags and formatting
    (The prompt asked Claude Code to create a new endpoint that fetches database details and applies conditional flags and formatting.)
  • Reset the codebase between each run
  • No manual intervention
  • Code must compile on first try
  • All tests must pass without fixes

If either condition failed, that execution was marked as a failure. No partial credit.
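
Our pass/fail check can be sketched as a small harness. The function below is illustrative (our actual pipeline also did a full `git reset` between runs), but it captures the all-or-nothing rule: every file must byte-compile, and the whole suite must pass.

```python
import subprocess
import sys
from pathlib import Path

def validate_run(repo: Path) -> bool:
    """Return True only if every .py file compiles AND pytest passes.

    Illustrative sketch of our automated check; no partial credit.
    """
    # 1. Compile check: every Python file must byte-compile on the first try.
    for py_file in repo.rglob("*.py"):
        proc = subprocess.run(
            [sys.executable, "-m", "py_compile", str(py_file)],
            capture_output=True,
        )
        if proc.returncode != 0:
            return False
    # 2. Test check: the full suite must pass without any manual fixes.
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", str(repo)],
        capture_output=True,
    )
    return proc.returncode == 0
```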

The Five Scenarios We Tested

We wanted to understand not just baseline performance, but how Claude Code behaves as conditions get worse. So we designed five scenarios that progressively introduce complexity—first varying the guidance level, then adding bad code to the mix.

Scenario                         What's Different
1. Clean state – No guidance     No reference file, codebase is clean
2. Clean state – With guidance   Reference file provided, codebase is clean
3. Bad code – No guidance        No reference file, bad code present, figure it out
4. Bad code – With guidance      Reference file provided, but bad code exists in the codebase
5. Bad code – Strict guidance    Reference file + ignore instruction + bad code present

The "bad code" we injected was designed to be maximally confusing:

  • Functions with nonsensical logic
  • Wrong model names and invalid database relationships
  • Same naming conventions and file structure as legitimate code
  • Wrong README file description
  • Disabled tests that would yield wrong values

Basically, we were trying to trick Claude Code into copying the wrong patterns.
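
For illustration, the injected files looked something like this (a hypothetical reconstruction, not our literal test fixture): syntactically valid, structurally identical to a real service, and quietly wrong in every way that matters.

```python
# user_service.py -- deliberately broken "legacy" service we planted.
# It follows the same naming conventions as legitimate code, which is
# exactly what makes it dangerous as context for an AI tool.

def get_user_details(user_id: int) -> dict:
    # Nonsensical logic: discards the requested id and inverts the flag.
    return {
        "id": -1,                   # wrong: ignores user_id entirely
        "model": "UserAcount",      # wrong model name (typo is intentional)
        "is_active": user_id < 0,   # inverted condition
    }
```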

The Results: What Actually Happened

Here's the raw data from 100 total executions (20 per scenario):

Bad Code?   Guidance          Compiles   Tests Pass   Final Success Rate
No          No guidance       75%        80%          60%
No          With guidance     80%        90%          72%
Yes         No guidance       35%        40%          14%
Yes         With guidance     50%        65%          32.5%
Yes         Strict guidance   60%        70%          42%

What We Learned

Clean Codebase: Guidance Helps, But Less Than You'd Think

The difference between Scenario 1 and Scenario 2 was smaller than we expected.

Without guidance (Scenario 1), Claude Code performed reasonably—75% compile, 80% tests passing, resulting in a final success rate of just 60%. With a clean codebase and consistent patterns to learn from, it could usually infer what we wanted, but showed some inconsistency.

With guidance (Scenario 2), compile success improved to 80%, and test pass rate jumped to 90%, pushing the final success rate to 72%. The reference file provided meaningful improvement in both compilation and logical correctness.

This tells us something important: even in a clean codebase, guidance makes a real difference. The final success rate jumped from 60% to 72%—a 12 percentage point improvement that shows explicit documentation helps Claude Code understand not just how to write code, but what the code should actually do.

Takeaway: Even in a well-maintained codebase, guidance provides meaningful improvement. Don't skip the reference documentation.

The Bad Code Problem: Where Guidance Actually Matters

Scenario 3 was rough. Without a reference file, and with misleading code in the codebase, Claude Code's success rate dropped to 35% compile, 40% tests passing—a final success rate of just 14%. That means only about 1 in 7 runs produced fully working code.

What happened? It pattern-matched against the bad code. If there was a file called user_service.py with broken logic, and we asked for a new order_service.py, Claude Code would sometimes copy the broken patterns.

Compare that to Scenario 4: same bad code, but with a reference file provided. Success improved to 50% compile, 65% tests passing, bringing the final success rate to 32.5%. Better, but still means roughly two-thirds of runs failed. The guidance helped, but bad code still exerted significant influence.

This makes sense when you think about how these models work. They use context to infer conventions. Even with explicit guidance, if there's conflicting information in the codebase, some of that noise leaks through.

Takeaway: Codebase hygiene matters more with AI tools, not less. Bad code doesn't just sit there being bad—it actively teaches the AI to write more bad code. Guidance helps, but it can't fully compensate for a messy codebase.

Strict Guidance Helps

Instead of leaving it implicit that Claude Code should ignore files outside the guidance, explicitly telling it what not to look at makes it more precise.

The results: 60% compile, 70% tests passing, for a final success rate of 42%. That's better than Scenario 4's 32.5%, but still means more than half of runs failed.

The strict ignore directive helped Claude Code filter out the misleading patterns. Yes, it occasionally missed some legitimate helper modules, which explains why tests didn't hit 100%. But the net effect was positive—fewer bad patterns copied, more consistent output.


The key was being specific about what to ignore. Broad directives like "ignore everything except X" caused problems in early experiments. Targeted directives like "ignore files in /legacy/ and /deprecated/" worked much better.
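
As a rough illustration (the exact wording of our directives varied between runs), the contrast looked like this:

```markdown
<!-- Too broad: starved Claude Code of legitimate helper modules -->
Ignore everything except the controller file you are editing.

<!-- Targeted: what actually worked -->
Ignore all files under /legacy/ and /deprecated/.
Do not use user_service.py as a reference for new services.
```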

Takeaway: When you have known bad code that you can't immediately fix, targeted ignore directives can help. But be specific—broad exclusions cause more problems than they solve.

Where Things Went Wrong: A Failure Breakdown

Across all 27 failing executions, here's what actually broke:

Failure Type               How Often
Wrong model or reference   38%
Missing imports            28%
Logic errors               24%
Syntax errors              10%

The most common failure was using the wrong model or database reference. This usually happened when bad code used similar names to legitimate code. Claude Code would grab the wrong one.

Missing imports were the second biggest issue—usually a side effect of the strict ignore mode cutting off access to necessary modules.

Actual syntax errors were rare (10%). When Claude Code fails, it usually fails at the logic level, not the typing level.

How to Get Reliable Results: What Actually Works

Based on these 100 runs, here's what we now do differently:

1. Keep Your Codebase Clean First

This is the biggest lever. In a clean codebase, Claude Code hit a 60% final success rate without guidance, and 72% with guidance. In a messy codebase, even strict guidance only got us to a 42% final success rate.

Bad code in your repo isn't just technical debt—it's training data for your AI tools. Every inconsistent pattern, every hack, every "temporary" fix becomes something Claude Code might copy.

2. Provide Reference Documentation for Complex Tasks

For any non-trivial task, we write a short markdown file describing:

  • What we want built
  • What patterns to follow
  • Any constraints or edge cases
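
A reference file for a task like ours might look like this. This is a condensed, hypothetical version; the file names and endpoint are invented for illustration:

```markdown
# Reference: GET /orders/{order_id}/details

## What to build
- New endpoint in `order_controller.py` returning order details as JSON.

## Patterns to follow
- Use the existing SQLAlchemy session dependency, as in `product_controller.py`.
- Response models live in `schemas/`, one Pydantic model per endpoint.

## Constraints and edge cases
- Return 404 (not 500) when the order id does not exist.
- Serialize dates as ISO 8601 strings.
```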

In clean codebases, this moved the final success rate from 60% to 72%. In messy codebases, it moved the final success rate from 14% to 32.5%.

3. Use Targeted Ignore Directives for Known Problem Areas

If you have legacy code or known bad patterns that you can't immediately fix, targeted ignore directives help. The key word is targeted—specify exact directories or files, not broad exclusions.

This moved the final success rate from 32.5% to 42% in our bad code scenarios. Not transformative, but meaningful.

4. Validate Incrementally

Don't ask Claude Code to build an entire feature in one shot. Break it into small pieces. Validate each piece. This keeps the blast radius small when something does go wrong.

When to Trust Claude Code (And When to Double-Check)

Based on our testing, here's a simple framework:

High confidence (~72% final success rate):

  • Clean codebase with consistent patterns
  • Task aligns with existing architecture
  • Small, focused changes
  • Guidance provided for best results

Medium confidence (~32-42% final success rate):

  • Some bad code present, but guidance provided
  • Targeted ignore directives for known problem areas
  • Review output before committing

Lower confidence (~14% final success rate):

  • Legacy codebase with inconsistent patterns
  • No reference documentation
  • Bad code present with similar naming to legitimate code
  • Always review and test manually

The Bottom Line

Claude Code performs best when working with a clean codebase and clear guidance. With both, we saw a final success rate of 72%. Without guidance, even a clean codebase dropped to 60%.

Codebase quality matters more than prompt engineering. In a messy codebase:

  • No guidance: 14% final success rate
  • With guidance: 32.5% final success rate
  • With strict guidance: 42% final success rate

Guidance helps, but it can't fully compensate for bad code. The best you can do with a messy codebase and perfect prompts is a 42% final success rate—that's still 58% failures requiring manual intervention.

The key insight is this: Claude Code's reliability is directly proportional to the quality of your codebase. Clean code + guidance = 72% success. Messy code + perfect guidance = 42% success. That's a 30 percentage point gap that no amount of prompt engineering can close.

This isn't a limitation—it's actually how it should work. The tool learns from whatever context it has. Give it clean patterns, and you get clean output. Give it chaos, and you get chaos back.

For us at Bitboundaire, that means Claude Code has earned a permanent spot in our workflow. But we've also changed how we work: stricter codebase hygiene first, documentation second, and always breaking big tasks into small, verifiable pieces.

The AI didn't make us lazier. It made us more disciplined about code quality. And honestly? Our codebase is better for it.

Test Details

Tool:                    Claude Code v1.0.17
Model:                   Claude Opus 4.5 (claude-opus-4-5-20251101)
Model knowledge cutoff:  May 2025
Test date:               February 2026
Executions per scenario: 20
Total executions:        100
Codebase reset:          Full git reset between runs
Prompt variation:        None (identical prompt)
Human intervention:      None
Validation:              Automated (pytest + import checks)
Codebase:  
Framework:               FastAPI  
ORM:                     SQLAlchemy  
Architecture:            MVC  
Test suite:              pytest (unit + integration)


Note: Results may vary with different Claude models (Sonnet, Haiku) or future versions. Lighter models may show lower success rates, while future model improvements could increase reliability across all scenarios.