How to Reduce AI Hallucinations in Code Generation
Every developer who uses AI coding tools has experienced it: the code looks perfect, the logic seems sound, and then you try to run it. The import fails because the function doesn’t exist in that library. Or the API call uses parameters from a different version. Or the algorithm handles nine out of ten cases correctly but silently produces wrong results for the tenth.
These are AI hallucinations in code, and they’re one of the most persistent problems with AI-assisted development. Unlike hallucinations in text (where you might notice a factual error), code hallucinations can hide inside otherwise correct code and only surface in production.
We’ve tracked AI hallucination patterns across thousands of code generation sessions and developed strategies that significantly reduce the rate and impact of these errors.
What AI Code Hallucinations Look Like
Invented APIs
The most common hallucination: the AI generates code that calls a function or method that doesn’t exist in the library. The name sounds plausible — it might be a function that should exist, or that existed in a previous version — but it’s not real.
# Hallucination: this method doesn't exist
import pandas as pd
df = pd.read_csv("data.csv")
result = df.smart_merge(other_df, strategy="fuzzy") # invented method
This happens because AI models learn from many versions of documentation and code. They blend features from different versions and sometimes generate composites that never existed in any version.
Wrong Library Versions
Related to invented APIs: the code is correct for a specific version of a library, but not the version you’re using. This is especially common with fast-moving libraries like React, Next.js, and TensorFlow, where APIs change between major versions.
// Valid in the Next.js Pages Router (pages/), but not supported in the App Router (app/)
export async function getServerSideProps(context) {
  // Files under app/ fetch data directly in Server Components instead
  return { props: {} };
}
Plausible But Wrong Logic
The most dangerous hallucination: code that runs without errors but produces incorrect results for certain inputs. The logic looks reasonable, and it passes basic testing, but it’s mathematically or logically wrong.
# Looks correct, but breaks on zero and negative baselines
def calculate_percentage_change(old_value, new_value):
    return ((new_value - old_value) / old_value) * 100

# Raises ZeroDivisionError when old_value is 0
# Gives the wrong sign when old_value is negative (-10 to -5 reports -50%)
Nonexistent Configuration Options
AI tools sometimes generate configuration files with options that don’t exist, combining real options with invented ones.
# Hallucinated config options mixed with real ones
server:
  port: 3000
  max_connections: 100           # real
  auto_optimize: true            # doesn't exist
  smart_cache_mode: "adaptive"   # doesn't exist
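One way to catch invented options is to compare the generated keys against an allowlist taken from the tool's real configuration reference. The sketch below is a minimal illustration: `KNOWN_SERVER_OPTIONS` and `unknown_options` are hypothetical names, and the allowlist simply mirrors the example above rather than any real server's schema.

```python
# Hypothetical allowlist: in practice, copy these names from the tool's
# official configuration reference or schema, not from memory.
KNOWN_SERVER_OPTIONS = {"port", "max_connections"}


def unknown_options(section: dict, known: set[str]) -> list[str]:
    """Return the option names in `section` that are not in `known`, sorted."""
    return sorted(set(section) - known)


# The generated config from the example above, already parsed into a dict.
generated_config = {
    "server": {
        "port": 3000,
        "max_connections": 100,
        "auto_optimize": True,
        "smart_cache_mode": "adaptive",
    }
}

suspicious = unknown_options(generated_config["server"], KNOWN_SERVER_OPTIONS)
print(suspicious)  # ['auto_optimize', 'smart_cache_mode']
```

If the tool publishes a JSON Schema for its config, validating against that schema is likely a better choice than a hand-maintained set.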
Fake Package Names
Occasionally, AI suggests installing packages that don’t exist. This is a security concern: an attacker can register a package name that AI tools commonly hallucinate and publish malware under it, an attack sometimes called slopsquatting.
# The AI might suggest
pip install flask-smart-auth # this package may not exist (or may be malicious)
Always verify that suggested packages exist on PyPI or npm before installing them.
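For PyPI, the existence check can be scripted against the public JSON endpoint at `https://pypi.org/pypi/<name>/json`, which returns 404 for unregistered names. The helper names below (`pypi_status`, `is_registered`) are hypothetical, and this is only a sketch: a registered name is not proof that the package is legitimate.

```python
import urllib.error
import urllib.request
from typing import Callable


def pypi_status(name: str) -> int:
    """Return the HTTP status of the PyPI JSON API page for `name`."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 404 means no project is registered under that name


def is_registered(name: str, get_status: Callable[[str], int] = pypi_status) -> bool:
    """True if PyPI has a project with this name. This says nothing about safety."""
    return get_status(name) == 200


# Example (requires network access):
# print(is_registered("flask-smart-auth"))
```

The `get_status` parameter exists so the check can be tested without network access. If the name is registered, it is still worth checking the project's release history, maintainers, and whether the official documentation of the framework links to it.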
Why Hallucinations Happen
Understanding the mechanism helps prevent them.
Training data is a snapshot. Models are trained on code from a specific time window. Libraries change after that snapshot. The model doesn’t know about breaking changes, deprecated functions, or new APIs introduced after training.
Pattern completion, not understanding. AI models predict the most likely next token based on patterns. If a pattern like df.smart_merge() appears plausible given the context, the model generates it even if it doesn’t exist. The model doesn’t check against a list of real API methods.
Blending across sources. Models see thousands of examples from different libraries, versions, and languages. They sometimes blend features: a Python function with a JavaScript naming convention, or a React 17 pattern with React 18 syntax.
Confidence doesn’t indicate accuracy. AI models don’t have a reliable internal “I’m not sure about this” signal. They generate hallucinated code with the same confidence as correct code. You can’t tell from the output how certain the model is.
Strategies That Work
1. Verify Every Import and API Call
This is the single most effective practice. After AI generates code, verify that:
- Every imported module exists and is installed
- Every function and method called actually exists in the version you’re using
- Every parameter name and type matches the real API
IDE autocompletion helps here — if your editor doesn’t autocomplete a function name, it might not exist. Running help() in Python or checking TypeScript types catches most invented APIs.
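The same check can be done programmatically. The following sketch uses only the standard library; `api_exists` is a hypothetical helper name. It confirms that a name resolves, but not that the parameters or behavior match what the AI assumed.

```python
import importlib


def api_exists(module_name: str, dotted_attr: str) -> bool:
    """Return True if `module_name` imports and `dotted_attr` resolves on it."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    for part in dotted_attr.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


print(api_exists("json", "loads"))        # True
print(api_exists("json", "smart_loads"))  # False
print(api_exists("os", "path.join"))      # True
print(api_exists("not_a_real_module_xyz", "anything"))  # False
```

With pandas installed, `api_exists("pandas", "DataFrame.smart_merge")` should likewise return `False`. Note that importing a module executes its top-level code, so only run this against packages you already trust.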
2. Specify Versions in Your Prompts
Instead of asking “How do I do X with React?”, specify “How do I do X with React 18 using the App Router in Next.js 14?” The more specific your prompt, the less room the model has to blend features from different versions.
Even better, paste a snippet of your package.json or requirements.txt so the AI knows exactly which versions you’re working with.
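If you do this often, the version list can be generated from your environment. This is a rough sketch using `importlib.metadata` from the standard library; the function names and the wording of the preamble are assumptions, and the versions in the usage line are illustrative.

```python
from importlib import metadata


def installed_versions(packages: list[str]) -> dict[str, str]:
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def version_preamble(versions: dict[str, str]) -> str:
    """Format versions as a short block to paste at the top of a prompt."""
    lines = [f"- {name} {version}" for name, version in versions.items()]
    return "Use only APIs available in these versions:\n" + "\n".join(lines)


# Illustrative versions; in practice pass installed_versions([...]) instead.
print(version_preamble({"pandas": "2.2.0", "requests": "2.31.0"}))
```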
3. Use Type-Checked Languages
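TypeScript, Python with type hints and mypy, Rust, and Go catch many hallucinations before the code runs, either at compile time or during static analysis. If the AI invents a method or uses wrong parameter types, the type checker flags it immediately, provided the library ships type information.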
This is one of the strongest arguments for using typed languages with AI tools. The feedback loop is instant: generate code, type check fails, fix the issue.
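As a small illustration in typed Python, the functions below are hypothetical examples. The comments mark the kind of hallucinated call that a checker such as mypy would be expected to reject before the code ever runs.

```python
def normalize_title(raw: str) -> str:
    """Trim whitespace and title-case a heading."""
    # A hallucinated call such as `raw.to_title_case()` would fail type
    # checking, because `str` has no such method. The real method is `title()`.
    return raw.strip().title()


def average(values: list[float]) -> float:
    """Mean of a non-empty list; raises ValueError on empty input."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


# A call like `average("1,2,3")` would also be flagged: the argument is a
# `str`, but the annotation requires `list[float]`.
print(normalize_title("  reduce ai hallucinations "))  # Reduce Ai Hallucinations
print(average([1.0, 2.0, 3.0]))                        # 2.0
```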
4. Run the Code Immediately
Don’t accumulate large amounts of AI-generated code before running it. Generate a small piece, run it, verify it works, then generate the next piece. This catches hallucinations early when they’re easy to fix, rather than after they’ve become entangled with other generated code.
5. Test Edge Cases, Not Just the Happy Path
AI hallucinations often hide in edge cases. The code works for normal inputs but fails for:
- Empty inputs (empty strings, empty arrays, null)
- Boundary values (zero, negative numbers, very large numbers)
- Unicode and special characters
- Concurrent access
- Network failures and timeouts
After generating code, write tests specifically targeting these edge cases.
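Applied to the percentage-change function from earlier in this article, edge-case tests might look like the sketch below. The rewritten function is one possible fix, not the only one: dividing by the magnitude of the baseline and returning `None` for a zero baseline are design choices assumed here for illustration.

```python
from typing import Optional


def percentage_change(old_value: float, new_value: float) -> Optional[float]:
    """Percent change relative to the magnitude of old_value.

    Returns None when old_value is 0, where the change is undefined.
    """
    if old_value == 0:
        return None
    return (new_value - old_value) / abs(old_value) * 100


# Happy path
assert percentage_change(100, 150) == 50.0
# Boundary: a zero baseline must not raise ZeroDivisionError
assert percentage_change(0, 10) is None
# Negative baseline: moving from -10 to -5 is an increase
assert percentage_change(-10, -5) == 50.0
# Negative baseline: moving from -10 to -20 is a decrease
assert percentage_change(-10, -20) == -100.0
# No change
assert percentage_change(42, 42) == 0.0
```

Only the first assertion would pass against the original version with both behaviors intact, which is why happy-path testing alone misses this class of bug.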
6. Cross-Reference Documentation
When AI generates code using a library API, spend 30 seconds verifying against the official documentation. This catches most invented APIs and wrong parameter names. It’s faster than debugging a hallucination after the fact.
Bookmark the documentation for your most-used libraries. The time investment pays for itself quickly.
7. Use AI to Check AI
Ask a different AI model to review the generated code. Different models have different hallucination patterns, so one model often catches what another invented. Claude is particularly good at reviewing code for correctness because it tends to be more conservative and will flag things that look suspicious.
8. Prefer Popular, Well-Documented Libraries
AI models generate more accurate code for popular libraries because they’ve seen more training examples. If you’re choosing between two libraries for a task, the AI will likely hallucinate less about the one with more GitHub stars and better documentation.
9. Watch for “Too Good to Be True” Solutions
If the AI generates a solution that seems remarkably clean and simple for what you thought was a complex problem, be suspicious. It might have invented a convenient function that doesn’t exist, or simplified away important edge cases.
Real-world code is messy for a reason. If the AI’s version is suspiciously clean, it might be ignoring complexity rather than solving it.
Model-Specific Hallucination Patterns
In the sessions we tracked, different models showed different hallucination tendencies:
Claude tends to be conservative — it’s more likely to say it’s not sure about something than to hallucinate an API. When it does hallucinate, it’s usually about less popular libraries.
GPT-4o and GPT-4.5 hallucinations often involve blending features from different library versions. These models sound equally confident about everything, which means you need to verify more carefully.
Gemini hallucinates less about Google-related APIs (GCP, Android, Angular) because of its training data advantage. It hallucinates more about niche libraries.
Open-source models (Llama, DeepSeek, etc.) hallucinate more frequently, especially for less popular languages and frameworks. Smaller models generally hallucinate more.
Building Hallucination-Resistant Workflows
For Individual Developers
- Generate code in small chunks, not large blocks
- Run after every generation
- Use TypeScript or typed Python
- Verify imports and API calls before testing logic
- Keep official docs open for your key libraries
For Teams
- Include “verify AI-generated APIs” in your code review checklist
- Require type checking in CI/CD
- Flag AI-generated code in PRs for extra scrutiny
- Maintain a team wiki of known hallucination patterns for your stack
- Run security scans that catch fake packages
For Critical Systems
- Require human implementation for safety-critical code paths
- Use formal verification where possible
- Mandate 100% test coverage for AI-generated code in critical modules
- Implement runtime validation for outputs of AI-generated functions
- Log and monitor AI-generated code paths separately in production
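The runtime-validation item above could be implemented in many ways. One minimal sketch is a decorator that checks each return value against a predicate; `validate_output` and `discount_rate` are hypothetical names, and the business rule is invented purely for illustration.

```python
import functools
from typing import Callable, TypeVar

T = TypeVar("T")


def validate_output(check: Callable[[T], bool], message: str):
    """Decorator: raise ValueError if the wrapped function's result fails `check`."""
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> T:
            result = func(*args, **kwargs)
            if not check(result):
                raise ValueError(f"{func.__name__} returned {result!r}: {message}")
            return result
        return wrapper
    return decorator


@validate_output(lambda rate: 0.0 <= rate <= 1.0, "expected a rate between 0 and 1")
def discount_rate(loyalty_years: int) -> float:
    # Hypothetical AI-generated rule: it forgets to cap the rate at 100%.
    return loyalty_years / 20


print(discount_rate(3))  # 0.15
try:
    discount_rate(25)  # 1.25 fails the range check
except ValueError as err:
    print(f"rejected: {err}")
```

Failing loudly at the function boundary turns a silent wrong result into an error that shows up in logs, which also supports the monitoring item in the same list.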
The Trend Is Positive
AI hallucination rates are decreasing with each model generation. Better training data, retrieval-augmented generation (RAG) that checks real documentation, and tool use (where the model can call APIs to verify its output) are all helping.
But hallucinations won’t reach zero anytime soon. The architecture of current language models — predicting likely tokens based on patterns — inherently allows for plausible-but-wrong output. Until models can formally verify their own code, human verification remains essential.
The good news is that the strategies above catch the vast majority of hallucinations before they reach production. Building these practices into your workflow doesn’t take much time and saves enormous debugging effort later.
Treat AI-generated code like advice from a brilliant but sometimes overconfident colleague. Often excellent, occasionally wrong, and always worth double-checking.