Stop Wasting 80% of Your LLM Tokens: The Caveman + Graphify Framework

Most LLM usage in software development is wasteful. Not because engineers are careless — because the default interaction patterns assume you're chatting, not engineering. Every prompt includes courteous phrasing, full context re-explanation, and the model responds with hedges, caveats, and padding that reads well but carries no information.

I got tired of it. Over the past few months I built a three-step stack that cuts token usage by 80-90% on typical development tasks. The LinkedIn post went to 30k impressions, which tells me the frustration is widespread.

Here's the full framework.

Why Token Waste Is a Real Problem

It's not just cost. Tokens are also latency and rate limits. When you're deep in a complex system, you hit context limits faster when your prompts are verbose. You wait longer for responses padded with filler. You burn through your API quota on pleasantries.

A typical development interaction might look like:

"I hope you can help me with this! I'm working on a FastAPI backend and I'm running into an issue where the PostgreSQL connection pool seems to be exhausted under load. Could you take a look at this code and help me understand what might be going wrong? I've tried a few things but nothing seems to work..."

That's roughly 60 tokens, most of them spent on phrasing rather than technical content. Multiply that by 50 interactions a day and you've spent about 3,000 tokens on sentence structure.
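The arithmetic is trivial, but it helps to plug in your own numbers. A minimal sketch, using the 60-token and 50-interaction figures from the example above (the function name is just illustrative):

```python
# Figures from the example above; swap in your own measurements.
PREAMBLE_TOKENS = 60        # tokens spent on phrasing per prompt
INTERACTIONS_PER_DAY = 50   # prompts sent per day


def daily_overhead(preamble_tokens: int, interactions: int) -> int:
    """Tokens per day spent on phrasing rather than technical content."""
    return preamble_tokens * interactions


print(daily_overhead(PREAMBLE_TOKENS, INTERACTIONS_PER_DAY))  # 3000
```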

Step 1: Caveman Mode — Compress Without Losing Accuracy

Caveman is a Claude Code plugin that rewrites your prompts and Claude's responses in compressed format. The key constraint: technical accuracy is non-negotiable. Only filler, hedging, and social padding get stripped.

Before: "Could you please help me understand why this function might be returning undefined when it should be returning an array?"

After: "fn returns undefined, expect array, why?"

Claude's before: "That's a great question! There are several reasons why this might be happening. Let me walk you through the most common causes..."

Claude's after: "3 causes: 1) async not awaited 2) filter returns empty 3) mutation before return"

Setup is a short block in your Claude Code config:

{
  "plugins": ["caveman"],
  "caveman": {
    "mode": "full",
    "preserve": ["code_blocks", "technical_terms"]
  }
}

The preserve array tells Caveman what never to compress. Code blocks, variable names, API endpoints, error messages — these travel verbatim. Only natural language wrapping gets stripped.
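To make the preserve idea concrete, here is a minimal Python sketch. It is not Caveman's actual implementation: the FILLER patterns and the compress function are hypothetical, and the filler list is deliberately tiny. All it demonstrates is splitting a prompt so that fenced code passes through untouched while filler is stripped from the prose around it.

```python
import re

# Hypothetical filler patterns; a real tool would need a far richer rule set.
FILLER = [
    r"could you please\s+",
    r"i hope you can help me with this[.!]?\s*",
    r"that's a great question[.!]?\s*",
]

FENCE = "`" * 3  # fence marker, built this way to keep the example self-contained
# The capturing group makes re.split keep the code blocks in the result.
CODE_BLOCK = re.compile("(" + FENCE + ".*?" + FENCE + ")", re.DOTALL)


def compress(prompt: str) -> str:
    """Strip filler from prose segments; pass fenced code through verbatim."""
    parts = CODE_BLOCK.split(prompt)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(part)  # odd indices are the captured code blocks
            continue
        for pattern in FILLER:
            part = re.sub(pattern, "", part, flags=re.IGNORECASE)
        out.append(part)
    return "".join(out)
```

Note that the much heavier rewriting shown in the before/after examples (turning a full sentence into "fn returns undefined, expect array, why?") goes well beyond pattern stripping; this sketch only covers the "never touch code" half of the contract.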

Step 2: Graphify — Replace Codebase Pasting with a Knowledge Graph

The second waste source is context re-establishment. Every new Claude session starts cold. Engineers compensate by pasting relevant files into the prompt — which burns tokens and often hits context limits before the actual question.

Graphify indexes your project into a local knowledge graph served via an MCP server. Instead of pasting code, you reference it:

analyze auth.py#validate_token — perf issue

The MCP server resolves auth.py#validate_token to the function, its dependencies, its callers, and recent git changes — then injects only the relevant subgraph into Claude's context.
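Conceptually, that resolution step is a small graph lookup. The sketch below is my own illustration under simplifying assumptions, not Graphify's internals: the Node class, the resolve function, and every symbol except auth.py#validate_token are hypothetical, and it ignores the git-history part entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    ref: str                                         # e.g. "auth.py#validate_token"
    calls: list[str] = field(default_factory=list)   # direct dependencies


def resolve(graph: dict[str, Node], ref: str) -> set[str]:
    """Return the target plus its direct dependencies and direct callers."""
    target = graph[ref]
    callers = {node.ref for node in graph.values() if ref in node.calls}
    return {ref, *target.calls, *callers}


# Hypothetical project graph for illustration only.
graph = {
    "auth.py#validate_token": Node("auth.py#validate_token", ["auth.py#decode_jwt"]),
    "auth.py#decode_jwt": Node("auth.py#decode_jwt"),
    "routes.py#login": Node("routes.py#login", ["auth.py#validate_token"]),
    "billing.py#charge": Node("billing.py#charge"),
}

subgraph = resolve(graph, "auth.py#validate_token")
```

Here subgraph contains the function, the one function it calls, and the one route that calls it. billing.py#charge is unrelated and stays out of the context, which is where the token savings come from.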

Setup:

npx graphify init  # indexes your project
npx graphify serve  # starts local MCP server on port 3456

Add to Claude Desktop's MCP config:

{
  "mcpServers": {
    "graphify": {
      "url": "http://localhost:3456"
    }
  }
}

Now Claude has persistent structural knowledge of your codebase without you pasting files. The graph updates incrementally on file saves.

Step 3: RULES.md Shorthand — Project-Specific Compression Dictionary

The third layer is a RULES.md file in your project root that defines shorthand specific to your stack and domain. Claude reads this at session start (one-time cost) and then interprets compressed references for the rest of the session.

# Project Shorthand
 
- crt fapi be w/ pg = Create a FastAPI backend with PostgreSQL
- add auth = Add JWT authentication with refresh tokens
- std err = Standard error response format (see errors.py)
- tst cov = Generate pytest tests with 80% coverage target
- db mgr = DatabaseManager class (see db/manager.py)
- prod cfg = Production config (see config/prod.yaml)

After loading RULES.md, a prompt like:

crt fapi be w/ pg, add auth, std err on all routes

...is fully interpretable. That's 12 tokens for what would otherwise be a 150-token specification.
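In practice Claude does this expansion itself after reading RULES.md. Still, it can help to see what the dictionary encodes. A minimal sketch, assuming the "- short = long" line format shown above; parse_rules and expand are hypothetical helpers, and the naive string replacement would need more care if one shorthand could appear inside another's expansion.

```python
RULES_MD = """\
# Project Shorthand

- crt fapi be w/ pg = Create a FastAPI backend with PostgreSQL
- add auth = Add JWT authentication with refresh tokens
- std err = Standard error response format (see errors.py)
"""


def parse_rules(text: str) -> dict[str, str]:
    """Read '- short = long' lines into a shorthand dictionary."""
    rules = {}
    for line in text.splitlines():
        if line.startswith("- ") and " = " in line:
            short, full = line[2:].split(" = ", 1)
            rules[short.strip()] = full.strip()
    return rules


def expand(prompt: str, rules: dict[str, str]) -> str:
    """Replace each shorthand with its full form, longest key first."""
    for short in sorted(rules, key=len, reverse=True):
        prompt = prompt.replace(short, rules[short])
    return prompt


rules = parse_rules(RULES_MD)
print(expand("crt fapi be w/ pg, add auth, std err on all routes", rules))
```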

The Numbers

Task	Before	After	Reduction
Simple API request	150 tokens	30 tokens	80%
Complex RAG setup	800 tokens	120 tokens	85%
Status update	50 tokens	5 tokens	90%
Architecture question	400 tokens	80 tokens	80%
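The reduction column is plain before/after arithmetic, so you can run the same calculation on your own measurements. A short sketch using the figures from the table (reduction_pct is an illustrative helper name):

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of tokens saved going from `before` to `after`."""
    return (before - after) * 100 / before


# Before/after token counts from the table above.
rows = {
    "Simple API request": (150, 30),
    "Complex RAG setup": (800, 120),
    "Status update": (50, 5),
    "Architecture question": (400, 80),
}

for task, (before, after) in rows.items():
    print(f"{task}: {reduction_pct(before, after):.0f}%")
```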

Response time improvement: approximately 2.4x faster end-to-end. The speed gain compounds with the token reduction — smaller requests process faster, and compressed responses stream faster.

Honest Caveats

If your workflow is input-heavy, output compression helps less. If you're pasting large codebases or uploading documents, the compression gains on prompt structure are a small fraction of your total token budget. Graphify helps here, but there's a ceiling.

Local LLMs are a partial solution. I've tested this stack with Gemma 4 via Ollama for the simple tasks. It works for routine operations but lacks the context-awareness of frontier models for anything requiring judgment. Don't try to run your architecture questions through a local model just to save tokens — you'll get worse answers and iterate more.

Team adoption has friction. RULES.md shorthand is powerful but requires everyone on the team to learn the dictionary. It works best for solo developers or tight teams with a shared domain vocabulary.

Caveman mode can be disorienting. Some people find compressed AI responses harder to read, especially when learning a new domain. I'd recommend full compression only for domains you know well.

Where It Works Best

This framework delivers the most value when:

- You're deep in a single codebase (Graphify pays off most)
- Your tasks are routine development operations (Caveman pays off most)
- You're cost-sensitive (free tier or pay-as-you-go API)
- You're at or near rate limits on Claude Pro

If you're doing open-ended research, exploring new domains, or producing content rather than code, the compression wins are smaller and the tradeoffs bite harder.

The framework is open source and evolving. If you're building internal AI tooling for a development team and want to talk through token optimization at scale — where it matters a lot more than it does for individual developers — let's talk.
