wdpkr: first thoughts

wdpkr is a command line tool we are building to solve a personal pain point with AI coding tools. When starting a task, a plan, or even a simple bug fix in a new Claude Code session, the agent spends an annoying amount of time "Exploring". A new session needs that explore step to gather context. Aside from memory files and rules in the project repo, an AI coding tool has no knowledge of the code that exists in the repo, so it must explore.

If my source code is knowledge in a library, Claude is using the World Book Encyclopedia set to find answers to my questions: hunting through various versions of the set, finding the right book, and flipping through the pages until it comes across the paragraph that contains the data it needs. wdpkr exists to give my Claude the equivalent of a search engine for the codebase.

More tokens for the token gods

Codebase exploration happens the same way anyone would explore a repo if they were dropped into a Linux terminal without a code editor: a series of ls, find, grep, and more, chained together back to back, searching for basic keywords to see what exists, what looks right, and what might be relevant. It works because it's simple; however, it consumes more time and tokens than I'd like.

Tokens are the unit of cost for these tools. At some point the flat monthly bill for my Claude Code could go away, and I'd be subject to the same usage-based API pricing that exists in every other AI tool. I don't want to be charged for excess token usage by the company that makes a tool that consumes a nondeterministic amount of tokens every time I use it.

A simple tool for a simple task

wdpkr (woodpecker without the vowels, how clever) is a command line tool for Claude to use during codebase exploration. Its main functions are index and search.

index can be done in full on a fresh codebase. Files can be walked in parallel and stored efficiently. After initial indexing, wdpkr can be added to a CI/CD pipeline to incrementally index changes as they are made.
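
To make the full-then-incremental idea concrete, here is a rough Python sketch. It is not wdpkr's actual implementation: the function names, the SHA-256 content hash, and the thread pool are all assumptions chosen for illustration. It hashes files in parallel, then compares two scans to decide what needs re-indexing.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def file_digest(path: Path) -> tuple[str, str]:
    """Return (path, sha256 of the file contents) for one file."""
    return str(path), hashlib.sha256(path.read_bytes()).hexdigest()


def scan(root: Path, suffixes: tuple[str, ...] = (".py",)) -> dict[str, str]:
    """Hash every matching file under root, reading files in parallel."""
    files = [p for p in root.rglob("*") if p.is_file() and p.suffix in suffixes]
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(file_digest, files))


def changed_files(
    previous: dict[str, str], current: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare two scans: (paths to re-index, paths to drop from the index)."""
    stale = sorted(p for p, digest in current.items() if previous.get(p) != digest)
    removed = sorted(p for p in previous if p not in current)
    return stale, removed
```

A CI job could persist the previous scan, run scan again on each change, and hand only the stale and removed paths to the indexer. A real pipeline might lean on the diff from version control instead of hashing everything.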

search provides semantic search for the requirements you've given an agent in your prompt. Search results include file paths, symbols, line numbers, and even dependency trees for code that matches. Like the bird it's named after, wdpkr taps through your codebase to find exactly where things live.
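
As an illustration of what a result carrying those fields might look like, here is a small Python sketch. The SearchResult class, its field names, and the render format are hypothetical. They are not wdpkr's real output schema, just one way to pack paths, symbols, line numbers, and dependencies into something compact for an agent to read.

```python
from dataclasses import dataclass, field


@dataclass
class SearchResult:
    path: str
    symbol: str
    start_line: int
    end_line: int
    score: float  # similarity to the query, higher is better
    calls: list[str] = field(default_factory=list)      # upstream dependencies
    called_by: list[str] = field(default_factory=list)  # downstream references


def render(results: list[SearchResult]) -> str:
    """Format results as compact text, best match first."""
    lines = []
    for r in sorted(results, key=lambda r: r.score, reverse=True):
        lines.append(f"{r.path}:{r.start_line}-{r.end_line} {r.symbol} ({r.score:.2f})")
        if r.calls:
            lines.append("  calls: " + ", ".join(r.calls))
        if r.called_by:
            lines.append("  called by: " + ", ".join(r.called_by))
    return "\n".join(lines)
```

The point of a shape like this is that one short block of text can replace several rounds of ls, find, and grep.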

How we've built the bird

wdpkr is built around a four-component architecture: chunk, summarize, embed, store. All four are used when indexing entries into its datastore, while fewer are needed for queries.

1. Chunking

Breaking a code file into searchable parts can be done in many ways. We've chosen to parse each file's abstract syntax tree (AST) to drive the primary chunking strategy in wdpkr, with support for various languages. AST parsing provides a fast, deterministic, and mostly language-agnostic way to chunk files for storage.

The added benefit of AST parsing is the ability to capture a dependency graph for all nodes. For each function parsed, we capture a list of imports and references (upstream dependencies that the function calls and downstream references to the function). This information is held as metadata for each stored chunk.

2. Summarizing

To improve semantic search over a codebase, we generate natural language descriptions of stored chunks, which improves how those chunks are embedded and stored. Long term, this step could be skipped by using the docstrings attached to functions; however, that requires a thoroughly documented codebase.

To give embeddings the best chance of being accurate for semantic search, a small LLM generates and stores a summary of each chunk. This is easily done with a Claude Haiku type model. It feels counterintuitive to use an LLM to index symbols for LLM-based search, but wdpkr is still finding its footing. We will continue to experiment with variations of this.
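
One variation worth sketching is the docstring shortcut: use a function's own docstring when it has one, and only pay for a generated summary when it doesn't. The describe function below is a hypothetical illustration, not wdpkr's code. The summarize argument stands in for whatever small-LLM call you'd plug in, and the sketch assumes each chunk is a single Python function or class.

```python
import ast
from typing import Callable


def describe(chunk_source: str, summarize: Callable[[str], str]) -> str:
    """Prefer the chunk's own docstring; fall back to a generated summary."""
    node = ast.parse(chunk_source).body[0]
    docstring = None
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        docstring = ast.get_docstring(node)
    # An empty or missing docstring falls through to the summarizer.
    return docstring if docstring else summarize(chunk_source)


def placeholder_summarizer(source: str) -> str:
    """Stand-in for an LLM call so the sketch runs without API keys."""
    first_line = source.strip().splitlines()[0]
    return f"Undocumented code starting with: {first_line}"
```

In a well-documented codebase most chunks would take the free path, which is the long-term hope described above.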

3. Embedding

Semantic search requires source data to be converted to vectors for storage. To achieve this, we reach for an off-the-shelf model. Our current preference is voyage-code-3, which is optimized for code retrieval, exactly the job wdpkr needs it for. An added benefit of this VoyageAI model is that it comes with 200 million free tokens. We have not yet made a dent in this free allotment.

4. Storing

Vector storage is the final piece of enabling semantic search over a codebase. Our default vector database provider is turbopuffer, which provides high availability and incredibly fast warm cache performance.

Vector storage is intended to be configurable and swappable. We are actively building support for locally hosted databases, which may be a better fit for single-contributor codebases.
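
A swappable store mostly comes down to agreeing on a small interface. The sketch below is an assumption about what that could look like, not wdpkr's actual abstraction: VectorStore, InMemoryStore, and their method names are hypothetical, and the in-memory version uses brute-force cosine similarity, which is only reasonable for small codebases.

```python
import math
from typing import Protocol


class VectorStore(Protocol):
    """The minimal surface a backend would need to offer."""

    def upsert(self, chunk_id: str, vector: list[float], metadata: dict) -> None: ...

    def query(self, vector: list[float], top_k: int = 5) -> list[tuple[str, float]]: ...


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Local stand-in that satisfies VectorStore with a plain dict."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, chunk_id: str, vector: list[float], metadata: dict) -> None:
        self._rows[chunk_id] = (vector, metadata)

    def query(self, vector: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        scored = [(cid, cosine(vector, v)) for cid, (v, _) in self._rows.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```

With an interface like this, a hosted provider and a local database become two classes with the same two methods, and the rest of the tool doesn't need to know which one it's talking to.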

How it's going

My initial struggle with wdpkr was getting Claude to actually use the CLI tool. I've watched several tasks where it either skipped semantic search altogether or gave up after one attempt. I've resorted to interviewing Claude in my sessions to understand its reasoning for using wdpkr the way it does.

Today, I'm seeing some success. Semantic search is the first tool used by my Claude Code early in sessions. The time to "Explore" relevant parts of the codebase feels faster.

However, I don't have the data to back this up. Evaluating wdpkr's effectiveness in token reduction is going to be our main focus next. Saving time and money is a big claim to make for a tool, and we need to back it up with proof.

What is next

Today's release of wdpkr is an MVP with a working implementation of all components. It's available to install and configure for your own codebase (BYO Anthropic, VoyageAI, and turbopuffer keys). Come find installation and usage directions in our repo. If you do use wdpkr, please let me know your thoughts. We'd love the feedback. This tool was thoroughly planned and quickly vibe coded. We will continue to experiment with wdpkr to understand both its effectiveness today and where we can make improvements for tomorrow.