Skip to main content

Codebase Indexing

Codebase Indexing transforms how Roo Code understands your project by creating a semantic search index using AI embeddings. Instead of searching for exact text matches, it understands the meaning of your queries, helping Roo find relevant code even when you don't know specific function names or file locations.

Codebase Indexing Status Icon

What It Doesโ€‹

When enabled, the indexing system:

  1. Parses your code using Tree-sitter to identify semantic blocks (functions, classes, methods)
  2. Creates embeddings of each code block using AI models
  3. Stores vectors in a Qdrant database for fast similarity search
  4. Provides the codebase_search tool to Roo for intelligent code discovery

This enables natural language queries like "user authentication logic" or "database connection handling" to find relevant code across your entire project.


Quick Start Guideโ€‹

๐Ÿ’ฐ Completely Free Setup Available

You can set up codebase indexing at zero cost by using:

  • Qdrant Cloud (free tier) or Docker Qdrant (completely free)
  • Google Gemini (currently free)

This gives you professional-grade semantic search without any subscription fees!

Step 1: Choose Your Setupโ€‹

Before enabling codebase indexing, you'll need two components:

  1. An Embedding Provider - to convert code into searchable vectors
  2. A Vector Database - to store and search those vectors

Step 2: Set Up Qdrant (Vector Database)โ€‹

  1. Sign up at Qdrant Cloud (free tier available)
  2. Create a cluster
  3. Copy your URL and API key

Option B: Local Setup - FREEโ€‹

Using Docker:

docker run -p 6333:6333 qdrant/qdrant

Using Docker Compose:

version: '3.8'
services:
qdrant:
image: qdrant/qdrant
ports:
- "6333:6333"
volumes:
- qdrant_storage:/qdrant/storage
volumes:
qdrant_storage:

Step 3: Set Up an Embedding Providerโ€‹

  1. Get an API key from Google AI Studio (currently free)
  2. In Roo Code settings:
    • Provider: Google Gemini
    • API Key: Your Google AI Studio key
Other Providers Available

While this guide focuses on Google Gemini since it's currently free, Roo Code also supports OpenAI, Ollama, and OpenAI-compatible providers. You can explore these options in the configuration dropdown.

Step 4: Saveโ€‹

  1. Click Save and Start Indexing

The status indicator will show:

  • Yellow (Indexing): Currently processing files
  • Green (Indexed): Ready for searches
  • Red (Error): Check troubleshooting section

Managing and Configuring the Indexerโ€‹

You can monitor the status and manage all configuration for the codebase indexer directly from the Roo Code chat interface.

The Status Iconโ€‹

At the bottom-right corner of the chat input, you'll find the Codebase Indexing status icon. This icon provides a quick, at-a-glance overview of the indexer's current state.

Codebase Indexing Status Icon

The color of the icon indicates the state:

  • ๐ŸŸข Green: Indexed. The index is up-to-date and ready for search.
  • ๐ŸŸก Yellow: Indexing. The system is actively processing files. Searches can still be performed, but results may be incomplete.
  • ๐Ÿ”ด Red: Error. An issue has occurred (e.g., failed to connect to Qdrant or the embedding provider). See the Troubleshooting section for help.
  • โšช Gray: Standby. The indexer is waiting for configuration or has been disabled.

The Configuration Popoverโ€‹

Clicking the status icon opens the main configuration popover. Here, you can view the detailed status and manage all settings.

Codebase Indexing Popover
  • Status: A detailed message showing the current state, such as "Indexed - File watcher started" or the progress of an ongoing scan.
  • Setup: Contains the primary fields for connecting to your embedding provider and vector database.
  • Advanced Configuration: Allows you to fine-tune search parameters like the similarity threshold.
  • Clear Index Data: Deletes all data from the Qdrant collection and clears the local file cache. Use this when you want to re-index your entire project from scratch. This action cannot be undone.
  • Save: Applies your configuration changes. If a critical setting (like an API key or a model) is changed, the indexer will automatically restart.

Detailed Configuration Fieldsโ€‹

This guide explains each setting available in the configuration popover.

Codebase Indexing Configuration Details

Setup Fieldsโ€‹

  • Embedder Provider

    • Purpose: To select your source for generating AI embeddings.
    • Behavior: This dropdown menu determines which configuration fields are shown. Your options are OpenAI, Google Gemini, Ollama, and OpenAI Compatible.
  • API Key (for OpenAI, Gemini, OpenAI Compatible)

    • Purpose: The secret key to authenticate with your chosen provider.
    • Behavior: This input is required for all cloud-based providers and is stored securely in your VS Code secret storage.
  • Base URL (for Ollama, OpenAI Compatible)

    • Purpose: The endpoint for connecting to the provider's API.
    • Behavior: For Ollama, this is typically http://localhost:11434. For OpenAI Compatible providers like Azure, this is the full deployment URL.
  • Model

    • Purpose: To select the specific embedding model you want to use.
    • Behavior: The list of available models changes based on the selected provider. The model's vector dimension (e.g., 1536 dimensions) is displayed, as changing dimensions requires a full re-index.
  • Qdrant URL

    • Purpose: The connection endpoint for your Qdrant vector database.
    • Behavior: This must be a valid URL pointing to your local or cloud-based Qdrant instance (e.g., http://localhost:6333).
  • Qdrant API Key

    • Purpose: The authentication key for a secured Qdrant instance.
    • Behavior: This field is optional and should only be used if your Qdrant deployment requires an API key.

Advanced Configuration Fieldsโ€‹

  • Search Score Threshold

    • Purpose: Controls the minimum similarity score required for a code snippet to be considered a match.
    • Behavior: Use the slider to set a value between 0.0 and 1.0. A lower value returns more (but potentially less relevant) results, while a higher value returns fewer, more precise results.
    • Recommended Settings:
      • Low (0.15-0.3): Broader results, good for exploration
      • Medium (0.4-0.5): Balanced precision and recall (default: 0.4)
      • High (0.6-0.8): Precise matches only
  • Maximum Search Results

    • Purpose: Sets the maximum number of code snippets returned by a single codebase_search.
    • Behavior: Use the slider to adjust the limit. This helps control the amount of context provided to the AI.

Key Benefitsโ€‹

  • Semantic Search: Find code by meaning, not just keywords
  • Enhanced AI Understanding: Roo can better comprehend and work with your codebase
  • Cross-Project Discovery: Search across all files, not just what's open
  • Pattern Recognition: Locate similar implementations and code patterns

How Files Are Processedโ€‹

Smart Code Parsingโ€‹

The system uses a sophisticated parsing strategy:

  1. Tree-sitter First: For supported languages, it uses AST parsing to identify semantic code blocks (functions, classes, methods)
  2. Markdown Support: Indexes Markdown files by treating headers as semantic entry points
  3. Intelligent Fallback: For unsupported file types, it falls back to line-based chunking

Block Sizing:

  • Minimum: 100 characters
  • Maximum: 1,000 characters
  • Large functions are split intelligently at logical boundaries

File Filteringโ€‹

The indexer respects your project's ignore patterns:

  • Files matching .gitignore patterns
  • Files matching .rooignore patterns
  • Binary files and images
  • Files larger than 1MB

Important: Ensure your .gitignore includes common dependency folders like node_modules, vendor, target, etc., as the system relies exclusively on these patterns for filtering.

Incremental Updatesโ€‹

  • File Watching: Monitors your workspace for changes in real-time
  • Smart Updates: Only reprocesses modified files
  • Branch Aware: Automatically handles Git branch switches
  • Hash-based Caching: Avoids reprocessing unchanged content

Best Practicesโ€‹

Writing Effective Queriesโ€‹

Instead of searching for exact syntax:

  • โŒ const getUser
  • โœ… function to fetch user from database

Use natural language descriptions:

  • "authentication middleware"
  • "error handling for API requests"
  • "database connection setup"

Security Considerationsโ€‹

  • API Keys: Stored securely in VS Code's encrypted storage
  • Code Privacy: Only small code snippets sent for embedding
  • Local Processing: All parsing happens locally
  • Access Control: Respects file permissions and ignore patterns

Troubleshootingโ€‹

Connection Issuesโ€‹

"Connection to Qdrant failed"

  • Ensure Qdrant is running (docker ps to check)
  • Verify URL matches (default: http://localhost:6333)
  • Check firewall/network policies
  • For cloud instances, confirm URL and API key

"Invalid API Key" or "401 Unauthorized"

  • Double-check your API key is correct
  • Ensure the key has necessary permissions
  • For Ollama, verify the service is running

Model Issuesโ€‹

"Model Not Found"

  • For Google Gemini: Ensure the model name is correct (e.g., text-embedding-004)
  • For other providers: Consult their documentation for available models and proper naming

Indexing Issuesโ€‹

"Stuck in Error State"

  1. Check connection issues first
  2. Click "Clear Index & Re-index" in settings
  3. This resolves corrupted cache or collection issues

"Indexing Taking Too Long"

  • Normal for large codebases (10k+ files)
  • Check .gitignore includes large directories
  • Consider adding patterns to .rooignore

Using the Search Featureโ€‹

Once indexed, Roo can use the codebase_search tool:

Example Natural Language Queries:

  • "How is user authentication handled?"
  • "Database connection setup"
  • "Error handling patterns"
  • "API endpoint definitions"
  • "Component state management"

The tool provides:

  • Relevant code snippets
  • File paths with line numbers
  • Similarity scores
  • Direct navigation links

Privacy & Data Securityโ€‹

Your code stays private:

  • Only small code chunks (100-1000 chars) sent for embedding
  • Embeddings are one-way mathematical representations
  • Local parsing means full files never leave your machine
  • Use Ollama for completely offline operation

Data Storage:

  • Vectors stored in your chosen Qdrant instance
  • You control where data lives (local/cloud)
  • Easy to delete: just clear the index

Current Limitationsโ€‹

  • File Size: 1MB maximum per file
  • Single Workspace: One workspace indexed at a time
  • External Dependencies: Requires embedding provider + Qdrant
  • Language Support: Best results with Tree-sitter supported languages

Future Enhancementsโ€‹

Planned improvements:

  • Additional embedding providers
  • Multi-workspace indexing
  • Enhanced filtering options
  • Team collaboration features
  • VS Code native search integration
  • Incremental re-indexing optimizations