Skip to content
Go back

Understanding Database Indexes: B-Trees vs Hash Indexes

Indexes are the most powerful tool for database performance optimization, yet they’re often misunderstood. This guide explains how indexes work internally and when to use different index types.

What is a Database Index?

A database index is a data structure that improves the speed of data retrieval operations at the cost of additional writes and storage space. Think of it like a book’s index - instead of scanning every page to find a topic, you jump directly to the relevant pages.

The Cost-Benefit Tradeoff

Benefits:

Costs:

B-Tree Indexes: The Workhorse

B-tree (Balanced Tree) indexes are the default in most databases (PostgreSQL, MySQL, SQLite) for good reason - they’re versatile and efficient.

How B-Trees Work

A B-tree maintains sorted data in a tree structure:

                [50, 100]
               /    |    \
         [20,30]  [75]  [110,120]
         /  |  \   |  \   /  |  \
      [10] [25] [35] [70] [90] [105] [115] [125]

Each node contains:

B-Tree Operations Complexity

When to Use B-Tree Indexes

B-trees excel at:

  1. Equality searches:
SELECT * FROM users WHERE id = 12345;
  1. Range queries:
SELECT * FROM orders WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31';
  1. Prefix matching:
SELECT * FROM products WHERE name LIKE 'iPhone%';
  1. Sorting operations:
SELECT * FROM users ORDER BY last_name, first_name;

B-Tree Index Example

-- Create a B-tree index (default type)
CREATE INDEX idx_users_email ON users(email);

-- PostgreSQL uses B-tree by default
-- MySQL InnoDB also uses B-tree (called B+tree)

-- Verify index usage
EXPLAIN SELECT * FROM users WHERE email = 'john@example.com';

Output shows Index Scan using idx_users_email - the optimizer chose the index.

Composite B-Tree Indexes

B-trees support multi-column indexes with a crucial property: leftmost prefix rule.

CREATE INDEX idx_users_name ON users(last_name, first_name, middle_name);

-- These queries use the index:
SELECT * FROM users WHERE last_name = 'Smith';
SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'John';
SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'John' AND middle_name = 'A';

-- This query DOES NOT use the index efficiently:
SELECT * FROM users WHERE first_name = 'John';

The index is sorted by last_name first, so querying only by first_name can’t leverage the index structure.

Hash Indexes: Fast Equality Checks

Hash indexes use a hash function to map keys to buckets, enabling O(1) average-case lookup time.

How Hash Indexes Work

Hash Function: h(key) = key mod 10

Keys: [23, 45, 12, 67, 34]
Hash Table:
[0] -> null
[1] -> null
[2] -> 12 -> 23
[3] -> 34
[4] -> 45
[5] -> null
[6] -> null
[7] -> 67
[8] -> null
[9] -> null

When searching for key 45:

  1. Compute hash: h(45) = 45 mod 10 = 5
  2. Look up bucket 5
  3. Find value directly (or scan bucket if collision)

Hash Index Limitations

Hash indexes have significant restrictions:

  1. Only equality comparisons:
-- Works with hash index
SELECT * FROM users WHERE id = 12345;

-- Does NOT work with hash index
SELECT * FROM users WHERE id > 12345;
SELECT * FROM users WHERE id BETWEEN 10000 AND 20000;
  1. No sorting support: Can’t use hash indexes for ORDER BY

  2. No prefix matching: Can’t match partial strings

  3. No collision resistance: Multiple keys may hash to the same bucket, requiring additional lookups

When to Use Hash Indexes

Hash indexes are optimal for:

  1. Exact match lookups on high-cardinality columns
  2. Equality joins in data warehouses
  3. In-memory databases where disk I/O isn’t a concern

Hash Index Example (PostgreSQL)

-- Create a hash index
CREATE INDEX idx_sessions_token ON sessions USING hash(session_token);

-- Good use case: session token lookup
SELECT * FROM sessions WHERE session_token = 'abc123def456';

-- Bad use case: range query (won't use index)
SELECT * FROM sessions WHERE session_token > 'abc';

Important: PostgreSQL hash indexes were not WAL-logged until version 10, meaning they weren’t crash-safe. Most use cases are better served by B-tree indexes.

MySQL and Hash Indexes

MySQL’s InnoDB storage engine doesn’t support explicit hash indexes, but uses adaptive hash indexing internally:

Performance Comparison

Let’s compare B-tree vs Hash indexes with a benchmark on 10 million rows:

Benchmark Setup

-- Create test table
CREATE TABLE benchmark (
    id SERIAL PRIMARY KEY,
    lookup_key VARCHAR(50),
    data TEXT
);

-- Insert 10 million rows
INSERT INTO benchmark (lookup_key, data)
SELECT
    md5(random()::text),
    repeat('x', 100)
FROM generate_series(1, 10000000);

-- Create indexes
CREATE INDEX idx_btree ON benchmark(lookup_key);
CREATE INDEX idx_hash ON benchmark USING hash(lookup_key);

ANALYZE benchmark;

Results: Equality Lookup

EXPLAIN ANALYZE
SELECT * FROM benchmark WHERE lookup_key = 'abc123';
Index TypeQuery TimeIndex Size
No Index1,234 msN/A
B-Tree0.08 ms214 MB
Hash0.06 ms198 MB

Winner: Hash index is 25% faster for equality lookups.

Results: Range Query

EXPLAIN ANALYZE
SELECT * FROM benchmark
WHERE lookup_key BETWEEN 'aaa' AND 'bbb';
Index TypeQuery Time
No Index1,456 ms
B-Tree45 ms
Hash1,398 ms

Winner: B-tree index. Hash index provides no benefit for range queries.

Results: ORDER BY

EXPLAIN ANALYZE
SELECT * FROM benchmark ORDER BY lookup_key LIMIT 100;
Index TypeQuery Time
No Index2,345 ms
B-Tree0.12 ms
Hash2,298 ms

Winner: B-tree index. Hash index can’t help with sorting.

Other Index Types

Modern databases offer specialized indexes for specific use cases:

GIN (Generalized Inverted Index)

For full-text search and array/JSON queries:

-- Full-text search
CREATE INDEX idx_articles_content ON articles USING gin(to_tsvector('english', content));

SELECT * FROM articles WHERE to_tsvector('english', content) @@ to_tsquery('database & performance');

-- JSONB queries
CREATE INDEX idx_users_metadata ON users USING gin(metadata);

SELECT * FROM users WHERE metadata @> '{"premium": true}';

GiST (Generalized Search Tree)

For geometric data, full-text, and range types:

-- Geometric queries
CREATE INDEX idx_locations_coords ON locations USING gist(coordinates);

SELECT * FROM locations WHERE coordinates && box '((0,0),(1,1))';

BRIN (Block Range Index)

For very large tables with naturally ordered data:

-- Efficient for time-series data
CREATE INDEX idx_logs_created ON logs USING brin(created_at);

-- Tiny index size (1000x smaller than B-tree)
-- Works well when data is mostly sorted by indexed column

Index Maintenance Best Practices

Monitor Index Usage

Find unused indexes:

SELECT
    schemaname,
    tablename,
    indexname,
    idx_scan as scans,
    pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;

Indexes with zero scans are candidates for removal.

Rebuild Bloated Indexes

Over time, indexes become fragmented:

-- Check index bloat
SELECT
    indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE schemaname = 'public';

-- Rebuild without locking (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_users_email;

Avoid Over-Indexing

Each index has a cost:

-- Bad: Too many single-column indexes
CREATE INDEX idx_users_first_name ON users(first_name);
CREATE INDEX idx_users_last_name ON users(last_name);
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_city ON users(city);
CREATE INDEX idx_users_state ON users(state);

-- Better: Strategic composite index for common query patterns
CREATE INDEX idx_users_location ON users(state, city);
CREATE INDEX idx_users_email ON users(email); -- Keep for unique constraint

Aim for 3-5 indexes per table in most cases.

Choosing the Right Index Type

Use this decision tree:

  1. Do you need range queries or sorting?

    • Yes → B-tree index
    • No → Continue
  2. Is it an equality-only lookup on a high-cardinality column?

    • Yes → Hash index (if your database supports it well)
    • No → Continue
  3. Is it full-text search?

    • Yes → GIN or full-text index
    • No → Continue
  4. Is it geometric/spatial data?

    • Yes → GiST or spatial index
    • No → Continue
  5. Is it a very large table with naturally sorted data?

    • Yes → BRIN index
    • No → Default to B-tree

Rule of thumb: When in doubt, use B-tree. It’s the most versatile and well-optimized index type.

Conclusion

Database indexes are essential for performance, but they’re not magic bullets:

The key to effective indexing is understanding your query patterns, measuring performance, and optimizing strategically. Don’t create indexes blindly - profile your queries, identify bottlenecks, and index deliberately.

Start with B-tree indexes for your most common query patterns, measure their impact, and iterate from there.


Share this post on:

Previous Post
PostgreSQL Performance Tuning: A Comprehensive Guide
Next Post
MongoDB Aggregation Pipeline: Advanced Patterns and Optimization