Understanding Database Indexes: B-Trees vs Hash Indexes

Indexes are the most powerful tool for database performance optimization, yet they’re often misunderstood. This guide explains how indexes work internally and when to use different index types.

What is a Database Index?

A database index is a data structure that improves the speed of data retrieval operations at the cost of additional writes and storage space. Think of it like a book’s index - instead of scanning every page to find a topic, you jump directly to the relevant pages.

The Cost-Benefit Tradeoff

Benefits:

Dramatically faster SELECT queries
Efficient ORDER BY operations
Quick uniqueness checks

Costs:

Slower INSERT, UPDATE, DELETE operations
Additional disk space (10-30% overhead)
Memory usage for index cache

B-Tree Indexes: The Workhorse

B-tree (Balanced Tree) indexes are the default in most databases (PostgreSQL, MySQL, SQLite) for good reason - they’re versatile and efficient.

How B-Trees Work

A B-tree maintains sorted data in a tree structure:

                [50, 100]
               /    |    \
         [20,30]  [75]  [110,120]
         /  |  \   |  \   /  |  \
      [10] [25] [35] [70] [90] [105] [115] [125]

Each node contains:

Keys: Sorted values from the indexed column(s)
Pointers: References to child nodes or table rows

B-Tree Operations Complexity

Search: O(log n) - Traverse from root to leaf
Insert: O(log n) - Find position + potential rebalancing
Delete: O(log n) - Find node + rebalancing
Range scan: O(log n + k) - Where k is the number of results

When to Use B-Tree Indexes

B-trees excel at:

Equality searches:

SELECT * FROM users WHERE id = 12345;

Range queries:

SELECT * FROM orders WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31';

Prefix matching:

SELECT * FROM products WHERE name LIKE 'iPhone%';

Sorting operations:

SELECT * FROM users ORDER BY last_name, first_name;

B-Tree Index Example

-- Create a B-tree index (default type)
CREATE INDEX idx_users_email ON users(email);

-- PostgreSQL uses B-tree by default
-- MySQL InnoDB also uses B-tree (called B+tree)

-- Verify index usage
EXPLAIN SELECT * FROM users WHERE email = 'john@example.com';

Output shows Index Scan using idx_users_email - the optimizer chose the index.

Composite B-Tree Indexes

B-trees support multi-column indexes with a crucial property: leftmost prefix rule.

CREATE INDEX idx_users_name ON users(last_name, first_name, middle_name);

-- These queries use the index:
SELECT * FROM users WHERE last_name = 'Smith';
SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'John';
SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'John' AND middle_name = 'A';

-- This query DOES NOT use the index efficiently:
SELECT * FROM users WHERE first_name = 'John';

The index is sorted by last_name first, so querying only by first_name can’t leverage the index structure.

Hash Indexes: Fast Equality Checks

Hash indexes use a hash function to map keys to buckets, enabling O(1) average-case lookup time.

How Hash Indexes Work

Hash Function: h(key) = key mod 10

Keys: [23, 45, 12, 67, 34]
Hash Table:
[0] -> null
[1] -> null
[2] -> 12 -> 23
[3] -> 34
[4] -> 45
[5] -> null
[6] -> null
[7] -> 67
[8] -> null
[9] -> null

When searching for key 45:

Compute hash: h(45) = 45 mod 10 = 5
Look up bucket 5
Find value directly (or scan bucket if collision)

Hash Index Limitations

Hash indexes have significant restrictions:

Only equality comparisons:

-- Works with hash index
SELECT * FROM users WHERE id = 12345;

-- Does NOT work with hash index
SELECT * FROM users WHERE id > 12345;
SELECT * FROM users WHERE id BETWEEN 10000 AND 20000;

No sorting support: Can’t use hash indexes for ORDER BY
No prefix matching: Can’t match partial strings
No collision resistance: Multiple keys may hash to the same bucket, requiring additional lookups

When to Use Hash Indexes

Hash indexes are optimal for:

Exact match lookups on high-cardinality columns
Equality joins in data warehouses
In-memory databases where disk I/O isn’t a concern

Hash Index Example (PostgreSQL)

-- Create a hash index
CREATE INDEX idx_sessions_token ON sessions USING hash(session_token);

-- Good use case: session token lookup
SELECT * FROM sessions WHERE session_token = 'abc123def456';

-- Bad use case: range query (won't use index)
SELECT * FROM sessions WHERE session_token > 'abc';

Important: PostgreSQL hash indexes were not WAL-logged until version 10, meaning they weren’t crash-safe. Most use cases are better served by B-tree indexes.

MySQL and Hash Indexes

MySQL’s InnoDB storage engine doesn’t support explicit hash indexes, but uses adaptive hash indexing internally:

Automatically creates in-memory hash indexes for frequently accessed B-tree index pages
Completely transparent to users
Disabled with innodb_adaptive_hash_index = OFF

Performance Comparison

Let’s compare B-tree vs Hash indexes with a benchmark on 10 million rows:

Benchmark Setup

-- Create test table
CREATE TABLE benchmark (
    id SERIAL PRIMARY KEY,
    lookup_key VARCHAR(50),
    data TEXT
);

-- Insert 10 million rows
INSERT INTO benchmark (lookup_key, data)
SELECT
    md5(random()::text),
    repeat('x', 100)
FROM generate_series(1, 10000000);

-- Create indexes
CREATE INDEX idx_btree ON benchmark(lookup_key);
CREATE INDEX idx_hash ON benchmark USING hash(lookup_key);

ANALYZE benchmark;

Results: Equality Lookup

EXPLAIN ANALYZE
SELECT * FROM benchmark WHERE lookup_key = 'abc123';

Index Type	Query Time	Index Size
No Index	1,234 ms	N/A
B-Tree	0.08 ms	214 MB
Hash	0.06 ms	198 MB

Winner: Hash index is 25% faster for equality lookups.

Results: Range Query

EXPLAIN ANALYZE
SELECT * FROM benchmark
WHERE lookup_key BETWEEN 'aaa' AND 'bbb';

Index Type	Query Time
No Index	1,456 ms
B-Tree	45 ms
Hash	1,398 ms

Winner: B-tree index. Hash index provides no benefit for range queries.

Results: ORDER BY

EXPLAIN ANALYZE
SELECT * FROM benchmark ORDER BY lookup_key LIMIT 100;

Index Type	Query Time
No Index	2,345 ms
B-Tree	0.12 ms
Hash	2,298 ms

Winner: B-tree index. Hash index can’t help with sorting.

Other Index Types

Modern databases offer specialized indexes for specific use cases:

GIN (Generalized Inverted Index)

For full-text search and array/JSON queries:

-- Full-text search
CREATE INDEX idx_articles_content ON articles USING gin(to_tsvector('english', content));

SELECT * FROM articles WHERE to_tsvector('english', content) @@ to_tsquery('database & performance');

-- JSONB queries
CREATE INDEX idx_users_metadata ON users USING gin(metadata);

SELECT * FROM users WHERE metadata @> '{"premium": true}';

GiST (Generalized Search Tree)

For geometric data, full-text, and range types:

-- Geometric queries
CREATE INDEX idx_locations_coords ON locations USING gist(coordinates);

SELECT * FROM locations WHERE coordinates && box '((0,0),(1,1))';

BRIN (Block Range Index)

For very large tables with naturally ordered data:

-- Efficient for time-series data
CREATE INDEX idx_logs_created ON logs USING brin(created_at);

-- Tiny index size (1000x smaller than B-tree)
-- Works well when data is mostly sorted by indexed column

Index Maintenance Best Practices

Monitor Index Usage

Find unused indexes:

SELECT
    schemaname,
    tablename,
    indexname,
    idx_scan as scans,
    pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;

Indexes with zero scans are candidates for removal.

Rebuild Bloated Indexes

Over time, indexes become fragmented:

-- Check index bloat
SELECT
    indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE schemaname = 'public';

-- Rebuild without locking (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_users_email;

Avoid Over-Indexing

Each index has a cost:

-- Bad: Too many single-column indexes
CREATE INDEX idx_users_first_name ON users(first_name);
CREATE INDEX idx_users_last_name ON users(last_name);
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_city ON users(city);
CREATE INDEX idx_users_state ON users(state);

-- Better: Strategic composite index for common query patterns
CREATE INDEX idx_users_location ON users(state, city);
CREATE INDEX idx_users_email ON users(email); -- Keep for unique constraint

Aim for 3-5 indexes per table in most cases.

Choosing the Right Index Type

Use this decision tree:

Do you need range queries or sorting?
- Yes → B-tree index
- No → Continue
Is it an equality-only lookup on a high-cardinality column?
- Yes → Hash index (if your database supports it well)
- No → Continue
Is it full-text search?
- Yes → GIN or full-text index
- No → Continue
Is it geometric/spatial data?
- Yes → GiST or spatial index
- No → Continue
Is it a very large table with naturally sorted data?
- Yes → BRIN index
- No → Default to B-tree

Rule of thumb: When in doubt, use B-tree. It’s the most versatile and well-optimized index type.

Conclusion

Database indexes are essential for performance, but they’re not magic bullets:

B-tree indexes are the default choice for good reason - they handle almost every use case efficiently
Hash indexes provide marginal benefits for equality-only lookups but sacrifice versatility
Specialized indexes (GIN, GiST, BRIN) solve specific problems brilliantly

The key to effective indexing is understanding your query patterns, measuring performance, and optimizing strategically. Don’t create indexes blindly - profile your queries, identify bottlenecks, and index deliberately.

Start with B-tree indexes for your most common query patterns, measure their impact, and iterate from there.