Indexes are the most powerful tool for database performance optimization, yet they’re often misunderstood. This guide explains how indexes work internally and when to use different index types.
What is a Database Index?
A database index is a data structure that improves the speed of data retrieval operations at the cost of additional writes and storage space. Think of it like a book’s index - instead of scanning every page to find a topic, you jump directly to the relevant pages.
The Cost-Benefit Tradeoff
Benefits:
- Dramatically faster SELECT queries
- Efficient ORDER BY operations
- Quick uniqueness checks
Costs:
- Slower INSERT, UPDATE, DELETE operations
- Additional disk space (10-30% overhead)
- Memory usage for index cache
B-Tree Indexes: The Workhorse
B-tree (Balanced Tree) indexes are the default in most databases (PostgreSQL, MySQL, SQLite) for good reason - they’re versatile and efficient.
How B-Trees Work
A B-tree maintains sorted data in a tree structure:
[50, 100]
/ | \
[20,30] [75] [110,120]
/ | \ | \ / | \
[10] [25] [35] [70] [90] [105] [115] [125]
Each node contains:
- Keys: Sorted values from the indexed column(s)
- Pointers: References to child nodes or table rows
B-Tree Operations Complexity
- Search: O(log n) - Traverse from root to leaf
- Insert: O(log n) - Find position + potential rebalancing
- Delete: O(log n) - Find node + rebalancing
- Range scan: O(log n + k) - Where k is the number of results
When to Use B-Tree Indexes
B-trees excel at:
- Equality searches:
SELECT * FROM users WHERE id = 12345;
- Range queries:
SELECT * FROM orders WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31';
- Prefix matching:
SELECT * FROM products WHERE name LIKE 'iPhone%';
- Sorting operations:
SELECT * FROM users ORDER BY last_name, first_name;
B-Tree Index Example
-- Create a B-tree index (default type)
CREATE INDEX idx_users_email ON users(email);
-- PostgreSQL uses B-tree by default
-- MySQL InnoDB also uses B-tree (called B+tree)
-- Verify index usage
EXPLAIN SELECT * FROM users WHERE email = 'john@example.com';
Output shows Index Scan using idx_users_email - the optimizer chose the index.
Composite B-Tree Indexes
B-trees support multi-column indexes with a crucial property: leftmost prefix rule.
CREATE INDEX idx_users_name ON users(last_name, first_name, middle_name);
-- These queries use the index:
SELECT * FROM users WHERE last_name = 'Smith';
SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'John';
SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'John' AND middle_name = 'A';
-- This query DOES NOT use the index efficiently:
SELECT * FROM users WHERE first_name = 'John';
The index is sorted by last_name first, so querying only by first_name can’t leverage the index structure.
Hash Indexes: Fast Equality Checks
Hash indexes use a hash function to map keys to buckets, enabling O(1) average-case lookup time.
How Hash Indexes Work
Hash Function: h(key) = key mod 10
Keys: [23, 45, 12, 67, 34]
Hash Table:
[0] -> null
[1] -> null
[2] -> 12 -> 23
[3] -> 34
[4] -> 45
[5] -> null
[6] -> null
[7] -> 67
[8] -> null
[9] -> null
When searching for key 45:
- Compute hash:
h(45) = 45 mod 10 = 5 - Look up bucket 5
- Find value directly (or scan bucket if collision)
Hash Index Limitations
Hash indexes have significant restrictions:
- Only equality comparisons:
-- Works with hash index
SELECT * FROM users WHERE id = 12345;
-- Does NOT work with hash index
SELECT * FROM users WHERE id > 12345;
SELECT * FROM users WHERE id BETWEEN 10000 AND 20000;
-
No sorting support: Can’t use hash indexes for ORDER BY
-
No prefix matching: Can’t match partial strings
-
No collision resistance: Multiple keys may hash to the same bucket, requiring additional lookups
When to Use Hash Indexes
Hash indexes are optimal for:
- Exact match lookups on high-cardinality columns
- Equality joins in data warehouses
- In-memory databases where disk I/O isn’t a concern
Hash Index Example (PostgreSQL)
-- Create a hash index
CREATE INDEX idx_sessions_token ON sessions USING hash(session_token);
-- Good use case: session token lookup
SELECT * FROM sessions WHERE session_token = 'abc123def456';
-- Bad use case: range query (won't use index)
SELECT * FROM sessions WHERE session_token > 'abc';
Important: PostgreSQL hash indexes were not WAL-logged until version 10, meaning they weren’t crash-safe. Most use cases are better served by B-tree indexes.
MySQL and Hash Indexes
MySQL’s InnoDB storage engine doesn’t support explicit hash indexes, but uses adaptive hash indexing internally:
- Automatically creates in-memory hash indexes for frequently accessed B-tree index pages
- Completely transparent to users
- Disabled with
innodb_adaptive_hash_index = OFF
Performance Comparison
Let’s compare B-tree vs Hash indexes with a benchmark on 10 million rows:
Benchmark Setup
-- Create test table
CREATE TABLE benchmark (
id SERIAL PRIMARY KEY,
lookup_key VARCHAR(50),
data TEXT
);
-- Insert 10 million rows
INSERT INTO benchmark (lookup_key, data)
SELECT
md5(random()::text),
repeat('x', 100)
FROM generate_series(1, 10000000);
-- Create indexes
CREATE INDEX idx_btree ON benchmark(lookup_key);
CREATE INDEX idx_hash ON benchmark USING hash(lookup_key);
ANALYZE benchmark;
Results: Equality Lookup
EXPLAIN ANALYZE
SELECT * FROM benchmark WHERE lookup_key = 'abc123';
| Index Type | Query Time | Index Size |
|---|---|---|
| No Index | 1,234 ms | N/A |
| B-Tree | 0.08 ms | 214 MB |
| Hash | 0.06 ms | 198 MB |
Winner: Hash index is 25% faster for equality lookups.
Results: Range Query
EXPLAIN ANALYZE
SELECT * FROM benchmark
WHERE lookup_key BETWEEN 'aaa' AND 'bbb';
| Index Type | Query Time |
|---|---|
| No Index | 1,456 ms |
| B-Tree | 45 ms |
| Hash | 1,398 ms |
Winner: B-tree index. Hash index provides no benefit for range queries.
Results: ORDER BY
EXPLAIN ANALYZE
SELECT * FROM benchmark ORDER BY lookup_key LIMIT 100;
| Index Type | Query Time |
|---|---|
| No Index | 2,345 ms |
| B-Tree | 0.12 ms |
| Hash | 2,298 ms |
Winner: B-tree index. Hash index can’t help with sorting.
Other Index Types
Modern databases offer specialized indexes for specific use cases:
GIN (Generalized Inverted Index)
For full-text search and array/JSON queries:
-- Full-text search
CREATE INDEX idx_articles_content ON articles USING gin(to_tsvector('english', content));
SELECT * FROM articles WHERE to_tsvector('english', content) @@ to_tsquery('database & performance');
-- JSONB queries
CREATE INDEX idx_users_metadata ON users USING gin(metadata);
SELECT * FROM users WHERE metadata @> '{"premium": true}';
GiST (Generalized Search Tree)
For geometric data, full-text, and range types:
-- Geometric queries
CREATE INDEX idx_locations_coords ON locations USING gist(coordinates);
SELECT * FROM locations WHERE coordinates && box '((0,0),(1,1))';
BRIN (Block Range Index)
For very large tables with naturally ordered data:
-- Efficient for time-series data
CREATE INDEX idx_logs_created ON logs USING brin(created_at);
-- Tiny index size (1000x smaller than B-tree)
-- Works well when data is mostly sorted by indexed column
Index Maintenance Best Practices
Monitor Index Usage
Find unused indexes:
SELECT
schemaname,
tablename,
indexname,
idx_scan as scans,
pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;
Indexes with zero scans are candidates for removal.
Rebuild Bloated Indexes
Over time, indexes become fragmented:
-- Check index bloat
SELECT
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE schemaname = 'public';
-- Rebuild without locking (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_users_email;
Avoid Over-Indexing
Each index has a cost:
-- Bad: Too many single-column indexes
CREATE INDEX idx_users_first_name ON users(first_name);
CREATE INDEX idx_users_last_name ON users(last_name);
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_city ON users(city);
CREATE INDEX idx_users_state ON users(state);
-- Better: Strategic composite index for common query patterns
CREATE INDEX idx_users_location ON users(state, city);
CREATE INDEX idx_users_email ON users(email); -- Keep for unique constraint
Aim for 3-5 indexes per table in most cases.
Choosing the Right Index Type
Use this decision tree:
-
Do you need range queries or sorting?
- Yes → B-tree index
- No → Continue
-
Is it an equality-only lookup on a high-cardinality column?
- Yes → Hash index (if your database supports it well)
- No → Continue
-
Is it full-text search?
- Yes → GIN or full-text index
- No → Continue
-
Is it geometric/spatial data?
- Yes → GiST or spatial index
- No → Continue
-
Is it a very large table with naturally sorted data?
- Yes → BRIN index
- No → Default to B-tree
Rule of thumb: When in doubt, use B-tree. It’s the most versatile and well-optimized index type.
Conclusion
Database indexes are essential for performance, but they’re not magic bullets:
- B-tree indexes are the default choice for good reason - they handle almost every use case efficiently
- Hash indexes provide marginal benefits for equality-only lookups but sacrifice versatility
- Specialized indexes (GIN, GiST, BRIN) solve specific problems brilliantly
The key to effective indexing is understanding your query patterns, measuring performance, and optimizing strategically. Don’t create indexes blindly - profile your queries, identify bottlenecks, and index deliberately.
Start with B-tree indexes for your most common query patterns, measure their impact, and iterate from there.