Indexing in PostgreSQL (B-Tree)
Indexing in data bases:
Indexing is common strategy used by databases to reduce the query times. PostgreSQL supports multiple types of indexing like B-Tree, Generalized Inverted Index (GIN), Generalized Inverted Search Tree (GiST), Hash to name few. The basic idea in all of them is that query performance can be improved by minimizing the number of disk accesses required to execute the query. In this blog we will be focusing on B-Tree index type.
Before going further, let look at some definitions:
Index: An index is a specific structure that organizes a reference to your data that makes it easier to look up. In PostgreSQL it is a copy of the item you wish to index combined with a reference to the actual data location.
Page: A page, memory page, or virtual page is a fixed-length contiguous block of virtual memory, described by a single entry in the page table. It is the smallest unit of data for memory management.
Block: A block is the smallest unit of data that an operating system can either write to a file or read from a storage devices. Typically it is equal to page size.
B-Tree: B-Tree is a self-balancing search tree. Unlike most other self-balancing search trees (like AVL and Red-Black Trees), the main use case for B-Tree’s is when the data is too much to keep all of it in memory and is read from disk in the form of blocks. This reading from disk is very expensive.
Important things about B-Tree:
- B-Tree is a balanced tree hence operations like search, insert, delete are O(h) complex where h is the height of the tree.
- In the case of B-Tree the complexity is measured in terms of disk accesses.
- In general B-Tree’s have huge number of branching factor to keep their height as small as possible. ( As b^h= N)
- The height of B-Trees is kept low by putting maximum possible keys in a B-Tree node. Generally, a B-Tree node size is kept equal to the disk block size.
- Since h is low for B-Tree, total disk accesses for most of the operations are reduced significantly compared to balanced Binary Search Trees like AVL Tree, Red-Black Tree etc.
Query: The look-up in a B-Tree are straight forward. We start at the root node and the follow the pointer till we reach a leaf node. Within each node we can use binary search to determine which pointer to take next. In leaf node a binary search will result in the required element if present.
Insertion: Insertion in a B-Tree is also fairly simple. Similar to search we will find the root node where the data should be inserted and:
- If the node has space in the node and redistribute the node.
2. If the node is full then break the node into two other nodes with median as parent and push the new node to the previous parent node and repeat step 1 in parent.
This ensures that the entire tree is balanced.
B-Tree Indexing in PostgreSQL:
Now that the definitions are clear lets understand B-Tree Indexing in PostgreSQL. As the name suggests the primary data structure used in B-Tree index is a B-Tree. In PostgreSQL the node size of the B-Tree is kept same as the page/block size which is default size 8KB. PostgreSQL uses Efficient Locking for Concurrent Operations on B-Trees version of B-Tree’s.
B-Tree Index Implementation in PostgreSQL:
- The order of all the nodes in index tree need not be same unlike the convention as each node is limited by total size of the node rather than total number of key in it. Meaning if you have variable-size data, each node in your index tree can have different number of children.
- All the non-leaf nodes are only useful for traversal. No information regarding the actual data is stored in them.
- Each key in leaf node stores the information of pointer to block in which actual record is stored.
The O(h) insert and search capability provided by the B-Tree is the reason behind the magical promise of indexes to tremendously decrease of query time while only marginally affecting the insert time.
Some interesting facts about Indexes in PostgreSQL:
- Even when there is an index present PostgreSQL will sometimes not use it if the query results in large percentage of data from the table. This is because it is easier to scan the entire table only once than using the index getting the block addresses for almost all of the table, and then actually accessing the data from table.
- When all the required information for the query is in index PostgreSQL (>9.2) will not access the table at all.
- Having index on columns which has smaller individual data entries can increase performance very slightly compared to having index on column with larger data as more number of index entries can be fit in a single disk page, which reduces the number of disk accesses required for getting index.