A Deep Dive into Database Indexing
Database indexing is a strategy for optimizing data retrieval and querying in a database. A database index is a data structure that allows the DBMS to locate specific rows in a table rapidly.
Types of Database Indexes
B-tree Index: The B-tree is the most common, general-purpose index type. It is appropriate for data that can be ordered, and it is particularly effective for columns with high cardinality (many unique values), such as timestamps, usernames, or order numbers. We can use a B-tree index whenever an indexed column is involved in a comparison using one of these operators:
< <= = >= >
Constructs equivalent to combinations of these operators, such as BETWEEN and IN, can also be handled with a B-tree index search. A B-tree index can likewise be employed with an IS NULL or IS NOT NULL condition on an indexed column.
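As a minimal sketch, assuming a hypothetical orders table: PostgreSQL's CREATE INDEX builds a B-tree by default, and the planner can then use it for range constructs such as BETWEEN:
CREATE INDEX idx_orders_order_date ON orders (order_date);
SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';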
Hash Index: Hash indexes store a 32-bit hash code derived from the indexed column's value. As a result, these indexes can only perform simple equality (=) comparisons. Hash indexes are ideally suited to point lookups on high-cardinality columns.
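A minimal sketch, assuming a hypothetical sessions table keyed by a token column:
CREATE INDEX idx_sessions_token ON sessions USING hash (token);
-- Only simple equality lookups can use the hash index:
SELECT * FROM sessions WHERE token = 'abc123';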
Bitmap Index: Bitmap indexes are useful for columns with low cardinality (a small number of distinct values), such as gender, status codes, or boolean flags. They describe the presence or absence of each value in the indexed column using bitmap vectors. They perform well when numerous criteria are combined using AND or OR operators.
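Persistent bitmap indexes are a feature of databases such as Oracle; PostgreSQL instead builds bitmaps on the fly during scans. An Oracle-style sketch, assuming a hypothetical customers table:
CREATE BITMAP INDEX idx_customers_status ON customers (status);
-- Bitmaps for several low-cardinality predicates combine cheaply:
SELECT * FROM customers WHERE status = 'ACTIVE' AND region = 'EU';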
GiST Index: GiST indexes are not a single type of index, but rather a framework within which many distinct indexing strategies can be implemented. They are well suited to complex data such as geometric shapes, ranges, and full-text documents, and they are useful when we know something about the internal structure of the values being indexed. For instance, PostgreSQL's standard distribution contains GiST operator classes for various two-dimensional geometric data types, which allow indexed queries using these operators:
<< &< &> >> <<| &<| |&> |>> @> <@ ~= &&
GiST indexes are also capable of optimizing “nearest-neighbor” searches.
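A minimal sketch, assuming a hypothetical places table with a location column of PostgreSQL's point type:
CREATE INDEX idx_places_location ON places USING gist (location);
-- Nearest-neighbour search: the ten places closest to a given point
SELECT * FROM places ORDER BY location <-> point '(52.52, 13.40)' LIMIT 10;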
SP-GiST: SP-GiST indexes, like GiST indexes, provide a foundation for many types of searches. For example, the PostgreSQL standard distribution contains SP-GiST operator classes for two-dimensional points, which allow indexed queries with these operators:
<< >> ~= <@ <<| |>>
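A minimal sketch, assuming a hypothetical markers table with a point-typed pt column:
CREATE INDEX idx_markers_pt ON markers USING spgist (pt);
-- Find markers strictly to the left of a given point:
SELECT * FROM markers WHERE pt << point '(10, 10)';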
GIN: GIN indexes are “inverted indexes” that are useful for data values with many component values, such as arrays. PostgreSQL’s standard distribution contains a GIN operator class for arrays, which allows indexed queries with these operators:
<@ @> = &&
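A minimal sketch, assuming a hypothetical articles table with a text[] column named tags:
CREATE INDEX idx_articles_tags ON articles USING gin (tags);
-- Find articles whose tags array contains 'postgres':
SELECT * FROM articles WHERE tags @> ARRAY['postgres'];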
BRIN: BRIN indexes (Block Range INdexes) store summaries of the values held in consecutive physical block ranges of a table. As a result, they perform best for columns whose values correlate closely with the physical order of the table's rows. For data types with a linear sort order, the indexed data corresponds to the minimum and maximum values of the column within each block range. This supports indexed queries with the following operators:
< <= = >= >
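A minimal sketch, assuming a hypothetical events table that is appended to in roughly chronological order:
CREATE INDEX idx_events_created_at ON events USING brin (created_at);
-- Range scans over recent data can skip most block ranges:
SELECT * FROM events WHERE created_at >= '2024-06-01';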
Trigram Index: Trigram indexes break strings into overlapping three-character sequences (trigrams), which makes it possible to search huge text datasets quickly even when the search term only approximately matches the stored values. They are well suited to substring and similarity searches, and they can boost the efficiency of the LIKE operator with wildcards, including leading wildcards that a B-tree index cannot serve.
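In PostgreSQL, trigram support comes from the pg_trgm extension; a minimal sketch on a hypothetical users table:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_users_name_trgm ON users USING gin (name gin_trgm_ops);
-- Even a leading-wildcard pattern can use the trigram index:
SELECT * FROM users WHERE name ILIKE '%smith%';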
How to Identify Where an Index Is Applicable
Identify Frequently Accessed Columns: Analyze the queries frequently executed in the application and identify the columns involved in them. If a column is frequently used in the WHERE clause for filtering, it is a good candidate for indexing.
Slow Queries: Analyze slow-performing queries and consider whether adding indexes could optimize them. Use EXPLAIN (EXPLAIN your_query) or EXPLAIN ANALYZE to inspect the query plan:
EXPLAIN (FORMAT JSON) SELECT * FROM vrs_analytics LIMIT 1000;
To show the plan in YAML format instead:
EXPLAIN (FORMAT YAML) SELECT * FROM vrs_analytics LIMIT 1000;
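EXPLAIN ANALYZE goes a step further and actually executes the query, reporting real row counts and timings, which is the most reliable way to confirm whether an index is being used. A minimal sketch, assuming a hypothetical user_id column on the same table:
EXPLAIN ANALYZE SELECT * FROM vrs_analytics WHERE user_id = 42;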
High Cardinality Columns: Columns with high cardinality (many unique values) are often good candidates for indexing. Indexes on such columns can be more selective.
Composite Indexes: Consider constructing composite indexes that cover all of the columns used in the query for queries with multiple columns in the WHERE clause. This aids in decreasing the number of necessary indexes and optimising query speed.
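A minimal sketch, assuming a hypothetical orders table that is frequently filtered by customer and date:
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
-- A single composite index serves both predicates; column order matters,
-- so the column used for equality filters usually comes first:
SELECT * FROM orders WHERE customer_id = 42 AND order_date >= '2024-01-01';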
Problems of Database Indexing
Indexes, like most things in life, are not problem-free. Indexing accelerates read operations, resulting in quicker and more efficient queries, but indexes can slow down write operations (INSERT, UPDATE, DELETE): whenever an INSERT, UPDATE, or DELETE is performed on a table that has indexes, all of the indexes connected with the table must be updated as well. These updates take time, and the performance of a process that alters tables in bulk can suffer when the table has several indexes.
Indexes take up additional storage space, which can add up, especially when working with big databases or several indexes. Moreover, indexes may use memory for caching, impacting the overall memory use of the system.
When many indexes are present, the database query planner may spend more time choosing an optimal query plan.
Indexes may cause greater lock contention during write operations, potentially resulting in blocking issues in a multi-user environment. Poorly designed indexes can also lead to deadlocks, which occur when two or more transactions wait for each other to release locks.
Creating an index for every column may result in diminishing returns and greater maintenance costs.
How Can We Mitigate These Problems?
Firstly, when creating an index, we must weigh its costs against its benefits.
To reduce the impact on indexes while carrying out bulk inserts or updates, consider applying batch operations rather than individual transactions. Consider table partitioning as an approach for managing and optimising the performance of write-intensive tables.
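As a minimal sketch of the batching idea (table and columns are hypothetical), a single multi-row INSERT amortises per-statement and per-transaction overhead compared with many individual single-row transactions:
-- One statement, one transaction, instead of three round trips:
INSERT INTO events (user_id, action) VALUES
    (1, 'login'),
    (2, 'logout'),
    (3, 'login');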
Regularly monitor the usage of indexes and remove those that are not providing significant benefits.
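In PostgreSQL, for example, the pg_stat_user_indexes view records how often each index has been scanned; a minimal sketch for spotting removal candidates:
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;
-- An index that is never scanned may be safe to drop:
-- DROP INDEX IF EXISTS idx_unused_example;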
Implement regular maintenance tasks to address index fragmentation and optimise their performance. Schedule tasks such as rebuilding or reorganising indexes based on usage patterns.
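In PostgreSQL, for instance, a bloated index can be rebuilt in place; the CONCURRENTLY option (available since version 12) avoids blocking writes during the rebuild. Using the hypothetical index from earlier:
REINDEX INDEX CONCURRENTLY idx_orders_customer_date;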
Implement effective concurrency control mechanisms to handle contention for locks during write operations. Choose isolation levels that strike a balance between consistency and performance.
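As a small illustration, PostgreSQL allows the isolation level to be chosen per transaction, so a less strict level can be used where its weaker guarantees are acceptable:
BEGIN;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- ... statements that tolerate non-repeatable reads ...
COMMIT;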
Use database management tools that provide query analysers or profiling tools to identify inefficient queries and suggest potential indexes. Some databases have advisor tools that can recommend indexes based on query patterns.