Monday, January 25, 2016

A Cache of Identities and a Sequence of Events

Bits and Bytes of
Sequences
Recently I received an interesting question: DB2 and other database systems have a feature to automatically generate numbers in a specified order. For DB2 this generator is called a sequence. When it is directly used to assign values to a column of a table, the term identity column is used. The next value of a sequence or identity column is derived by incrementing the last value by a specified amount. This works well, but sometimes there is a gap. Why?



Indeed is it possible to experience gaps for the generated numbers. An example would be that numbers from 1 to 187 have been generated and the next time a new value from a sequence or for an identity column is obtained, the number 200 is returned. The reason - and the possible gap and its properties - can be guessed by taking a look at the syntax, either for a sequence (CREATE SEQUENCE) or for an identity column (as part of CREATE TABLE): Both have a caching option with the default of "CACHE 20".

For each new value the database system needs to increment the last assigned value and return it. Calls can come from possibly hundreds of concurrent transactions. Because the database system needs to be fault-tolerant, the last assigned value needs to be persisted in the sequence's metadata (system catalog SYSCAT.SEQUENCES). After a possible system crash, the database system could obtain the last assigned value and start from there. Writing out the current number to disk for each request is really expensive, hence a chunk of new numbers is cached by DB2. The default is 20 values (CACHE 20). When the database shuts down ("is deactivated") the last number assigned off that cache is not stored. When the sequence or identity column is used the next time after the database is started or activated, the number for the next value is read from the system catalog which possibly causes a gap between the previously generated to the new value.


Caching is used for performance reasons, not for beauty. And hence users have the choice of changing the sequence or identity column behavior to either "no cache" (causes disk I/O for each new value) or far bigger cache sizes with possibly larger gaps. With this and other options users can give sequences their own identity...