GUID v Auto-increment IDs

by mandel on July 10th, 2009

I’m currently reading the rough cuts version of CouchDB: The Definitive Guide and at the very beginning of the book I have found the following remark:

Some SQL databases provide a feature where each new row in a table is assigned a unique numerical ID as it is inserted. These automatically incrementing row IDs, or autoincrement IDs as they are commonly known, can tempt us into taking a shortcut around the uniqueness problem. Autoincrement IDs can be useful for records without a natural key, but using them for all tables without proper consideration, as is common, can lead to problems. Unfortunately, this practice has been mistakenly enshrined as a core part of the way current databases work.

The above paragraph got me thinking, have I been doing that in macaco-contacts? specially in the ORM mapping and table definition.

I have been thinking on why I have being using auto-incremented IDs for the tables and the reason to do so. Should I remove the auto-incremented IDs to use more natural keys? Well lets first think about the options:

  • Use an auto-incremented ID for the PK.
  • Use a GUI for the PK
  • Use natural PKs (use columns that have data as PKs)

Auto-increment IDs

This is currently my choice. but is it the correct one? Well using auto-incremented IDs has the following pros:

  • A 4 byte integer is a small unique column to search is computationally easier than the use of the other means.
  • Is easy, the db does it.
  • I will for sure be unique (without replication).

Unfortunately I do understand that using a synthetic PKs is not the best practice, also the use of local auto-incremented IDs gives the following problems:

  • ID are not uniquely generated when using replication (foreseen in the future).
  • The key is not related with the data, why did you use 1 and not 2 as PK?
  • The ID column is most cases is redundant.

Well the most clear way to remove the auto-increment keys it would be using the columns present in the tables as the natural PKs. After some careful thinking I have the following conclusions:

  • Contacts cannot be easily uniquely identify using their name (the name in the company or the combination of name, middle-name and surname in a person). This is uite obvious, I’m sure I am not the only Manuel de la Pena or that their is just one Apple company in the world.
  • Numbers, addresses, emails and other contact data are unique, but it will give me a problem when adding them in the Db. Do I have to check for the number before adding it and later check it. Well that in SQLAlchemy is not that easy.

This lives be believe that using natural keys in this precise case will make my development a lot harder and there is no real advantage for the software (at least in this point of time).

GUIDs

GUIDs are a nice way to keep a unique identifier in a distributes database. GUIDs have the following advantages:

  • Unique across every table, every database, every server
  • Allows easy merging of records from different databases
  • Allows easy distribution of databases across multiple servers
  • You can generate IDs anywhere, instead of having to roundtrip to the database
  • Most replication scenarios require GUID columns anyway

Ofcourse the use of a GUID has the following cons.

  • It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you’re not careful

Conclusion

I believe that changing the primary keys from auto-incremented keys is the right way to go, specially if I want to (at some point) integrate the library with coachDB to allow users have their database replicated over the net and that way be able to access it from anywhere.

I will be working with SQLite for know, but I am very tempted to switch to coachDB in the near future. I’m thinking about doing this after the first alpha release of the app ;)

From Python

  • Roland Kaufmann

    Having been in this situation myself, I have one advice: Benchmark it! Write a little application that creates a table, inserts ten thousand records with nothing else that an identifier and read them back again in some random order. One version with autonumber, and one version with GUID. Check the time used and the size of the database on disk. I think you’ll be surprised.

    In my case, the speed overhead of the GUID version was around 10%. And that is not considering that the data that you are storing will take a lot more space than the identifiers alone.

    Another thing is that I find the remark “…instead of having to *roundtrip* to the database” somewhat troubling; most DBMSes have a way to retrieve the last generated autonumber inserted in that session and return that as a query result.

    There is really no way of using autonumbers in a portable manner; you should always use the facility that come with your DBMS.

  • http://mandel.themacaque.com mandel

    Have you tried using the auto ID in a distributed DB. My idea was to initially created a SQLite database in the desktop, laptop etc… Use that locally when not connected to the Internet and later update the new data to a server DB (Postgre) .

    All my thinking has started because if I have the app in a laptop and in the desktop I might have a problem with the ID given in the different machines…

    1. The laptop creates a new contact and does not sycn
    2. The deskop creates a new contact and does not sycn

    In this situation it might happen that I#ll have the same id for different contacts when they sync to the centralized repo….

    I hope I make sense :P