GUID v Auto-increment IDs
I’m currently reading the rough cuts version of CouchDB: The Definitive Guide and at the very beginning of the book I have found the following remark:
Some SQL databases provide a feature where each new row in a table is assigned a unique numerical ID as it is inserted. These automatically incrementing row IDs, or autoincrement IDs as they are commonly known, can tempt us into taking a shortcut around the uniqueness problem. Autoincrement IDs can be useful for records without a natural key, but using them for all tables without proper consideration, as is common, can lead to problems. Unfortunately, this practice has been mistakenly enshrined as a core part of the way current databases work.
The above paragraph got me thinking, have I been doing that in macaco-contacts? specially in the ORM mapping and table definition.
I have been thinking on why I have being using auto-incremented IDs for the tables and the reason to do so. Should I remove the auto-incremented IDs to use more natural keys? Well lets first think about the options:
- Use an auto-incremented ID for the PK.
- Use a GUI for the PK
- Use natural PKs (use columns that have data as PKs)
This is currently my choice. but is it the correct one? Well using auto-incremented IDs has the following pros:
- A 4 byte integer is a small unique column to search is computationally easier than the use of the other means.
- Is easy, the db does it.
- I will for sure be unique (without replication).
Unfortunately I do understand that using a synthetic PKs is not the best practice, also the use of local auto-incremented IDs gives the following problems:
- ID are not uniquely generated when using replication (foreseen in the future).
- The key is not related with the data, why did you use 1 and not 2 as PK?
- The ID column is most cases is redundant.
Well the most clear way to remove the auto-increment keys it would be using the columns present in the tables as the natural PKs. After some careful thinking I have the following conclusions:
- Contacts cannot be easily uniquely identify using their name (the name in the company or the combination of name, middle-name and surname in a person). This is uite obvious, I’m sure I am not the only Manuel de la Pena or that their is just one Apple company in the world.
- Numbers, addresses, emails and other contact data are unique, but it will give me a problem when adding them in the Db. Do I have to check for the number before adding it and later check it. Well that in SQLAlchemy is not that easy.
This lives be believe that using natural keys in this precise case will make my development a lot harder and there is no real advantage for the software (at least in this point of time).
GUIDs are a nice way to keep a unique identifier in a distributes database. GUIDs have the following advantages:
- Unique across every table, every database, every server
- Allows easy merging of records from different databases
- Allows easy distribution of databases across multiple servers
- You can generate IDs anywhere, instead of having to roundtrip to the database
- Most replication scenarios require GUID columns anyway
Ofcourse the use of a GUID has the following cons.
- It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you’re not careful
I believe that changing the primary keys from auto-incremented keys is the right way to go, specially if I want to (at some point) integrate the library with coachDB to allow users have their database replicated over the net and that way be able to access it from anywhere.
I will be working with SQLite for know, but I am very tempted to switch to coachDB in the near future. I’m thinking about doing this after the first alpha release of the app