[buug] normalization

Claude Rubinson cmsclaud at arches.uga.edu
Wed Oct 17 22:18:42 PDT 2001


at the last meeting we discussed the advantages and disadvantages of
database normalization.  i just came across some notes that i had written
up last year when i was trying to convince the powers-that-be at my
company that we were in severe need of a database redesign.

(originally, i had just cut-and-pasted the notes into this email message.
then i decided to edit out the company-specific pieces.  as i was editing
out those pieces, i started updating the notes to incorporate what i've
learned since i originally wrote those notes.  in the end, this turned
into a completely new write up.  things sometimes get away from
you.  hopefully, i haven't wasted too much time and someone will get
something of value out of this.)


first things first:

my experience has been primarily with Microsoft's SQL Server.  everything
that i've read suggests that, for the most part, these notes should
generalize to other high-end relational databases such as Oracle and
Sybase.  they'll generalize less to mid-range relational databases such
as PostgreSQL.
and, of course, they will only generalize to a limited extent to
non-relational databases such as FilePro, Access, and MySQL.  (and i don't
care what people say, MySQL is not a relational database.  it's a fine
product -- i use it myself -- and it has relational features.  but it's
not a relational database.)


benefits of normalization:

A. data integrity

the primary thing that normalization gives you is data integrity.  with a
fully normalized database design, you're not only assured that your
transactions will complete or roll back, you're also assured that inserts,
updates, and deletes -- that is, any type of modification to the data 
-- can't compromise your data.  to my mind, it is this latter attribute
that is the central benefit of relational databases.  there are a variety
of programmatic ways to check that a transaction has fully committed.
but only a fully normalized database with referential integrity can
guarantee the absence of insertion, update, and deletion anomalies.
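
to make that concrete, here's a rough sketch (the tables and columns are
invented for illustration, not taken from any real schema) of how
referential integrity blocks those anomalies:

    -- customer data lives in exactly one place.  the foreign key stops
    -- you from inserting an order for a customer that doesn't exist
    -- and from deleting a customer who still has orders.
    create table customers (
        customer_id  int          not null primary key,
        name         varchar(100) not null,
        phone        varchar(20)  null
    )

    create table orders (
        order_id     int      not null primary key,
        customer_id  int      not null
            references customers (customer_id),
        order_date   datetime not null
    )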


B. data retrieval

many people will tell you that it's more difficult to build queries
against a fully normalized database.  to a certain extent, that is true.
a fully normalized database will have many tables, each with a handful of
columns.  retrieving any single piece of data, then, requires joining
together multiple tables.  the key here then is simply to learn SQL -- SQL
was specifically designed to facilitate the retrieval of data from a fully
normalized database.  the problem is breaking out of the procedural
programming mentality that most of us are familiar with.  SQL is
set-oriented, which makes it a whole new ballgame.
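
to give a (made-up) flavor of what i mean by set-oriented: instead of
looping over rows one at a time, you describe the whole set of rows you
want changed in a single statement.  the table and columns here are
invented:

    -- no loop, no cursor: just a description of the set to change.
    update orders
    set    status = 'shipped'
    where  ship_date is not null
      and  status = 'open'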

i've found that with a fully normalized database, i can retrieve any piece
of data with a single query.  and, generally, the queries aren't that
complex.  this isn't to say that i never make use of multiple-statement
queries; i often do, but only under particular circumstances, such as
calculations where i need to write one statement per variable in the
calculation.  the important point, however, is that i never need to
run multiple statements to retrieve a given piece of information.
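
for example (again with invented table names, building on the sketch
above), pulling a customer's order history out of a normalized design is
still a single statement:

    select c.name, o.order_date, i.product_code, i.quantity
    from customers c
        join orders      o on o.customer_id = c.customer_id
        join order_items i on i.order_id    = o.order_id
    where c.customer_id = 42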


C. performance

anytime you read about normalization, you'll read that normalization
decreases performance.  the idea is that joins between tables are
expensive; since queries of a highly normalized database tend to require
joins between a greater number of tables, performance will suffer.  the
solution, then, is careful denormalization of the database.

my experience is that this isn't true unless your tables have millions of
records.  a fully normalized database should exhibit greater performance
because (1) your queries are shorter and simpler (as discussed above), (2)
your tables have fewer records since duplicate information has been
eliminated through the normalization process, and (3) indexes are more
efficient because each table only contains a limited amount of
non-redundant information.  this last point can't be overstated, as indexes
are a -- if not the -- primary key to performance.  with a properly
normalized database, your queries will almost always hit the clustered
index.
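
as a sketch (SQL Server syntax, invented names): the primary key gets a
clustered index by default, but you can put the one clustered index per
table on whatever columns your queries actually hit, assuming the primary
key was declared nonclustered.

    -- put the clustered index where the range scans happen.
    create clustered index ix_orders_customer
        on orders (customer_id, order_date)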


disadvantages of normalization:

A. normalization procedures

despite what i may say sometimes, the proper normalization of a database
is a challenging process.  there are CASE tools which purport to assist in
this process but i've never been particularly impressed with them.  i do
all of my normalization by hand.  the risk, of course, is that an
improperly normalized database can cause significant problems.  i once
over-normalized a database such that queries ended up returning false
data.  pretty impressive, i thought.

furthermore, planning is even more important because databases don't lend
themselves to rapid development. the unix philosophy of "build a prototype
early" is a sure recipe for disaster when applied to databases.  i know
this from a very, very painful experience (which i'm still living with
today.  'nuff said).  i now spend a great deal of time researching my
projects before i even begin the design.  


B. database restructuring

proper database design is also important because it can be very difficult
to modify the structure of a database after the fact.  modifying a
properly normalized database is easier than modifying an improperly 
normalized database (primarily because you're more confident in the 
integrity of your data) but modifying a database structure is never
pleasant.

database restructuring is also problematic since the database generally
serves as the foundation of a larger application.  changes to the
database risk breaking the application.  (you can alleviate some of this
risk by implementing abstraction layers between the database and the
application, but it's an imperfect solution.)
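
one common form of abstraction layer (shown here only as a sketch, with
invented names) is a view: the application queries the view, and the
underlying tables can be restructured behind it without touching the
application's SQL.

    create view customer_orders
    as
    select c.customer_id, c.name, o.order_id, o.order_date
    from customers c
        join orders o on o.customer_id = c.customer_id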


C. data insertion and maintenance

while a fully normalized database makes it easy to retrieve data, it can
be a serious pain in the ass to insert new records and update existing
records.  i have a handful of tables with intra-table dependencies (i.e.,
they have referential integrity on themselves).  these tables are locked
down so tightly that you can only insert, update, or delete data by
acting on all related records; it's impossible to update individual
records.  this is an extreme example (i've never seen it anywhere else)
but it does occur.  that data maintenance is tightly regulated isn't a bad
thing -- it guarantees the integrity of the data.  but it definitely can
be a pain.
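
here's a rough illustration (a hypothetical table, not one of mine) of
what referential integrity on itself looks like: every row except the
root has to point at an existing parent row, so inserts have to go
top-down and deletes bottom-up.

    create table employees (
        employee_id  int          not null primary key,
        name         varchar(100) not null,
        manager_id   int          null
            references employees (employee_id)
    )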


one final thought about relational databases -- they're not always
appropriate.  i've come to believe that relational databases are used far
more often than they need to be.  when people come to me asking about
implementing a database system, more often than not, i end up recommending
against implementing a relational database.  for many applications, text
files are perfectly suitable.  if your needs are limited, the
relationships between your data are not particularly complex, and your
queries are straightforward, MySQL or FilePro is the way to go.  and if
you find that you're going to need to embed a lot of complex business
rules into your database structure, take a look at an object-oriented
database system.  many of today's complex web applications -- particularly
banking and high-volume community sites such as Slashdot -- would be
better served by an OODBMS.  unlike an RDBMS which segregates the data
from the application, in an OODBMS, the data is the application.
unfortunately, at present, object-oriented database systems tend to be
quite expensive; and i'm not aware of any open source projects.









