Database Design and Normalisation

This area is one which many students find difficult at first, because very often there is no one right answer.

In order to normalise data, you must apply the rules as described in chapter 5 of the Course Text:

However, your decision as to whether an attribute (i.e. column) is dependent upon another attribute is based on your understanding of the relationship or business process which you are modelling. There is always an element of subjective judgement

Secondly, the exercises we use in class are always a simplification of real-world examples. In practice, databases used in business today may have many hundreds of different tables. Our exercises in class have just a handful of tables.

You may find it helpful sometimes to note down the assumptions which you make in your database design.

An example - the British Army

As an example, I have chosen to design a simple database to show where individual soldiers are stationed. I want to store the name, rank and serial number of each soldier, their company and regiment, and where they are stationed.

The following table of data is a first attempt to illustrate what my database might contain:

Name Rank Serial_No Company Battalion Location
Aaron Private

12345678

1st Rifle Company Devonshire & Dorset Regiment Belfast
Brooks Sergeant

54321765

No 7 Company 1st Battalion, Coldstream Guards Basra
Waters Private

45326789

Headquarters Company 1st Battalion, Queens Lancashire Cyprus

Now the first thing to notice is that I am immediately in trouble regarding companies, regiments and battalions. I have learned what little I know about the organisation of the army from the Ministry of Defence web site, and I can tell you it is quite complicated. You can see for yourself at http://www.army.mod.uk/unitsandorgs/ and http://www.army.mod.uk/infantry/org_role_loc/infantry_locations_uk.htm for example.

Generally speaking, soldiers belong to a company which is part of a battalion. Army units are usually deployed to particular locations as battalions, although it is possible to send a company on its own, or indeed to post an individual soldier somewhere!

Furthermore, some regiments only have one battalion, whereas other regiments consist of four or five different battalions.

I have decided to get around this by making the assumption, for the purpose of this exercise, that it is always the battalion which is sent to a particular location. I know this isn't always true in real life, but I have to make such an assumption otherwise the exercise would become too complicated for use in class.

As always in database design, drawing an E-R diagram helps me to visualise my model and will be useful to me subsequently when I am normalising my data.

So at each location I may have one or more battalions which may come from different regiments. For example I may have a battalion of infantry and a battalion of engineers to support them. And I'm going to completely ignore platoons, divisions, brigades, corps and other organisational units which exist in the real-life army!

The next step is to work on my table of data. I will have to introduce a new column to distinguish between battalion and regiment.

As described in Richie, I want to start by splitting my table into two tables by removing a repeating group. But I will only see a repeating group if I have put repeating data into my table in the first place! When you populate your data table for the purpose of a normalisation exercise, you must go out of your way to create repeating data where it is possible to do so!

Name Rank Serial_No Company Regiment Battalion Location
Aaron Private

12345678

1st Rifle Company Devonshire & Dorset 2nd Battalion Belfast
Brooks Sergeant

54321765

No 7 Company Coldstream Guards 1st Battalion Basra
Kendrick Private

45888789

Headquarters Company Coldstream Guards 1st Battalion Basra
Kelly Corporal

25111643

Headquarters Company Coldstream Guards 1st Battalion Basra
Khan Private

77323289

3rd Rifle Company Coldstream Guards 2nd Battalion Catterick
Morris Corporal

12126733

2nd Rifle Company Coldstream Guards 2nd Battalion Catterick
Nunn Private

48446789

B Company Queens Lancashire 1st Battalion Cyprus
Wall Private

85744789

B Company Queens Lancashire 1st Battalion Cyprus
Waters Private

99326789

Headquarters Company Queens Lancashire 1st Battalion Cyprus

Notice how, in the table above, the names of battalions are repeated, but the 2nd Battalion of the Coldstream Guards is a different entity to the 2nd Battalion of the Devonshire and Dorset.

Similarly, most battalions have a "Headquarters Company". But the two instances of "Headquarters Company" in the above table represent two different entities.

I now have various potential repeating groups for my initial "two table" normalization. These include:

This makes sense when compared with my E-R diagram because

But I could also start with Company, Regiment, Battalion, Location as my repeating group. It really doesn't matter where I split the table because I will be applying the normalization rules after I split the table, and I should arrive at the same result whichever route I take.

The table we form from the repeating group must have a unique primary key, and the remaining table must have a foreign key which allows us to maintain the connection between the two sets of data. (It may be appropriate to create a new field for the primary key, but let's leave that for now.) The primary key is a column or group of columns which uniquely identifies each instance of the entity. There must be only one row in the table with a particular value in the primary key column, or a particular combination of values if the key is more than one column.

Let's start with Company, Regiment, Battalion,Location. This gives us one table defining a company as an entity with four attributes, three of which combine to make a composite primary key, plus a second table where each soldier has their Company, Regiment and Battalion as the foreign key.

We need to identify one or more columns as a unique primary key in the second table as well. We can see that for the Soldier table, Serial_No is obviously a unique primary key. (If a primary key does not obviously present itself, it may again be appropriate to create a new column.)

Table CompanyTable Soldier
RegimentSerial_No
BattalionName
CompanyRank
LocationRegiment
Battalion
Company

Since we are not keeping any information about locations other than the location name, we can simplify matters further by considering location to be just an attribute of a battalion, rather than being an entity in itself as well as an attribute of a battalion, as was shown in the E-R diagram. The danger in doing so is that we risk having different spellings of the same location. This would then cause a problem when querying the database, as we could select different data by using different spellings of the same location name. In practice we would overcome this problem by having a table of valid location names, and only permitting data values for the location attribute which had matching entries in the validation table. This is quite a common issue when trying to determine whether something is an entity in itself - and therefore requires a separate database table - or is merely an attribute of another entity. In this example, we will proceed with location as an attribute rather than an entity.

We now ask is "Location" dependent on the whole of the primary key, or just a part of it. Earlier we made the assumption that only an entire battalion could be sent to a particular location. Hence location is dependent only on the part of the key which uniquely identifies the battalion:

Table CompanyTable SoldierTable Deployment
RegimentSerial_NoRegiment
BattalionNameBattalion
CompanyRankLocation
Regiment
Battalion
Company

Note that if my initial two-table split had put Location into the Soldier table, then it would fail the other test as an attribute which is dependent on another non-key attribute.

Whilst normalized, the above is potentially a clumsy solution in that the Company table has a lengthy key composed of text fields, and a copy of one of these lengthy keys appears in each record of the soldier table. We can overcome this by creating an additional Company_ID field as a unique primary key for the Company table.

Table CompanyTable SoldierTable Deployment
Company_IDSerial_NoRegiment
RegimentNameBattalion
BattalionRankLocation
CompanyCompany_ID

Taking this approach further yields a more complex, but arguably more elegant database, with perhaps greater scope for enforcing data integrity. And there are many other possible solutions.

Table Regiment_NamesTable Battalion_NamesTable Company_Names Table Companies
Regiment_CodeBattalion_CodeCompany_Code Company_ID
Regiment_NameBattalion_NameCompany_NameBattalion_ID
Company_Code
Table BattalionsTable Soldier
Battalion_IDSerial_No
Battalion_CodeName
Regiment_CodeRank
LocationCompany_ID

The effect of time

This exercise is also quite a good example of how history can be lost by the relational database approach. This database tells us where a soldier is now, but not necessarily where they were six months ago. So, as a finishing touch I will remove Location from the Battalions table and create a new table.

Table Tours_Of_Duty
Battalion_ID
Start_Date
End_Date
Location

PT - May 2005 updated Dec 2006, Apr 2008