Module 4: Data Modeling

Data Normalization

Normalization is the process of designing or refactoring database tables for maximum consistency and minimum redundancy.

With objects, we're used to denormalized data, stored with ease of use and speed in mind. Non-normalized tables are considered ineffective in relational databases.

How to Build It

Data normalization is a deep topic in database design. To begin thinking about it, we'll explore a few basic guidelines and some data examples that violate these rules.

Normalization Guidelines

  • Each record has a primary key
  • No fields are repeated
  • All fields relate directly to the key data
  • Each field entry contains a single data point
  • There are no redundant entries

Denormalized Data

farm_name animal1 animal2 animal3
Beech Ranch pigs chickens goats
Morton Farms horses chickens cows

This table has two issues. There is no proper id field (as multiple farms may have the same name) and multiple fields are representing the same type of data: animals.

farm_id farm_name animals
1 Beech Ranch pigs, chickens, goats
2 Morton Farms horses, chickens, cows

While we have now eliminated the first two issues, we now have multiple entries in one field, separated by commas. This isn't good either, as it's another example of denormalization. There is no "array" data type in a relational database, so each field must contain only one data point.

animal_id animal farm_name ann_revenue
1 pig Beech Ranch 65000
2 chicken Beech Ranch 65000
3 goat Beech Ranch 65000

Now we've solved the multiple fields issue, but we created repeating data (the farm field), which is also an example of denormalization. We can also see that if we were tracking additional ranch information (such as annual revenue), that field is only vaguely related to the animal information.

When these issues begin arising in your schema design, it means that you should separate information into two or more tables.

Anomalies

Obeying the above guidelines prevent anomalies in your database when inserting, updating, or deleting. For example, imagine if the revenue of Beech Ranch changed. With our denormalized schema, it may get updated in some records but not others:

animal_id animal farm_name ann_revenue
1 pig Beech Ranch 45000
2 chicken Beech Ranch 65000
3 goat Beech Ranch 65000

Similarly, if Beech Ranch shut down, there would be three (if not more) records that needed to be deleted to remove a single farm.

Thus a denormalized table opens the door for contradictory, confusing, and unusable data.

Table Relationships

There are three types of relationships:

  • One to one
  • One to many
  • Many to many

Determining how data is related can provide guidelines for table representation and guide the use of foreign keys to connect said tables.

How to Build It

One to One Relationships

Imagine we are storing the financial projections for a series of farms.

We may wish to attach fields like farm name, address, description, projected revenue, and projected expenses. We could divide these fields into two categories: information related to the farm directly (name, address, description) and information related to the financial projections (revenue, expenses).

We would say that farms and projections have a one-to-one relationship. This is to say that every farm has exactly one projection, and every project corresponds to exactly one farm.

This data can be represented in two tables: farms and projections

id farm_name
1 Beech Ranch
2 Morton Farms
id farm_id revenue
1 1 65000
2 2 105000

The farm_id is the foreign key that links farms and projections together.

Notes about one-to-one relationships:

  • The foreign key should always have a unique constraint to prevent duplicate entries. In the example above, we wouldn't want to allow multiple projections records for one farm.
  • The foreign key can be in either table. For example, we may have had a projection_id in the farms table instead. A good rule of thumb is to put the foreign key in whichever table is more auxiliary to the other.
  • You can represent one-to-one data in a single table without creating anomalies. However, it is sometimes prudent to use two tables as shown above to keep separate concerns in separate tables

One to Many Relationships

Now imagine, we are storing the full-time ranchers employed at each farm. Each rancher would only work at one farm however, each farm may have multiple ranchers.

This is called a one-to-many relationship.

This is the most common type of relationship between entities. Some other examples:

  • One customer can have many orders.
  • One user can have many posts.
  • One post can have many comments.

Manage this type of relationship by adding a foreign key on the "many" table of the relationship that points to the primary key on the "one" table. Consider the farms and ranchers tables.

id farm_name
1 Beech Ranch
2 Morton Farms
id rancher_name farm_id
1 John Doe 1
2 Jane Doe 1
3 Jim Done 2
4 Jay Dow 2
5 Jen Dunn 1

In a many-to-many relationship, the foreign key (in this case farm_id) should not be unique.

Many to Many Relationships

If we want to track animals on a farm as well, we must explore the many-to-many relationship. A farm has multiple animals, and multiple types of animals are present at multiple different farms.

Some other examples:

  • an order can have many products and the same product will appear in many orders.
  • a book can have more than one author, and an author can write more than one book.

We need to introduce an intermediary table that holds foreign keys that reference the primary key on the related tables to model this relationship. We now have a farms, animals, and farm_animals table.

id farm_name
1 Beech Ranch
2 Morton Farms
id animal
1 pig
2 chicken
3 goat
farm_id animal_id
1 1
1 2
1 3
2 2

While each foreign key on the intermediary table is not unique, the combinations of keys should be unique.

Create Table Relationships Using Knex

The Knex query builder library also allows us to create multi-table schemas that include foreign keys. However, when designing a multi-table schema, there are a few extra things to keep in mind, such as using the correct order when creating tables, dropping tables, seeding data, and removing data.

We have to consider the way that delete and updates through our API will impact related data.

How to Build It

Foreign Key Setup

In Knex, foreign key restrictions don't automatically work. Whenever using foreign keys in your schema, add the following code to your knexfile. This will prevent users from entering bad data into a foreign key column.

development: {
  client: 'sqlite3',
  useNullAsDefault: true,
  connection: {
    filename: './data/database.db3',
  },
  // needed when using foreign keys
  pool: {
    afterCreate: (conn, done) => {
      // runs after a connection is made to the sqlite engine
      conn.run('PRAGMA foreign_keys = ON', done); // turn on FK enforcement
    },
  },
},

Migrations

Let's look at how we might track our farms and ranchers using Knex. In our migration file's up function, we would want to create two tables:

exports.up = function(knex, Promise) {
  return knex.schema
    .createTable('farms', tbl => {
      tbl.increments();
      tbl.string('farm_name', 128)
        .notNullable();
    })
    // we can chain together createTable
    .createTable('ranchers', tbl => {
      tbl.increments();
      tbl.string('rancher_name', 128);
      tbl.integer('farm_id')
        // forces integer to be positive
        .unsigned()
        .notNullable()
        .references('id')
        // this table must exist already
        .inTable('farms')
    })
};

Note that the foreign key can only be created after the reference table.

In the down function, the opposite is true. We always want to drop a table with a foreign key before dropping the table it references.

exports.down = function(knex, Promise) {
  // drop in the opposite order
  return knex.schema
    .dropTableIfExists('ranchers')
    .dropTableIfExists('farms')
};

In the case of a many-to-many relationship, the syntax for creating an intermediary table is identical, except for one additional piece. We need a way to make sure our combination of foreign keys is unique.

.createTable('farm_animals', tbl => {
  tbl.integer('farm_id')
    .unsigned()
    .notNullable()
    .references('id')
    // this table must exist already
    .inTable('farms')
  tbl.integer('animal_id')
    .unsigned()
    .notNullable()
    .references('id')
    // this table must exist already
    .inTable('animals')

  // the combination of the two keys becomes our primary key
  // will enforce unique combinations of ids
  tbl.primary(['farm_id', 'animal_id']);
});

Seeds

Order is also a concern when seeding. We want to create seeds in the same order we created our tables. In other words, don't create a seed with a foreign key until that reference record exists.

In our example, make sure to write the 01-farms seed file and then the 02-ranchers seed file.

However, we run into a problem with truncating our seeds, because we want to truncate 02-ranchers before 01-farms. A library called knex-cleaner provides an easy solution for us.

Run knex seed:make 00-cleanup and npm install knex-cleaner. Inside the cleanup seed, use the following code.

const cleaner = require('knex-cleaner');

exports.seed = function(knex) {
  return cleaner.clean(knex, {
    mode: 'truncate', // resets ids
    ignoreTables: ['knex_migrations', 'knex_migrations_lock'], // don't empty migration tables
  });
};

This removes all tables (excluding the two tables that track migrations) in the correct order before any seed files run.

Cascading

If a user attempt to delete a record that is referenced by another record (such as attempting to delete Morton Ranch when entries in our ranchers table reference that record), by default, the database will block the action. The same thing can happen when updating a record when a foreign key reference.

If we want that to override this default, we can delete or update with cascade. With cascade, deleting a record also deletes all referencing records, we can set that up in our schema.

.createTable('ranchers', tbl => {
    tbl.increments();
    tbl.string('rancher_name', 128);
    tbl.integer('farm_id')
    .unsigned()
    .notNullable()
    .references('id')
    .inTable('farms')
    .onUpdate('CASCADE');
    .onDelete('CASCADE')
})

Module 4 Project: Data Modeling

The module project contains advanced problems that will challenge and stretch your understanding of the module's content. The project has built-in tests for you to check your work, and the solution video is available in case you need help or want to see how we solved each challenge, but remember, there is always more than one way to solve a problem. Before reviewing the solution video, be sure to attempt the project and try solving the challenges yourself.

Instructions

The link below takes you to Bloom's code repository of the assignment. You'll need to fork the repo to your own GitHub account, and clone it down to your computer:

Starter Repo: Data Modeling

  • Fork the repository,
  • clone it to your machine, and
  • open the README.md file in VSCode, where you will find instructions on completing this Project.
  • submit your completed project to the BloomTech Portal

Solution

Additional Resources