Module 4 - Data Modeling | Web Unit 4 Sprint 14

Data Normalization

Normalization is the process of designing or refactoring database tables for maximum consistency and minimum redundancy.

With objects, we're used to denormalized data, stored with ease of use and speed in mind. Non-normalized tables are considered ineffective in relational databases.

How to Build It

Data normalization is a deep topic in database design. To begin thinking about it, we'll explore a few basic guidelines and some data examples that violate these rules.

Normalization Guidelines

Each record has a primary key
No fields are repeated
All fields relate directly to the key data
Each field entry contains a single data point
There are no redundant entries

Denormalized Data

farm_name	animal1	animal2	animal3
Beech Ranch	pigs	chickens	goats
Morton Farms	horses	chickens	cows

This table has two issues. There is no proper id field (as multiple farms may have the same name) and multiple fields are representing the same type of data: animals.

farm_id	farm_name	animals
1	Beech Ranch	pigs, chickens, goats
2	Morton Farms	horses, chickens, cows

While we have now eliminated the first two issues, we now have multiple entries in one field, separated by commas. This isn't good either, as it's another example of denormalization. There is no "array" data type in a relational database, so each field must contain only one data point.

animal_id	animal	farm_name	ann_revenue
1	pig	Beech Ranch	65000
2	chicken	Beech Ranch	65000
3	goat	Beech Ranch	65000

Now we've solved the multiple fields issue, but we created repeating data (the farm field), which is also an example of denormalization. We can also see that if we were tracking additional ranch information (such as annual revenue), that field is only vaguely related to the animal information.

When these issues begin arising in your schema design, it means that you should separate information into two or more tables.

Anomalies

Obeying the above guidelines prevent anomalies in your database when inserting, updating, or deleting. For example, imagine if the revenue of Beech Ranch changed. With our denormalized schema, it may get updated in some records but not others:

animal_id	animal	farm_name	ann_revenue
1	pig	Beech Ranch	45000
2	chicken	Beech Ranch	65000
3	goat	Beech Ranch	65000

Similarly, if Beech Ranch shut down, there would be three (if not more) records that needed to be deleted to remove a single farm.

Thus a denormalized table opens the door for contradictory, confusing, and unusable data.

Table Relationships

There are three types of relationships:

One to one
One to many
Many to many

Determining how data is related can provide guidelines for table representation and guide the use of foreign keys to connect said tables.

How to Build It

One to One Relationships

Imagine we are storing the financial projections for a series of farms.

We may wish to attach fields like farm name, address, description, projected revenue, and projected expenses. We could divide these fields into two categories: information related to the farm directly (name, address, description) and information related to the financial projections (revenue, expenses).

We would say that farms and projections have a one-to-one relationship. This is to say that every farm has exactly one projection, and every project corresponds to exactly one farm.

This data can be represented in two tables: farms and projections

id	farm_name
1	Beech Ranch
2	Morton Farms

id	farm_id	revenue
1	1	65000
2	2	105000

The farm_id is the foreign key that links farms and projections together.

Notes about one-to-one relationships:

The foreign key should always have a unique constraint to prevent duplicate entries. In the example above, we wouldn't want to allow multiple projections records for one farm.
The foreign key can be in either table. For example, we may have had a projection_id in the farms table instead. A good rule of thumb is to put the foreign key in whichever table is more auxiliary to the other.
You can represent one-to-one data in a single table without creating anomalies. However, it is sometimes prudent to use two tables as shown above to keep separate concerns in separate tables

One to Many Relationships

Now imagine, we are storing the full-time ranchers employed at each farm. Each rancher would only work at one farm however, each farm may have multiple ranchers.

This is called a one-to-many relationship.

This is the most common type of relationship between entities. Some other examples:

One customer can have many orders.
One user can have many posts.
One post can have many comments.

Manage this type of relationship by adding a foreign key on the "many" table of the relationship that points to the primary key on the "one" table. Consider the farms and ranchers tables.

id	farm_name
1	Beech Ranch
2	Morton Farms

id	rancher_name	farm_id
1	John Doe	1
2	Jane Doe	1
3	Jim Done	2
4	Jay Dow	2
5	Jen Dunn	1

In a many-to-many relationship, the foreign key (in this case farm_id) should not be unique.

Many to Many Relationships

If we want to track animals on a farm as well, we must explore the many-to-many relationship. A farm has multiple animals, and multiple types of animals are present at multiple different farms.

Some other examples:

an order can have many products and the same product will appear in many orders.
a book can have more than one author, and an author can write more than one book.

We need to introduce an intermediary table that holds foreign keys that reference the primary key on the related tables to model this relationship. We now have a farms, animals, and farm_animals table.

id	farm_name
1	Beech Ranch
2	Morton Farms

id	animal
1	pig
2	chicken
3	goat

farm_id	animal_id
1	1
1	2
1	3
2	2

While each foreign key on the intermediary table is not unique, the combinations of keys should be unique.

Create Table Relationships Using Knex

The Knex query builder library also allows us to create multi-table schemas that include foreign keys. However, when designing a multi-table schema, there are a few extra things to keep in mind, such as using the correct order when creating tables, dropping tables, seeding data, and removing data.

We have to consider the way that delete and updates through our API will impact related data.

How to Build It

Foreign Key Setup

In Knex, foreign key restrictions don't automatically work. Whenever using foreign keys in your schema, add the following code to your knexfile. This will prevent users from entering bad data into a foreign key column.

development: {
  client: 'sqlite3',
  useNullAsDefault: true,
  connection: {
    filename: './data/database.db3',
  },
  // needed when using foreign keys
  pool: {
    afterCreate: (conn, done) => {
      // runs after a connection is made to the sqlite engine
      conn.run('PRAGMA foreign_keys = ON', done); // turn on FK enforcement
    },
  },
},

Migrations

Let's look at how we might track our farms and ranchers using Knex. In our migration file's up function, we would want to create two tables:

exports.up = function(knex, Promise) {
  return knex.schema
    .createTable('farms', tbl => {
      tbl.increments();
      tbl.string('farm_name', 128)
        .notNullable();
    })
    // we can chain together createTable
    .createTable('ranchers', tbl => {
      tbl.increments();
      tbl.string('rancher_name', 128);
      tbl.integer('farm_id')
        // forces integer to be positive
        .unsigned()
        .notNullable()
        .references('id')
        // this table must exist already
        .inTable('farms')
    })
};

Note that the foreign key can only be created after the reference table.

In the down function, the opposite is true. We always want to drop a table with a foreign key before dropping the table it references.

exports.down = function(knex, Promise) {
  // drop in the opposite order
  return knex.schema
    .dropTableIfExists('ranchers')
    .dropTableIfExists('farms')
};

In the case of a many-to-many relationship, the syntax for creating an intermediary table is identical, except for one additional piece. We need a way to make sure our combination of foreign keys is unique.

.createTable('farm_animals', tbl => {
  tbl.integer('farm_id')
    .unsigned()
    .notNullable()
    .references('id')
    // this table must exist already
    .inTable('farms')
  tbl.integer('animal_id')
    .unsigned()
    .notNullable()
    .references('id')
    // this table must exist already
    .inTable('animals')

  // the combination of the two keys becomes our primary key
  // will enforce unique combinations of ids
  tbl.primary(['farm_id', 'animal_id']);
});

Seeds

Order is also a concern when seeding. We want to create seeds in the same order we created our tables. In other words, don't create a seed with a foreign key until that reference record exists.

In our example, make sure to write the 01-farms seed file and then the 02-ranchers seed file.

However, we run into a problem with truncating our seeds, because we want to truncate 02-ranchers before 01-farms. A library called knex-cleaner provides an easy solution for us.

Run knex seed:make 00-cleanup and npm install knex-cleaner. Inside the cleanup seed, use the following code.

const cleaner = require('knex-cleaner');

exports.seed = function(knex) {
  return cleaner.clean(knex, {
    mode: 'truncate', // resets ids
    ignoreTables: ['knex_migrations', 'knex_migrations_lock'], // don't empty migration tables
  });
};

This removes all tables (excluding the two tables that track migrations) in the correct order before any seed files run.

Cascading

If a user attempt to delete a record that is referenced by another record (such as attempting to delete Morton Ranch when entries in our ranchers table reference that record), by default, the database will block the action. The same thing can happen when updating a record when a foreign key reference.

If we want that to override this default, we can delete or update with cascade. With cascade, deleting a record also deletes all referencing records, we can set that up in our schema.

.createTable('ranchers', tbl => {
    tbl.increments();
    tbl.string('rancher_name', 128);
    tbl.integer('farm_id')
    .unsigned()
    .notNullable()
    .references('id')
    .inTable('farms')
    .onUpdate('CASCADE');
    .onDelete('CASCADE')
})

Guided Project

Data Modeling Starter Code

Data Modeling Solution

Module 4 Project: Data Modeling

The module project contains advanced problems that will challenge and stretch your understanding of the module's content. The project has built-in tests for you to check your work, and the solution video is available in case you need help or want to see how we solved each challenge, but remember, there is always more than one way to solve a problem. Before reviewing the solution video, be sure to attempt the project and try solving the challenges yourself.

Instructions

The link below takes you to Bloom's code repository of the assignment. You'll need to fork the repo to your own GitHub account, and clone it down to your computer:

Starter Repo: Data Modeling

Fork the repository,
clone it to your machine, and
open the README.md file in VSCode, where you will find instructions on completing this Project.
submit your completed project to the BloomTech Portal

Module 4: Data Modeling

Data Normalization

How to Build It

Normalization Guidelines

Denormalized Data

Anomalies

Table Relationships

How to Build It

One to One Relationships

One to Many Relationships

Many to Many Relationships

Create Table Relationships Using Knex

How to Build It

Foreign Key Setup

Migrations

Seeds

Cascading

Guided Project

Module 4 Project: Data Modeling

Instructions

Solution

Additional Resources