Module 3 - Introduction to Relational Databases
Module Overview
Get started with relational databases, including normalization and associations, to build robust data storage solutions.
Learning Objectives
- Recall that tables in a relational database have a fixed schema that every record must satisfy
- Recall that relational databases use NULL to specify that a record does not have a value for a specific column
- Recall that relational databases include numeric, date, varchar, and boolean representations of data
- Identify a unique record in a provided table by its primary key's value
- Recall that normalization reduces redundant data while enabling data consistency
- Recall that relational databases use foreign keys to enforce referential integrity
- Outline when an insert will fail in a relational database due to referential integrity constraints
- Recall that relational databases, unlike non-relational databases, have a fixed schema, maintain referential integrity, and enable joining across tables
- Outline referential integrity in relational databases
- Identify foreign and primary key pairs from a given entity relationship diagram
- Write an SQL statement that retrieves rows satisfying specified conditions from a single table
- Write an SQL statement that counts the number of rows satisfying specified conditions in a table
- Write an SQL statement that describes a table's columns
- Determine the results of executing a provided SQL select statement on a given table
- Write an inner join SQL statement that retrieves rows satisfying specified conditions
- Predict the results of executing a provided inner join SQL statement on a set of given tables
- Implement an SQL statement to answer a provided business question, given an entity relationship diagram
- Write an SQL statement that retrieves only specified columns from a single table
- Explain inner join in relational databases
Introduction to Relational Databases
A relational database stores an item's data across multiple related tables, unlike key-value databases like DynamoDB that store all information for an item in a single table.
Key RDB Concepts
- Records: Rows in a table (similar to DynamoDB items)
- Columns: Record attributes with fixed structure and data types
- Primary Key: Unique identifier for each record
- Schema: Fixed structure that defines the columns and their data types
- NULL: Special value indicating that a record doesn't have a value for a specific column
Common Data Types in Relational Databases
- numeric: For storing numbers
- date: For storing formatted dates
- varchar(x): Variable characters with a maximum length of x (similar to String)
- boolean: For storing true/false values
Here's an example of an Employees
table in a relational database:
id | leave | level | officeCode | officeName | dateJoined | state |
---|---|---|---|---|---|---|
dac041be | true | 5 | SEA24 | Fiona | 2016-07-23 | WA |
dac04434 | false | 6 | SEA41 | Day 1 | 2019-12-03 | WA |
dac045a6 | true | 4 | SEA32 | Nessie | 2015-09-15 | WA |
Unlike DynamoDB, adding a new column to an RDB table requires modifying the table schema and backfilling data, making careful upfront design important.
Overview
Up to this point, you've had numerous lessons teaching you about one of Amazon's database systems: DynamoDB. In this lesson, we're going to introduce relational databases, which are a different type of database! An understanding of relational databases is a good tool for any developer to have, and it'll help you grow as an engineer to think about data in a new and different way.
In this reading, you'll get a broad introduction to relational databases and the ways they compare and contrast to DynamoDB databases.
In the next reading, you'll look at the similarities and differences between Entity Relationship Diagrams in DynamoDB and relational databases. Once we've worked through an ERD example, we'll introduce some basic SQL commands.
Relational databases (RDBs)
As you may remember, we first introduced databases when discussing the need to organize the shoes in our closet. We chose to use a database because of its increased control over data entry and management. Databases allow us to store large amounts of data and easily scale upwards if we need to add more data.
AWS provides Amazon Relational Database Service (RDS) for managing databases in the cloud. Just as DynamoDB is one possible engine for running a key-value database, RDS supports many engines that can run a relational database. We'll focus on Amazon Aurora, an RDB that wraps and enhances other RDB engines. We'll use the MySQL flavor of Amazon Aurora for all our lessons.
RDB Tables
DynamoDB and other key-value databases store all the information for a single item in a single table. A relational database stores an item's data in many related tables.
First, let's go over the terminology. We call each row in the table a record; it's like a DynamoDB item. We call a record's attributes columns.
Each record must have a unique ID, called the primary key, like a DynamoDB hash key. Where DynamoDB requires that the hash key and sort key (if any) make a unique value for each item, an RDB table has no sort key, but can combine many columns to build its primary key.
While DynamoDB tables can include different attributes for each item, relational tables have a fixed structure. The RDB enforces the existence and type of each column for each record.
While SQL ignores capitalization, at ATA we will always make SQL commands ALL CAPS, use CapitalizedWords for table names, and write data types in lowercase.
numeric columns hold numbers. A date column has a formatted date. varchar(x) stands for "variable characters"; it's like a String because it can hold a variable number of characters, but it has a maximum length of x. Boolean columns have a value of either 'true' or 'false'.
SQL defines other data types, and individual RDB engines may add their own additions. If you're interested, w3schools has a good SQL reference, and there are many others.
Since an RDB table schema is fixed, you create the columns when you create the table. To add a new column, you have to modify the table schema and backfill every record with valid data. This is significant work; we avoid it as much as we can through careful design before table creation.
Columns may also be defined as nullable. This means that we can use the value NULL to indicate that we don't know an actual value to specify. If a column is not nullable, then it must contain a value other than NULL.
The following Employees table uses all the data types to keep track of employees and their offices. The id column is the primary key, and is a varchar(8). The leave column is a boolean telling us whether the employee is currently on leave. The employee's level is a numeric, their joinDate is a date, their officeCode is varchar(6), and their officeName is varchar(30). We also store their office state in a varchar(2). If any employees's office name is longer than 30 characters, it will be truncated. We could use a varchar to represent dates, but date guarantees that every date has the same format.
id | leave | level | officeCode | officeName | dateJoined | state |
---|---|---|---|---|---|---|
dac041be | true | 5 | SEA24 | Fiona | 2016-07-23 | WA |
dac04434 | false | 6 | SEA41 | Day 1 | 2019-12-03 | WA |
dac045a6 | true | 4 | SEA32 | Nessie | 2015-09-15 | WA |
dac046f0 | false | 5 | SEA24 | Fiona | 2018-01-25 | WA |
dac04a56 | true | 4 | SJC11 | SJC11 | 2019-03-08 | CA |
Normalization
Normalization is the process of organizing data to minimize redundancy. Normalizing a table generally involves dividing it into two or more tables, each containing information about only one type of entity. We then define the relationships between those tables.
When we store multiple kinds of data in a single table, we often find ourselves repeating redundant data. Redundant data not only takes up unnecessary space, it causes three specific anomalies.
- We can't insert new information until we have all the data for the record. Above, where we tracked our employees and their office information in a single table, we can't add a new office until we know at least one of the employees that work in it. We also can't add a new employee until we know which office they'll work in and its state. If we normalize the table to track employees, offices, and the "works at" relationship separately, we can add new employees and offices without waiting for other information.
- We have to update the same information in several records. Keeping everything consistent is difficult. In the employee / office example, when we change the name of an office, we have to find all the employees who work there and update the office's name in all their records. With normalized tables, the office data is in a single table, and we can update it in one record.
- Deleting a record removes more than one set of facts, so you lose additional information you may need. In the employee / office example, if an employee leaves the company and you delete their record from the table, you'll also lose information about the office they worked in. If they were the only employee in that location, you'd lose the office's name and state, too.
To normalize the office / employee example, we isolate all the employee data in a single table, and all the office data in a separate table. This takes less space, since we no longer duplicate the office information for each employee, and avoids all the anomalies we described.
Employees:
id | leave | level | dateJoined |
---|---|---|---|
dac041be | true | 5 | 2016-07-23 |
dac04434 | false | 6 | 2019-12-03 |
dac045a6 | true | 4 | 2015-09-15 |
dac046f0 | false | 5 | 2018-01-25 |
dac04a56 | true | 4 | 2019-03-08 |
Offices:
id | name | state |
---|---|---|
SEA24 | Fiona | WA |
SEA41 | Day 1 | WA |
SEA32 | Nessie | WA |
SJC11 | SJC11 | CA |
Defining Relationships
Since we try to focus each RDB table to a single purpose, we don't want to keep relationship data in a table with other data. Instead, we store each relationship type in its own table!
For the employee / office example, we'd create an EmployeeOffice table like this:
employee | office |
---|---|
dac041be | SEA24 |
dac04434 | SEA41 |
dac045a6 | SEA32 |
dac046f0 | SEA24 |
dac04a56 | SJC11 |
Foreign keys
Normalization isolates data into individual tables, but some data can't be isolated. Relationship tables are especially dependent on data in other tables.
In our employee / office example, suppose the last employee dac04a56 leaves the company. We delete their record from the Employees table. But the EmployeeOffice table still shows they work in SJC11. Someone who wants to find all the employees at SJ11 will discover an invalid relationship.
Relational databases use foreign keys to address this issue.
A foreign key is a set of one or more columns in one table (the "primary table") that refers to the primary key of another table (the "related table"). Unlike primary keys, foreign keys can contain duplicate values.
The RDB enforces that all records in the primary table have foreign key values that refer to a valid record in the related table. In our example, the EmployeeOffice table would set the employee and the office as foreign keys. Then the RDB would prevent us from deleting the employee unless we also deleted all their EmployeeOffice relationship records. It would also prevent us from adding a record that didn't refer to a valid employee and office, and from updating a record with an invalid value.
We use the term referential integrity to refer to the condition where all references are valid.
RDB Data consistency
As we've discussed, DynamoDB exhibits eventual consistency: data is not necessarily available immediately after writing. There's no way to guarantee that writes to multiple tables will succeed or fail together, so we can't guarantee consistency between tables.
Since relational databases write records in multiple tables and enforce referential integrity, they keep the tables up to date with consistent data that's immediately available. Any write that updates multiple tables will either succeed if all the writes succeed, or fail if any of the writes fails. On a failure, the RDB won't update any tables.
RDBs provide strong consistency: any data written is immediately available for reading.
Summary
In this reading, we gave you a broad overview of relational databases and their properties. You leared that relational databases store records in multiple related tables, and that normalization helps us to reduce redundant data and enforce referential integrity. Relational databases may not be used by every team, but they're still important. Familiarizing yourself with them is an asset in case you do need to work with them in the future, and they'll expand the way you think about data management.
Next Up
In the next reading, you're going to create an Entity Relationship Diagram for a relational database. You'll also start working with Structured Query Language!
SQL provides a way to write and query data across many tables at once. SQL can be pronounced as either "s-q-l" or "sequel". You'll likely hear it pronounced both ways!
Introduction to Accessing RDS
Normalization is the process of organizing data to minimize redundancy by dividing it into tables with each containing information about only one type of entity.
Why Normalize Data?
Redundant data not only takes up unnecessary space but causes three specific anomalies:
- Insertion Anomaly: Can't insert new information until you have all the data for the record
- Update Anomaly: Need to update the same information in several records to maintain consistency
- Deletion Anomaly: Deleting a record removes more facts than intended, potentially losing important information
Normalization Example
To normalize the employee/office example from the previous section, we would isolate employee data in one table and office data in another:
Employees (Normalized):
id | leave | level | dateJoined |
---|---|---|---|
dac041be | true | 5 | 2016-07-23 |
dac04434 | false | 6 | 2019-12-03 |
dac045a6 | true | 4 | 2015-09-15 |
Offices (Normalized):
id | name | state |
---|---|---|
SEA24 | Fiona | WA |
SEA41 | Day 1 | WA |
SEA32 | Nessie | WA |
EmployeeOffice (Relationship Table):
employee | office |
---|---|
dac041be | SEA24 |
dac04434 | SEA41 |
dac045a6 | SEA32 |
This approach allows adding new employees or offices independently and updates to office information only need to happen in one place.
Overview
We're continuing to explore Relational Databases to grow as engineers by thinking about data in different ways.
In the last reading, we gave you a broad introduction to relational databases and the ways they compare and contrast to key-value databases. We paid special attention to the differences in terminology and the way data is stored.
In this reading, you'll look at the similarities and differences between Entity Relationship Diagrams in DynamoDB and RDS. Once we've worked through an ERD example, we'll introduce some basic SQL commands to access information about a table's structure and retrieve data.
Entity Relationship Diagrams
An Entity Relationship Diagram (ERD) is a diagram used to map and display entities and their relationships. Since it describes data, we can make an ERD for an RDB table, not just for DynamoDB tables! We'll examine the ERD for this business problem:
You work for a software company that creates and sells software. You're tasked with creating an application that keeps track of all your products and the sales of those products. You have the following use cases:
- Retrieve all the products an employee has worked on.
- Retrieve all the sales for a product, including the date and cost of the sale.
- Get all the employees that have worked on a product.
- Get the total number of sales for a product.
In DynamoDB, we'd try to keep all the product details in a single table, with productId as the hash key and saleId as the sort key. Since we want to query products for an employee and employees for a product, we'd probably store a StringList of employees and use a scan() for one query. We'd be duplicating the employee list and product data for each sale.
For an RDB, we normalize the data into three entities and their relationships.
Let's look at the Entity Relationship Diagram (ERD) for this RDB:

Figure 1: ERD for our product sales RDB, with entities for Employee, Product, Sale, and a ProductEmployee relationship indicating which employees worked on a product.
Remember that an ERD describes the structure and relationship of entities, not of tables. For an RDB, an entity corresponds roughly with a record, so each entity usually does represent a table, but with a plural name.
Note that the dot to the left of a column indicates that it is not nullable. We also introduce (pk) to say the column is part of the table's primary key, and (fk) to show that the column is a foreign key in a related table.
In particular, note how the Sales table isn't entirely normalized: it includes information about a product, not just about a sale. This decision was justified because a sale cannot exist without a product. We maintain referential integrity by making the productId a foreign key.
Products:
productId | employeeId | productManager | totalSales |
---|---|---|---|
PR_4781237 | EM_7427832 | EM_7427832 | 245 |
PR_4781237 | EM_7478321 | EM_7427832 | 245 |
PR_4781237 | EM_7465219 | EM_7427832 | 245 |
Introduction to SQL
In the last reading we mentioned SQL, which stands for Structured Query Language. SQL is the standard language to write and query data in a relational database.
We'll be using the Query Editor in the RDS console to perform our queries. The Query Editor is a convenient way to connect to whatever database you are using and have a clear place to run your queries and retrieve results.
To run queries in the query editor, first login to the AWS Management Console. Next, navigate to Databases and choose the Aurora Serverless DB cluster where you want to run SQL queries. If you haven't connected to the database before, the 'Connect to database' page opens where you'll choose the database you're connecting to and enter in your user information. At the bottom of the popup is a field "Enter the name of the database or schema (optional)." In that field, enter the name of the database you're using. The query editor will open. In the query editor, enter the SQL query you want to run on the database and then hit 'Run'! Your output will be given below the editor window in the 'Output' tab. (For more detailed information about accessing and using the query editor, see the AWS documentation for Using the query editor for Aurora Serverless.
We have a database for you to experiment with in the ATACurriculum_SharedParticipantResources AWS account. You can connect to the unit6rds (yes, really Unit 6) RDS instance and query its TEST_SCHEMA database.
Describing the database
Many SQL tools will show you the tables, columns, and types in any database you connect to. If they don't, you can use SHOW TABLES to list the tables in a database, and DESC to describe a table's columns.
To describe the columns in the Products table, we'd use this SQL:
DESC Products;
Which produces the following output:
Field | Type | Null | Key |
---|---|---|---|
productId | varchar(10) | NO | PRI |
employeeId | varchar(10) | NO | PRI |
productManager | varchar(10) | YES | --- |
totalSales | int | NO | --- |
Retrieving with SELECT
The SQL command that retrieves data is the SELECT statement.
Let's return to the Products table that we set-up previously:
productId | employeeId | productManager | totalSales |
---|---|---|---|
PR_4781237 | EM_7427832 | EM_7427832 | 245 |
PR_4781237 | EM_7478321 | EM_7427832 | 245 |
PR_4781237 | EM_7465219 | EM_7427832 | 245 |
For this sample data, employeeId is the only column that is different. We can retrieve all the values that are stored in the employeeId column of the Products table using this SELECT command:
SELECT employeeId FROM Products;
This command produces the following output:
employeeId |
---|
EM_7427832 |
EM_7478321 |
EM_7465219 |
The SELECT command can retrieve data from multiple columns. To retrieve both the productId and the employeeId from the Products table, we can use the following command:
SELECT productId, employeeId FROM Products;
We use commas to separate as many columns as we want to retrieve! The command produces the following output:
productId | employeeId |
---|---|
PR_4781237 | EM_7427832 |
PR_4781237 | EM_7478321 |
PR_4781237 | EM_7465219 |
To retrieve all the columns in a table, use the * symbol. In SQL, the * symbol is a wildcard. When used with SELECT, it means 'select all columns'. To retrieve data from all the columns in the Products table, we use the following command:
SELECT * FROM Products
This command retrieves the following output:
productId | employeeId | productManager | totalSales |
---|---|---|---|
PR_4781237 | EM_7427832 | EM_7427832 | 245 |
PR_4781237 | EM_7478321 | EM_7427832 | 245 |
PR_4781237 | EM_7465219 | EM_7427832 | 245 |
Narrowing results with LIMIT
The LIMIT statement sets a maximum limit to the number of items you retrieve from the table, similar to the withLimit() method in DynamoDB. This limits the amount of data you access when you have a really large data set. It's also a good way to get sample data from a table. The LIMIT statement returns the number of rows you specify starting with the first row in the table.
We can retrieve all the columns of the first 2 records in the Products table with the following command:
SELECT * FROM Products LIMIT 2
This command retrieves the following output:
productId | employeeId | productManager | totalSales |
---|---|---|---|
PR_4781237 | EM_7427832 | EM_7427832 | 245 |
PR_4781237 | EM_7478321 | EM_7427832 | 245 |
Filtering results
We use the WHERE clause to select matching rows from a large table, using this syntax:
SELECT * FROM table WHERE predicate;
The 'predicate' is an expression that evaluates to TRUE, FALSE, or UNKNOWN (if a NULL value makes it impossible to determine whether the predicate is satisfied). Only the rows for which the predicate is TRUE are selected.
We use the SQL conditional and boolean operators to build predicates.
Operator | Java Equivalent |
---|---|
= | == |
<> | != |
> | > |
< | < |
>= | >= |
<= | <= |
AND | && |
OR | ǀǀ |
NOT | ! |
For instance, we might find all products with total sales over 9000 with this SQL statement:
SELECT * FROM Products WHERE totalSales > 9000;
SQL conditions have to deal with NULL. If a column's value is NULL, a WHERE clause will never apply the query to that row because the predicate will always evaluate to UNKNOWN. Even NOT evaluates to UNKNOWN if there is a NULL anywhere in its arguments.
In situations where you need to test whether a value is NULL, the operators IS NULL and IS NOT NULL are used:
- IS NULL returns TRUE when the value of a column is NULL
- IS NOT NULL returns TRUE when the value of a column is anything other than NULL
Aggregate functions
SQL allows us to do some calculations on sets of values and determine a single aggregate result:
- MIN calculates the minimum value of a column
- MAX calculates the maximum value of a column
- AVG finds the mean value of a column
- SUM adds up all the values in a column
- COUNT returns the number of rows that meet specific criteria.
To find the number of employees at the company whose last name is between 'E' and 'I', we could use this SQL statement:
SELECT COUNT(*) FROM Employees WHERE lastName > 'E' AND lastName < `I`;
SQL with Relationships
The queries we've described so far only touch one table. To query data in multiple tables, we must join them together.
There are four type of SQL joins, but we're only going to discuss the default, inner join. Inner joins retrieve rows only if they are shared by the joined tables.
The JOIN keyword takes the place of the table in a SELECT statement. To find the ID product manager for each individual sale over 9000, we could use this statement:
SELECT Products.productManager, Sales.cost FROM Sales JOIN Products ON Sales.productId = Products.productId WHERE Sales.cost > 9000;
Note how Sales JOIN Products indicates the tables to be joined, and ON Sales.productId = Products.productId specifies how to match rows in the tables.
When working with multiple tables, each column must be named with its table. We cannot write SELECT * FROM x JOIN y ON a = b, since both tables could each contain unrelated columns named a and b. The same principle holds for selecting specific columns.
To determine the names of the employees who worked on products managed by 'EM_7427832', we would need to join three tables. Here's what that SQL looks like:
SELECT Products.productManager, Employees.employeeId FROM ProductEmployee JOIN Products ON ProductEmployee.productId = Products.productId JOIN Employees ON ProductEmployee.productId = Employees.employeeId WHERE Products.productManager = 'EM_7427832';
Conclusion
In this reading, we examined an ERD for a relational database. Through this example, you saw the similarities and differences between DynamoDB and RDS, and the ways they access and store data. We learned some SQL commands allowing you to describe and access data from the example tables. This was just an introduction to the many ways you can access data using SQL!
RDBMS Associations
Associations define relationships between tables in a relational database. The primary mechanism for creating associations is through foreign keys.
Foreign Keys and Referential Integrity
A foreign key is a column (or set of columns) in one table that refers to the primary key in another table. It enforces referential integrity, ensuring that relationships between tables remain consistent.
For example, in the EmployeeOffice
table above:
- The
employee
column is a foreign key referencing theid
column in theEmployees
table - The
office
column is a foreign key referencing theid
column in theOffices
table
Referential integrity enforces rules like:
- You cannot add an employee-office relationship for an employee or office that doesn't exist
- If you try to delete an office that has employees assigned to it, the operation will fail unless you specify a cascading action
Types of Associations
- One-to-One: One record in a table is associated with exactly one record in another table
- One-to-Many: One record in a table can be associated with multiple records in another table (e.g., one office can have many employees)
- Many-to-Many: Multiple records in a table can be associated with multiple records in another table (requires a junction/relationship table)
Basic SQL for Querying Data
-- Select all employees SELECT * FROM Employees; -- Select specific columns SELECT id, level FROM Employees; -- Filter with conditions SELECT * FROM Employees WHERE level > 5; -- Count employees SELECT COUNT(*) FROM Employees; -- Join tables SELECT e.id, e.level, o.name, o.state FROM Employees e INNER JOIN EmployeeOffice eo ON e.id = eo.employee INNER JOIN Offices o ON eo.office = o.id;
Guided Project
In today's activity, you'll be taking on the role of a hard-bitten detective in SQL City.
There's been a murder, but you've misplaced your notes. Use your SQL skills to reassemble the evidence and find the killer!
Additional Resources
Key Concepts to Remember
- Tables in relational databases have fixed schemas that all records must follow
- Primary keys uniquely identify records in a table
- Foreign keys establish relationships between tables
- Normalization reduces data redundancy and helps maintain data integrity
- SQL provides a powerful way to query and manipulate data across multiple related tables