Module 1: Intro to DynamoDB

Module Overview

Learn the fundamentals of DynamoDB, including database concepts, partition and sort keys, and data types.

Learning Objectives

  • Outline the role a database serves in an application
  • Outline some of the benefits of distributed over non-distributed data stores
  • Identify a unique item in a provided DynamoDB table by its partition key
  • Identify a unique item in a provided DynamoDB table by its partition and sort keys
  • Identify whether a given Java type should be represented by a DynamoDB Number
  • Identify whether a given Java type should be represented by a DynamoDB String
  • Identify whether a given Java type should be represented by a DynamoDB Boolean
  • Identify whether a given Java type should be represented by a DynamoDB NumberSet
  • Identify whether a given Java type should be represented by a DynamoDB StringSet
  • Outline the cases in which each of create, read, update, and delete operations of a database is used
  • Understand the core concepts of NoSQL databases
  • Learn about DynamoDB's data model, including tables, items, and attributes
  • Master the use of partition keys and sort keys for efficient data access
  • Explore DynamoDB's supported data types, including scalar types and sets
  • Learn the principles of designing effective DynamoDB tables

Introduction to Databases

Key Concepts

A database is an organized collection of data, stored and accessed electronically from a computer system. Databases are essential for storing and retrieving data reliably and efficiently. They serve critical roles in applications by:

  • Providing a structured way to store and organize large amounts of data
  • Enabling efficient retrieval of information based on specific attributes
  • Ensuring data consistency and integrity across applications
  • Supporting concurrent access by multiple users

In DynamoDB, data is organized into tables, items (similar to rows in relational databases), and attributes (similar to columns). Each item in a DynamoDB table can be uniquely identified by its primary key.

Consider a practical example: if you organized 30 pairs of shoes in your closet, you would create a system where each shoe has attributes like color, style, and occasion. Similarly, databases organize data with attributes that allow for easy retrieval based on specific criteria.


// Example of data representation in a DynamoDB table
Table: ShoeOrganizer
{
  "shoe_id": "SN01",  // Partition key
  "cubby_location": 1,
  "color": "grey",
  "style": "sneaker",
  "occasion": "athletic"
}
                

Distributed database systems like DynamoDB offer additional benefits:

  • Ability to store much larger datasets across multiple machines
  • Higher availability with geographically distributed data storage
  • Increased fault tolerance - if one server fails, others can handle requests
  • Better support for concurrent requests through distributing load

What is a Database?

A database is an organized collection of data, stored and accessed electronically from a computer system. Databases are popular because of their ability to reliably store and retrieve data in various ways and because of the large amount of data they can store.

Databases are used for so many different applications that you likely interact with one every day. Whether you're buying something online, logging into social media, or accessing a webpage, all those applications likely use a database. While this section gives a general overview of databases, in later sections you will learn how to use the Amazon DynamoDB database service in particular.

Organizing your Closet

Imagine you're cleaning your room and you have 30 pairs of shoes that you need to put away. If you just throw all your shoes into the closet, shut the door, and walk away, then tomorrow when you're looking for a specific pair of shoes, it's going to be difficult to find them. Instead of throwing your shoes in a closet, you buy a shoe organizer and label each cubby with the pair of shoes it contains---completely organizing your closet!

This works great if you always know the exact pair of shoes you want to wear, but what if you want to wear grey shoes and aren't sure exactly which pair? If you keep track of certain attributes that your shoes have then you can pull them out based on that attribute. Examine the following table as a brief example:

shoe_id | cubby_location | color | style | occasion
SN01 | 1 | grey | sneaker | athletic
BO01 | 2 | grey | boot | work
SN02 | 3 | black | sneaker | casual

Now if you want a grey pair of shoes, you could take out the shoes located in cubbies 1 and 2. If you wanted sneakers you could take out shoes in cubbies 1 and 3. If you want to go to the gym, you know that the only athletic shoes you own are located in cubby 1. These are a few examples of common database use cases: retrieving items from a large set based on the items' characteristics.
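To make the idea of retrieving items by an attribute concrete, here is a minimal Java sketch. It models the shoe table as a plain in-memory list rather than a real database, and the names ShoeCloset, Shoe, and cubbiesByColor are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the shoe organizer table; no DynamoDB calls are made.
class ShoeCloset {
    // One item of the table, with one field per attribute.
    static final class Shoe {
        final String shoeId;
        final int cubbyLocation;
        final String color;
        final String style;
        final String occasion;

        Shoe(String shoeId, int cubbyLocation, String color, String style, String occasion) {
            this.shoeId = shoeId;
            this.cubbyLocation = cubbyLocation;
            this.color = color;
            this.style = style;
            this.occasion = occasion;
        }
    }

    // The three items from the shoe table.
    static List<Shoe> sampleShoes() {
        return List.of(
                new Shoe("SN01", 1, "grey", "sneaker", "athletic"),
                new Shoe("BO01", 2, "grey", "boot", "work"),
                new Shoe("SN02", 3, "black", "sneaker", "casual"));
    }

    // Returns the cubby locations of every shoe with the requested color.
    static List<Integer> cubbiesByColor(List<Shoe> shoes, String color) {
        List<Integer> cubbies = new ArrayList<>();
        for (Shoe shoe : shoes) {
            if (shoe.color.equals(color)) {
                cubbies.add(shoe.cubbyLocation);
            }
        }
        return cubbies;
    }

    public static void main(String[] args) {
        System.out.println(cubbiesByColor(sampleShoes(), "grey")); // prints [1, 2]
    }
}
```

A database does this kind of filtering for us, and does it efficiently even when the table holds far more than three items.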

Organizing your shoes this way may feel unnecessary, but when you're dealing with a dataset of thousands, millions, or billions of entries, it's critical to have a reliable way to organize and retrieve data. Databases usually comprise multiple tables, each table representing an entity. Entities represent people, places, or other things about which data is captured and stored. Items represent an instance of an entity, and attributes are the characteristics or properties of an entity. Entities are analogous to Java classes in that they represent a kind of thing. Items are analogous to instances of classes, in that they represent individual things of that kind.

In our shoe example, the entity is shoe. The items are the individual shoe entries in our table, and each shoe's attributes are shoe_id, cubby_location, color, style, and occasion. The first item, shoe_id SN01, has the following attribute values: SN01, 1, grey, sneaker, athletic.

Moving our Example to DynamoDB

Now that we've laid out the data for our database table and learned some of the concepts, we can map our example to a specific database service. As mentioned previously, we'll be using Amazon's DynamoDB database service, as it's a good bet that you'll be using DynamoDB before too long in your SDE career at Amazon.

In DynamoDB, each attribute must have a data type assigned to it. DynamoDB supports numerous types, but the main ones you will use are Strings, Numbers, and Booleans. In our above example, the cubby_location attribute uses a Number type, and the other four attributes use a String type. We'll talk more about attribute types in a later section.

Each table in DynamoDB must have an attribute or pair of attributes that is its primary key. Primary keys are unique identifiers, a value that represents only that item in the table. The item can be unambiguously identified and retrieved from DynamoDB by that primary key. In our shoe example, the primary key is the shoe_id --- each shoe gets a unique ID. The shoe_id contains the first two letters of the style of shoe and a number that distinguishes different shoes of the same style. For example, the shoe with key SN01 is a sneaker and is a different pair from SN02, another pair of sneakers. We will go into more detail about keys in a later section.

Retrieving information from a database is known as a query. A query can retrieve one item or record (e.g. the shoe with ID SN01) or can retrieve multiple pieces of data based on specified attributes (e.g. all shoes that are the color grey). If we retrieve all grey shoes, we end up with both SN01 and BO01. A database's ability to retrieve multiple pieces of data based on attribute values is one of the reasons databases are so useful. We will go into more depth about queries in a later unit.

Our shoe example is one table we could have in our database. We could create tables for shirts and pants, and a table to represent complete outfits.

The shirts table might be set up in the following way:

shirt_id | cubby_location | color | style | occasion
LS01 | 8 | black | long-sleeve | work

The pants table:

pants_id | cubby_location | color | style | occasion
JE01 | 20 | blue | jeans | casual

The outfit table:

outfit_id | shoe_id | shirt_id | pants_id
OU01 | SN02 | LS01 | JE03
OU02 | SN01 | TS02 | JE01
OU03 | SN01 | LS01 | JE03

The attributes in our outfit table correspond to the primary keys for each clothing item. In our example, we've decided to make the primary key for each outfit OU followed by a unique number.

In this example, the outfit attributes are keys referring to other tables. The outfit table's attributes shoe_id, shirt_id, and pants_id correspond to ids from the shoe, shirt, and pants tables, respectively. Looking at the same example as before, to create outfit OU01, we would get the value for shoe_id, which is SN02, and then query the shoe table for the item with a matching id. The same process is followed for the shirts table and pants table. The outfit points to its related items by storing their keys as its own attributes.
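The lookup process described here can be sketched in plain Java, with each table modeled as a Map from primary key to a short description. This is an in-memory illustration only, and the names OutfitLookup, find, and describe are hypothetical. The table excerpts in this reading do not list every item an outfit refers to (JE03, for example), so the sketch reports those ids as missing rather than guessing their attributes.

```java
import java.util.Map;

// Each Map plays the role of one table: primary key -> description of the item.
class OutfitLookup {
    static final Map<String, String> SHOES = Map.of(
            "SN01", "grey sneaker",
            "BO01", "grey boot",
            "SN02", "black sneaker");
    static final Map<String, String> SHIRTS = Map.of("LS01", "black long-sleeve shirt");
    static final Map<String, String> PANTS = Map.of("JE01", "blue jeans");

    // outfit_id -> {shoe_id, shirt_id, pants_id}, copied from the outfit table.
    static final Map<String, String[]> OUTFITS = Map.of(
            "OU01", new String[] {"SN02", "LS01", "JE03"},
            "OU02", new String[] {"SN01", "TS02", "JE01"},
            "OU03", new String[] {"SN01", "LS01", "JE03"});

    // Looks up one key in one table, flagging ids the excerpts do not include.
    static String find(Map<String, String> table, String key) {
        return table.getOrDefault(key, key + " (not in excerpt)");
    }

    // Follows the three keys stored on the outfit to the items they refer to.
    static String describe(String outfitId) {
        String[] ids = OUTFITS.get(outfitId);
        return find(SHOES, ids[0]) + ", " + find(SHIRTS, ids[1]) + ", " + find(PANTS, ids[2]);
    }

    public static void main(String[] args) {
        // prints: black sneaker, black long-sleeve shirt, JE03 (not in excerpt)
        System.out.println(describe("OU01"));
    }
}
```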

Figure 1: Diagram showing SN01 pointing to both OU02 and OU03


Tables can be used to represent a relationship where an item in a table can be associated with several items in a different table, which is called a 1 to N, or one-to-many, relationship. Our clothing and outfits tables represent a one-to-many relationship because one article of clothing can be worn with many outfits. For example, the shoe SN01 is used for both outfit OU02 and OU03, showing that the shoes and outfits tables have a one-to-many relationship (shown in the diagram above). The same is true for the shirts and outfits tables and the pants and outfits tables.
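As a small illustration of the one-to-many relationship, the following Java sketch walks the outfit table and collects every outfit that uses a given shoe. It is an in-memory model, not DynamoDB code, and the names ShoeToOutfits and outfitsUsing are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShoeToOutfits {
    // outfit_id -> shoe_id, copied from the outfit table.
    // A TreeMap keeps the outfit ids in sorted order so results are predictable.
    static final Map<String, String> SHOE_BY_OUTFIT = new TreeMap<>(Map.of(
            "OU01", "SN02",
            "OU02", "SN01",
            "OU03", "SN01"));

    // Returns every outfit that refers to the given shoe: one shoe, many outfits.
    static List<String> outfitsUsing(String shoeId) {
        List<String> outfits = new ArrayList<>();
        for (Map.Entry<String, String> row : SHOE_BY_OUTFIT.entrySet()) {
            if (row.getValue().equals(shoeId)) {
                outfits.add(row.getKey());
            }
        }
        return outfits;
    }

    public static void main(String[] args) {
        System.out.println(outfitsUsing("SN01")); // prints [OU02, OU03]
    }
}
```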

It doesn't always make sense for tables to have a one-to-many relationship, however. Some relationships only ever pair one item with one other item, so multiple entities can never share the same attribute value (for example, if you were relating citizens to driver's license IDs).

Benefits of a Database

Protection

A well-written database will ensure that your data is still there tomorrow, so luckily you don't need to worry much about protecting it. Each change, often called a transaction or a commit, is stored in a transaction log, which is a history of the actions executed by the database system. If the database runs into trouble, it can often recover all data by reprocessing the transaction log from a last known good state. A database ensures that few, if any, transactions will be lost.

When an error occurs, databases won't perform partial writes. A transaction will be fully completed, or it won't happen at all. The simplest case is of writing a single item: either the entire item is added to the database, or none of it is. The database will not allow part of the item to be written, even in the case of an unexpected error. You can also compare this to a transaction you make at a store: when you purchase something, there are two possible outcomes: you give your money and receive a product in exchange, or you don't give your money and receive no product in exchange. You wouldn't give your money without getting your product and they wouldn't give you the product if you didn't pay. Either both pieces of the transaction occur or none of it occurs.

A backup is an additional protection against data loss that most databases provide. A database backup stores a copy of the data so that it can be recovered later in case something goes wrong. Backup and recovery facilities are used to create and store copies of data to minimize data loss and are important to ensure the safety of your data. There are many backup and recovery facilities available, and usage usually depends on a team's preference.

Scalability

One of the other benefits of databases is that they can be used by many users at the same time. When you log into Amazon to order something, many other customers are logged in at the same time. The ability to handle multiple simultaneous requests is called concurrency, which describes multiple computations happening at the same time. Databases are one technology that employs concurrency. Many other examples exist, such as multiple applications running on a single computer and multiple computers in a network.

Concurrency can lead to issues, however, such as multiple users making edits to a document at the same time and accidentally overwriting each other's edits. Databases are useful in dealing with this as they handle the complexity that concurrency inherently creates so that we don't have to! Concurrency in databases helps manage users modifying data by restricting other users from making changes until the edits are committed. A database commit refers to updating an item in a database and making it available for other users to see the changes. Commits are permanent saves and are necessary to prevent uncertainty about whether the available data is correct. You will learn more about concurrency in later units (when you'll be using concurrency in your own code).

Databases are also designed to store large amounts of data, often more than can fit on one machine. Modern databases handle this issue by storing data across many machines that work together, referred to as distributed data storage.

Benefits of Distributed Datastores

Distributed datastores are databases that store data across multiple machines. These allow a much larger set of data to be stored in a single database. As data grows, additional machines can be added to store it all.

Distributed databases can also support more concurrent requests to access data because each machine can handle data requests. If many requests arrive at the same time to access the same data, the database can distribute access across many machines to prevent any one machine from having to process all the requests itself.

Distributed databases can also run across multiple geographic locations to keep handling requests even when an issue occurs in one location. If you are using a non-distributed datastore model and the one location where you have your database loses power from a natural disaster, then no users will be able to access the data until your database gets power again. In a distributed datastore model, however, if one location goes down or is experiencing errors, then the other locations can usually provide access to the same data with minimal performance loss.

Partition and Sort Keys

Primary Keys in DynamoDB

Primary keys in DynamoDB uniquely identify each item in a table. DynamoDB supports two types of primary keys:

  1. Partition Key Only - A simple primary key with a single attribute called the partition key
  2. Composite Primary Key - A composite primary key consisting of a partition key and a sort key

The partition key determines the partition where your data is stored. It's used by DynamoDB's internal hash function to distribute data across partitions for scalability. For example, in a shoe table, "shoe_id" might be a good partition key.
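The idea of hashing a partition key to pick a storage location can be sketched in a few lines of Java. This is an illustration under loud assumptions: String.hashCode() stands in for DynamoDB's hash function, and the partition count of 4 is arbitrary. DynamoDB's real hash function and partition layout are internal to the service, and PartitionSketch and partitionFor are hypothetical names.

```java
class PartitionSketch {
    // Maps a partition key to one of partitionCount buckets.
    // ASSUMPTION: String.hashCode() is only a stand-in for DynamoDB's internal hash.
    static int partitionFor(String partitionKey, int partitionCount) {
        // floorMod keeps the result in [0, partitionCount) even for negative hash codes.
        return Math.floorMod(partitionKey.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        for (String shoeId : new String[] {"SN01", "BO01", "SN02"}) {
            System.out.println(shoeId + " -> partition " + partitionFor(shoeId, 4));
        }
    }
}
```

The property that matters is that the same key always maps to the same partition, so the database knows where to look for an item without searching every machine.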

The sort key allows you to organize items with the same partition key. This is especially useful for related items that need to be retrieved together.


// Example of a table with composite primary key
Table: MusicLibrary
{
  "artist": "Black Eyed Peas",  // Partition key
  "song_title": "I Gotta Feeling",  // Sort key
  "genre": "pop",
  "year": 2009
}

{
  "artist": "Black Eyed Peas",  // Same partition key
  "song_title": "Pump It",  // Different sort key
  "genre": "pop",
  "year": 2005
}
                

Understanding Primary Keys

In the previous reading we briefly discussed primary keys and how they provide a unique identifier for each record in your table, but let's dive a little deeper.

Let's return to the shoe organizer table:

shoe_id | cubby_location | color | style | occasion
SN01 | 1 | grey | sneaker | athletic
BO01 | 2 | grey | boot | work
SN02 | 3 | black | sneaker | casual

The primary key in this table is shoe_id. We've formed our keys by using both the first two letters of the style of shoe and a number that increments for each repeat style. The particular scheme we use is somewhat arbitrary; the important thing is that our key, shoe_id, is unique for each item in the table.

To show the need for primary keys we'll consider what would happen if our table was missing the shoe_id attribute.

cubby_location | color | style | occasion
1 | grey | sneaker | athletic
2 | grey | boot | work
3 | black | sneaker | casual

Now take a look at our outfit table from the previous reading:

outfit_id | shoe_id | shirt_id | pants_id
OU01 | SN02 | LS01 | JE03
OU02 | SN01 | TS02 | JE01
OU03 | SN01 | LS01 | JE03

Our outfit table used the unique IDs for shoes, shirts, and pants to identify which specific article of clothing, but if we remove the shoe_id from the shoe organizer table, then what will we use to identify the shoes here in the outfits table?

Let's see if we can replace shoe_id with one of the other attributes in our shoe table, such as 'style.'

outfit_id | style | shirt_id | pants_id
OU01 | sneaker | LS01 | JE03
OU02 | boot | TS02 | JE01
OU03 | sneaker | LS01 | JE03

If you want to retrieve the shoes for outfit OU02, this happens to work out, because we only have one pair of boots in the table. However, if we want to retrieve the shoes for outfit OU01 or OU03, we run into a problem. Looking at our shoe table we can see that we own two pairs of sneakers, but it's not clear which pair of shoes completes outfit OU01 and which completes OU03. We have no other way to distinguish which pair of sneakers we are looking for in our outfit table. We also can't wear two pairs of sneakers with one outfit.

If we changed the attribute to 'occasion' instead of 'style' it would work for now, but as soon as we buy another pair of athletic, work, or casual shoes, we're back to the same issue. Since only one pair of shoes can fit in each cubby, we could use cubby_location as a unique identifier. However, the identifying values would get messed up if we misplaced a shoe or reorganized our closet.

We nearly always want our key values to be immutable, meaning that the values never change no matter what other attributes change on the item, so that we don't have to worry about changing the key and messing up any other entities that refer to that key. The best way to ensure that the key value is immutable is by picking an attribute that is meaningless and won't ever have to be changed, such as our shoe_id of SN01. That's why we structured our shoe table in the following way:

shoe_id | cubby_location | color | style | occasion
SN01 | 1 | grey | sneaker | athletic
BO01 | 2 | grey | boot | work
SN02 | 3 | black | sneaker | casual

Primary keys can also be used to distinguish between multiple items that are exactly the same, such as owning two identical shirts. Without a primary key, it would be impossible to differentiate between the two. It may seem unimportant to differentiate between two identical shirts, but this can become more important with databases for online stores to avoid mistakenly trying to send the same exact box to five different addresses. Stores can assign a unique product ID to easily keep track of the individual items. It will depend on the business requirements whether distinguishing individual identical items is required. This will be a necessary part of designing the database tables (more in a later unit!).

Partition and Sort Keys

DynamoDB supports two kinds of keys: partition keys and sort keys. A partition key is a single value. Our shoe organizer table is a table with just a partition key, shoe_id. A sort key can be used to order values that have the same partition key value.

DynamoDB tables define their primary key with either just a partition key (with no sort key defined), or a partition + a sort key. In the first case, the primary key is the partition key, and in the second case, the primary key is the pair of partition + sort key, which is also referred to as a composite primary key. In order to identify a unique item in a partition + sort key table, you must specify both the partition and the sort key (it's possible---even likely---that multiple items will match the partition or the sort key alone).

With a composite primary key, the partition key itself does not have to be unique. It is possible for multiple items to have the same partition key, but among these items with the same partition key, each has to have a unique sort key. The following table compares partition key-only primary keys and composite primary keys:

Characteristic | Partition key-only primary key | Composite primary key
Include a partition key? | yes | yes
Include a sort key? | no | yes
What is unique? | partition key | partition + sort key pairings
When is it used? | when each item is completely independent, and we only ever need to look items up by their primary identifier | when items naturally group together in some way and we might order items within their natural group
Example | ASIN | artist + song

All the examples you've seen so far used a partition key-only primary key. Let's look at an example that uses a partition + sort key.

Examine the following table:

artist | song_title | genre | year
Black Eyed Peas | I Gotta Feeling | pop | 2009
Linkin Park | Numb | rock | 2003
Black Eyed Peas | Pump It | pop | 2005
Missy Elliott | Work It | rap | 2002
Daddy Yankee | Gasolina | latin pop | 2004

Our songs table uses a composite primary key: 'artist' is the partition key and 'song_title' is the sort key. As you can see, our partition key in this example is not a unique identifier, as 'Black Eyed Peas' is listed as the artist for two songs. The sort key differentiates the two items by having different values for song_title. We could even have two songs with the same song_title but by different artists in this table, and they'll have different primary keys, as they will have different artist + song_title pairings.
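One way to picture a composite primary key in Java is a small key class whose equality depends on both parts. The sketch below is an in-memory model, not DynamoDB API code. SongTable, SongKey, and putIfNew are hypothetical names, and rejecting a duplicate key is a choice made here only to make the uniqueness rule visible.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class SongTable {
    // Composite key: two items are the same only if BOTH parts match.
    static final class SongKey {
        final String artist;     // partition key
        final String songTitle;  // sort key

        SongKey(String artist, String songTitle) {
            this.artist = artist;
            this.songTitle = songTitle;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof SongKey)) {
                return false;
            }
            SongKey that = (SongKey) other;
            return artist.equals(that.artist) && songTitle.equals(that.songTitle);
        }

        @Override
        public int hashCode() {
            return Objects.hash(artist, songTitle);
        }
    }

    private final Map<SongKey, Integer> yearByKey = new HashMap<>();

    // Stores the song and returns true, or returns false if the pairing already exists.
    boolean putIfNew(String artist, String songTitle, int year) {
        return yearByKey.putIfAbsent(new SongKey(artist, songTitle), year) == null;
    }

    int size() {
        return yearByKey.size();
    }

    public static void main(String[] args) {
        SongTable songs = new SongTable();
        System.out.println(songs.putIfNew("Black Eyed Peas", "I Gotta Feeling", 2009)); // prints true
        System.out.println(songs.putIfNew("Black Eyed Peas", "Pump It", 2005));         // prints true
        System.out.println(songs.putIfNew("Black Eyed Peas", "Pump It", 2005));         // prints false
    }
}
```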

It would be possible to create a unique ID for each artist/song combination and just use a single partition key, so what's the benefit of creating the composite primary key? When you use a composite primary key, the partition key allows you to easily retrieve commonly needed groups of related items, such as all the songs by a certain artist. This provides additional flexibility when querying data, including the ability to specify an order for the items within the same partition. With our artist + song_title table, we can list the songs from each artist alphabetically. The choice for the sort key will depend on how you plan to use the data. We'll go into more detail in a later unit when we design DynamoDB tables.

Additionally, using a composite primary key provides a data storage benefit. Partition keys determine where an item is stored in the database. When you create a table with only a partition key, each item's location depends on its own key value, so related items may end up in separate locations. With a composite primary key, however, items with the same partition key are stored together, sorted by their sort key value, which can make related items faster to retrieve. Think of a filing cabinet: the partition key determines which folder each item is stored in, and the sort key orders the items within each folder.
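The filing cabinet picture can be sketched in Java as a Map of folders, where each folder is a TreeMap that keeps its entries ordered by sort key. This is a rough in-memory analogy, not a description of how DynamoDB is implemented, and MusicLibrarySketch, put, and songsBy are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MusicLibrarySketch {
    // artist (partition key) -> folder of song_title (sort key) -> year.
    // The TreeMap keeps each folder ordered by song title.
    private final Map<String, TreeMap<String, Integer>> folders = new HashMap<>();

    void put(String artist, String songTitle, int year) {
        folders.computeIfAbsent(artist, a -> new TreeMap<>()).put(songTitle, year);
    }

    // Returns one artist's song titles in sort-key (alphabetical) order.
    List<String> songsBy(String artist) {
        TreeMap<String, Integer> folder = folders.get(artist);
        return folder == null ? new ArrayList<>() : new ArrayList<>(folder.keySet());
    }

    public static void main(String[] args) {
        MusicLibrarySketch library = new MusicLibrarySketch();
        library.put("Black Eyed Peas", "Pump It", 2005);
        library.put("Linkin Park", "Numb", 2003);
        library.put("Black Eyed Peas", "I Gotta Feeling", 2009);
        System.out.println(library.songsBy("Black Eyed Peas")); // prints [I Gotta Feeling, Pump It]
    }
}
```

Retrieving all songs by one artist touches a single folder, and the titles come back already sorted regardless of the order in which they were added.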

DynamoDB Scalar and Set Types

When creating a database, it's important to understand the different data types used for attributes, and when you would want to use the different types. DynamoDB accepts both scalar types, a type representing exactly one value such as a String, and sets, a type representing multiple scalar values.

Scalar Types

Figure 1: A DynamoDB table named ClubMember, with attributes MemberId, Active, Age, and LastName. Download ClubMember table CSV export 1.

A scalar type represents exactly one value such as the String, Boolean, and Number types. This table includes all three of these scalar types.

DynamoDB type BOOL (Boolean type) maps to the Java types boolean and Boolean. Boolean values are actually represented by a 1 or 0, but in DynamoDB we represent them in English terms as true (corresponding to 1) or false (corresponding to 0). The Boolean type is therefore useful for attributes that can only be true or false, such as whether a club member is active (the Active attribute).

DynamoDB type S (String type) maps to a Java String. In ClubMember, the attribute LastName is of type S. A String is a sequence of characters, able to represent text or alphanumeric (containing numbers and letters) data, such as IDs that consist of mixed numbers and letters.

DynamoDB type N (Number type) maps to "all Number types" in Java. Java Number types include both primitives and primitive wrapper classes, including Long/long, Integer/int, Double/double, Float/float, BigDecimal and BigInteger. All numbers are sent across the network to DynamoDB as Strings to maximize compatibility across languages and libraries; however, DynamoDB treats them as Number type attributes for mathematical operations. Numbers can be positive, negative, or zero, and can be whole numbers or decimals. When reading from DynamoDB to Java, it is up to the developer to ensure that the value retrieved from DynamoDB is valid for the Java type they are trying to read (e.g. an int can't have decimal places). In ClubMember, Age is type N and maps to Java type Integer.
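To illustrate the developer's responsibility when reading a Number, here is a minimal Java sketch that converts the String form of a number into an int and fails loudly if the value does not fit. It uses only java.math.BigDecimal from the standard library and makes no DynamoDB calls; NumberAttributeSketch and toInt are hypothetical names.

```java
import java.math.BigDecimal;

class NumberAttributeSketch {
    // Converts the String form of a Number attribute to an int.
    // intValueExact() throws ArithmeticException if the value has a nonzero
    // fractional part or is outside the range of int, instead of silently truncating.
    static int toInt(String numberAttribute) {
        return new BigDecimal(numberAttribute).intValueExact();
    }

    public static void main(String[] args) {
        System.out.println(toInt("30")); // prints 30
        try {
            toInt("30.5");
        } catch (ArithmeticException e) {
            System.out.println("30.5 is not a valid int");
        }
    }
}
```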

Set Type

Figure 2: A DynamoDB table named ClubMember, with attributes MemberId, Active, Age, LastName, Committees and YearsActive. Download ClubMember table CSV export 2.

Set types represent multiple scalar values. Sets are used when an attribute can have multiple, unique values, just like a Java Set. But unlike a Java Set, which can hold any kind of object, all the values in a DynamoDB Set must share one scalar type. The Sets used in this module are Sets of Strings and Sets of Numbers.

We've updated the ClubMember table from the previous section. It now has an attribute for Committees, which shows all the committees the member is currently a part of. Each member is a part of one committee, several different committees, or no committees. Since the data in our Committees attribute is represented by text, the attribute is of DynamoDB type SS (String Set), which maps to Java type Set<String>.

Our ClubMember table also features an attribute for YearsActive which lists all the years the member has been active in the club. Each member has been active for one or more years. Since the data in our YearsActive attribute is represented by numbers, the attribute is of DynamoDB type NS (Number Set), which maps to Java type Set<T>. For the DynamoDB type NS, the Java Type is depicted as a generic type representing any of the primitive wrapper classes. The most commonly used numeric types in ATA will be Integer and BigDecimal.
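The type mappings covered so far can be summarized in a small Java helper that names the DynamoDB type for a given Java value. This is a study aid only, covering just the five types from this module. DynamoTypeSketch and dynamoTypeFor are hypothetical names, the sample values are made up, and the helper rejects empty Sets because DynamoDB does not allow a Set attribute with no values.

```java
import java.util.Set;

class DynamoTypeSketch {
    // Returns the DynamoDB type code (BOOL, S, N, SS, or NS) that fits the Java value.
    static String dynamoTypeFor(Object value) {
        if (value instanceof Boolean) {
            return "BOOL";
        }
        if (value instanceof String) {
            return "S";
        }
        if (value instanceof Number) {
            return "N"; // Integer, Long, Double, BigDecimal, and so on
        }
        if (value instanceof Set) {
            Set<?> set = (Set<?>) value;
            // Empty Sets fall through to the exception below.
            if (!set.isEmpty() && set.stream().allMatch(e -> e instanceof String)) {
                return "SS";
            }
            if (!set.isEmpty() && set.stream().allMatch(e -> e instanceof Number)) {
                return "NS";
            }
        }
        throw new IllegalArgumentException("No type from this module fits: " + value);
    }

    public static void main(String[] args) {
        System.out.println(dynamoTypeFor(true));                // prints BOOL
        System.out.println(dynamoTypeFor("Smith"));             // prints S
        System.out.println(dynamoTypeFor(30));                  // prints N
        System.out.println(dynamoTypeFor(Set.of("Events")));    // prints SS
        System.out.println(dynamoTypeFor(Set.of(2019, 2020)));  // prints NS
    }
}
```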

It's important to remember that all the values in a Set must be unique, so they should only be used when you know all the data for that attribute will be different. For example, Sets are useful to keep track of the YearsActive for a club member, as it wouldn't make sense to list a year twice. Additionally, Sets are not an ordered collection, meaning there are no guarantees for the position of elements in the Set. This is the same as the Java Set type, which differs from the Java List type where order is guaranteed.
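The uniqueness rule behaves the same way in a Java Set, which makes it easy to try out. In this minimal sketch the years are made-up sample values and YearsActiveSketch and recordYear are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

class YearsActiveSketch {
    // Adds a year and reports whether the Set actually changed.
    static boolean recordYear(Set<Integer> yearsActive, int year) {
        return yearsActive.add(year);
    }

    public static void main(String[] args) {
        Set<Integer> yearsActive = new HashSet<>();
        System.out.println(recordYear(yearsActive, 2019)); // prints true
        System.out.println(recordYear(yearsActive, 2020)); // prints true
        System.out.println(recordYear(yearsActive, 2019)); // prints false: already present
        System.out.println(yearsActive.size());            // prints 2
    }
}
```

Because a HashSet makes no promise about iteration order, code that needs the years in a particular order should sort them after reading them.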

Guided Project

Preparedness Tasks

Task 1: Setup Sprint 13 Challenge Repo

This Sprint culminates in a Sprint Challenge project. You should begin by forking and cloning the Sprint Challenge starter repo:

Sprint 13 Challenge Starter Repo

This will be your project repo for Sprints 13 - 15.

This resource is also visible under the Sprint Challenge section of the course page. After each module, you will be assigned a mastery task with instructions on adding to or modifying the starter code for the challenge.

Upon completion of all mastery tasks, the Sprint Challenge project will be complete and ready for you to submit to CodeGrade. The CodeGrade submission page is available under the Sprint Challenge section on the modules page.

Task 2: Create and Initialize Your DynamoDB Tables

For this project, you will need to create two DynamoDB Tables, playlists and album_tracks.

First, make sure you have the AWS CLI installed on your machine. You can find installation instructions in the AWS documentation.

You have been provided a CloudFormation template which will create the two tables for you. You are encouraged to read through this file, located in the configurations directory of this project, to see the schema for each table.

Run the following command to create these tables in DynamoDB:

aws cloudformation create-stack --region us-west-2 --stack-name musicplaylistservice-createtables --template-body file://configurations/tables.template.yml --capabilities CAPABILITY_IAM

Once you've verified your tables exist on AWS it is time to populate the album_tracks table. We'll use a JSON file which you should read over first in order to get an idea of how it works and what it looks like. When you are ready, run the following command:

aws dynamodb batch-write-item --request-items file://configurations/AlbumTracksData.json

Verify once again that your album_tracks table on AWS has now been populated.

Additional Resources