Module 4 - DynamoDB Scan

Module Overview

Learn about DynamoDB scan operations and best practices for efficiently querying and filtering data from your DynamoDB tables.

Learning Objectives

  • Design and implement functionality that uses scan with a filter to retrieve a subset of items from a provided DynamoDB table
  • Design and implement functionality that uses paginated scan with a limit to retrieve a subset of items from a provided DynamoDB table
  • Outline when to scan a DynamoDB table
  • Outline how to use DynamoDBMapper's scan() method

DynamoDB Scan

Lesson overview

You've learned several methods for retrieving data from DynamoDB. In this lesson, we're going to expand upon this knowledge with the DynamoDBMapper scan() method. The scan() method grants you maximum flexibility over the data you can access from a table.

You're going to learn about the DynamoDBMapper scan() method, which searches the table and retrieves the values that meet a condition on any attribute. This method gives you greater flexibility over the data you retrieve and the questions you can ask your table. The scan() method, however, costs a lot more to read items than the query() method does.

DynamoDBMapper scan() method

At this point, you've learned several DynamoDBMapper methods to write items to a table (save() and delete()) and to read items from a table (load() and query()). The query() method allows you to search the table and retrieve multiple items when the partition key is equal to some value and the sort key matches some condition. However, you can only perform a query on the table's key attributes. The partition key must always be equal to some value and you can additionally set a condition on the sort key. This specification means that you are unable to use the query() method to retrieve data based on non-key values, which limits the questions you can ask.

Fortunately, DynamoDBMapper also has the scan() method, which allows you greater flexibility when retrieving information from a table. Before we get into how to use the scan() method, however, let's revisit an example you've previously worked with.

In the last unit we worked with the OnlineOrders table, which has a partition key of userId, a sort key of orderId, and the additional attribute, totalCost.

The OnlineOrders table looks like the following:

userId orderId totalCost
62da86ba 20121207-54297 24.89
62da86ba 20140425-03576 45.87
62da86ba 20140425-54879 32.01
62da86ba 20170923-54129 103.87
62da86ba 20190305-78159 209.13
62da86ba 20200105-42158 89.65
d1a93bea 20091204-78216 34.23
d1a93bea 20130817-45789 17.03

Previously, we could only query items using the key attributes, meaning we could only retrieve results that had the same userId value. What if we want to return all the orders in the table and not just items with the same partition key value? We can do this using the DynamoDBMapper scan() method!

The scan() method shares some similarities with the query() method, but it reads every item in a table or secondary index. The scan() method also doesn't require a composite primary key (partition and sort key). You can use it on tables that have just a partition key. While the query() method requires you to give a value for the partition key, the scan() method doesn't require any key values and by default returns every item in the table.

Let's look at the Java class, Order:

@DynamoDBTable(tableName = "OnlineOrders")
public class Order {

    private String userId;
    private String orderId;
    private BigDecimal totalCost;

    @DynamoDBHashKey(attributeName = "userId")
    public String getUserId() {
        return userId;
    }
    public void setUserId(String userId) {
        this.userId = userId;
    }

    @DynamoDBRangeKey(attributeName = "orderId")
    public String getOrderId() {
        return orderId;
    }

    public void setOrderId(String orderId) {
        this.orderId = orderId;
    }

    @DynamoDBAttribute(attributeName = "totalCost")
    public BigDecimal getTotalCost() {
        return totalCost;
    }

    public void setTotalCost(BigDecimal totalCost) {
        this.totalCost = totalCost;
    }
}

Using the scan() method and the above class, returning all the items in the OnlineOrders table looks like the following:

DynamoDBMapper mapper = new DynamoDBMapper(DynamoDbClientProvider.getDynamoDBClient());
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
PaginatedScanList<Order> orderList = mapper.scan(Order.class, scanExpression);

The above code looks very similar to the code for the query() method, with a few key differences. Firstly, the scan() method uses the DynamoDBScanExpression instead of DynamoDBQueryExpression. The scan() method returns a PaginatedScanList instead of PaginatedQueryList, but still paginates data the same way the query() method does. scan() and query() take advantage of DynamoDB's auto-pagination feature. Results are loaded on demand when the user takes an action that requires them. For explicit pagination, we can use scanPage() which we will cover later in the reading. The scan() method requires a scanExpression as an argument, but since we want all the values in the table, we aren't applying any filters and don't need to specify anything in the expression. We similarly don't need to specify a partition key value because we're retrieving all the items in the table regardless of their partition key value.

DynamoDBMapper scan() with filter expression

Retrieving all the items from a table works well with a small table, but many tables have hundreds or thousands of entries. It's more likely that you'll want to retrieve all the items that meet a certain condition, like we do with the query() method.

Let's say we want to want to get all the orders that are between a certain cost range. Since totalCost isn't a key attribute, we can't perform this search using query() and instead need to use scan(). The filter expression functions very similarly to the withKeyConditionExpression() method you learned when querying. It uses the same comparators, functions, and logical operators. The filter on a scan can be applied to any attribute, not just key attributes. The filter is applied after the scan finishes, but before the results are returned. The scan still looks at each item in the table, but only returns the items that meet the filter condition. This means that you still pay the price (time and money) of reading the entire table whether you apply a filter expression or not. However, it is faster to send back less data, and your code does not need to apply the filtering! Note: filter expressions are not unique to scans and can be applied to queries as well. When applied to a query the filter can not be applied to a key attribute.

The code to return all the orders that cost between $30 and $70 looks like the following:

DynamoDBMapper mapper = new DynamoDBMapper(DynamoDbClientProvider.getDynamoDBClient());
Map<String, AttributeValue> valueMap = new HashMap<>();
valueMap.put(":minCost", new AttributeValue().withN("30.00"));
valueMap.put(":maxCost", new AttributeValue().withN("70.00"));
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression()
    .withFilterExpression("cost between :minCost and :maxCost")
    .withExpressionAttributeValues(valueMap);
PaginatedScanList<Order> orderList = mapper.scan(Order.class, scanExpression);

It's likely much of this looks familiar to you, as it is very similar to the query() method. Let's break it down a bit further to make sure you understand all the pieces.

DynamoDBMapper mapper = new DynamoDBMapper(DynamoDbClientProvider.getDynamoDBClient());

The first line creates a new mapper instance just like we've done for all the DynamoDBMapper methods.

Map<String, AttributeValue> valueMap = new HashMap<>();
valueMap.put(":minCost", new AttributeValue().withN("30.00"));
valueMap.put(":maxCost", new AttributeValue().withN("70.00"));

The next section creates the HashMap, valueMap, which contains the values we will use in our filter expression. This looks the same as the one we used for the query() method when creating key condition expressions. Notice that we still need to include the ':' in the names of the attributes.

DynamoDBScanExpression scanExpression = new DynamoDBScanExpression()
.withFilterExpression("totalCost between :minCost and :maxCost")
.withExpressionAttributeValues(valueMap);

The next section declares the DynamoDBScanExpression, sets the filter expression condition, and provides the expression attribute value map we created above. Since we want to return all the items that have a totalCost between $30 and $70, we want to use the 'between' comparator. Again, the biggest difference is that we do not need to set a value for the partition key and apply the filter on non-key attributes.

PaginatedScanList<Order> orderList = mapper.scan(Order.class, scanExpression);

The final line calls the scan() method, which accepts the Order class and the scanExpression we just created.

The above code returns the following bolded items from the OnlineOrders table:

userId orderId totalCost
62da86ba 20121207-54297 24.89
62da86ba 20140425-03576 45.87
62da86ba 20140425-54879 32.01
62da86ba 20170923-54129 103.87
62da86ba 20190305-78159 209.13
62da86ba 20200105-42158 89.65
d1a93bea 20091204-78216 34.23
d1a93bea 20130817-45789 17.03

Pagination with DynamoDBMapper scanPage()

There are times when you need to paginate results for a client, returning only a subset of the results at a time. For example, you may be presenting your scan results from above (all online orders) on a webpage for a business team to review. The webpage may only show 50 orders at a time, requiring the user to click the next button to load 50 more. You can achieve this by utilizing the scanPage() method. You can provide it a limit of 50, and utilize the exclusiveStartKey to provide your next scan the correct starting location in your table. The withExclusiveStartKey method for scans requires you to provide a Map of the table's key(s).

Let's walk through an example. You want to retrieve all orders, 4 at a time. The code looks like the following:

public List<Order> scanOrdersWithLimit(Order previousEndOrder) {
    DynamoDBScanExpression scanExpression = new DynamoDBScanExpression()
        .withLimit(4);

    if(previousEndOrder != null) {
        Map<String, AttributeValue> startKeyMap = new HashMap<>();
        startKeyMap.put("userId", new AttributeValue().withS(previousEndOrder.getUserId()));
        startKeyMap.put("orderId", new AttributeValue().withS(previousEndOrder.getOrderId()));
        scanExpression.setExclusiveStartKey(startKeyMap);
    }

    ScanResultPage<Order> orderPage = mapper.scanPage(Order.class, scanExpression);
    return orderPage.getResults();
}

This code snippet looks very similar to the previous, with the addition of the withExclusiveStartKey() and withLimit() methods. As you can see, we're passing in an previousEndOrder value, which will be null the first time we call the method. Calling scanOrdersWithLimit() from a main() method looks like the following:

public static void main(String[] args) {
    //scanning the table the first time
    List<Order> orders = scanOrdersWithLimit(null);
}

Since we passed in null for the value of the previousEndOrder, we don't call setExclusiveStartKey(), and this scan will start at the beginning of the table and scan the first four items. The items that are scanned are bolded in the following table:

userId orderId totalCost
62da86ba 20121207-54297 24.89
62da86ba 20140425-03576 45.87
62da86ba 20140425-54879 32.01
62da86ba 20170923-54129 103.87
62da86ba 20190305-78159 209.13
62da86ba 20200105-42158 89.65
d1a93bea 20091204-78216 34.23
d1a93bea 20130817-45789 17.03

Let's say we now want to get the next four items in our table. We can call scanOrdersWithLimit() a second time from our main() method:

public static void main(String[] args) {
    //scanning the table the first time
    List<Order> orders = scanOrdersWithLimit(null);
    //scanning the table a second time using the last Order returned from the first scan as the exclusiveStartKey
    Order previousEndOrder = orders.get(orders.size() - 1);
    List<Order> orders = scanOrdersWithLimit(previousEndOrder);
}

When we call scanOrdersWithLimit() a second time, we pass in the last Order that was returned from the first scan to be converted to our exclusive start key. Now when the table is scanned it will know which item was last scanned and scan the next four values. The items that are scanned are bolded in the following table:

userId orderId totalCost
62da86ba 20121207-54297 24.89
62da86ba 20140425-03576 45.87
62da86ba 20140425-54879 32.01
62da86ba 20170923-54129 103.87
62da86ba 20190305-78159 209.13
62da86ba 20200105-42158 89.65
d1a93bea 20091204-78216 34.23
d1a93bea 20130817-45789 17.03

Filtering with limits is challenging! Limits are applied to the number of items that DynamoDB reads. Filters are then applied before returning results. If we consider our example above, and add in a filter expression so that the totalCost of an order must be greater than $50 we may not retrieve 4 items. On our first scan we will start at the beginning of the table and read the first four items. The items that are read are bolded in the following table:

userId orderId totalCost
62da86ba 20121207-54297 24.89
62da86ba 20140425-03576 45.87
62da86ba 20140425-54879 32.01
62da86ba 20170923-54129 103.87
62da86ba 20190305-78159 209.13
62da86ba 20200105-42158 89.65
d1a93bea 20091204-78216 34.23
d1a93bea 20130817-45789 17.03

Of these four items, only the last one meets our filter expression (totalCost >= 50). Since only one item meets our condition, only that one item will be returned, as shown in the following table:

userId orderId totalCost
62da86ba 20170923-54129 103.87

In order to get four items to return, this requires a bit of extra massaging. We won't go into the details, however at a high level you might utilize scan() with a filter expression and do the limiting in your code rather than in the request to DynamoDB. scan() also allows you to provide an exclusive start key and so on your next call you can skip over the already scanned items and start your scan at the correct location in your table. This also applies to queries, and a similar approach can be followed.

Conclusion

In this reading, we discussed the DynamoDBMapper method scan(). The scan() method is useful because it's a more flexible way to retrieve data from tables. Unfortunately, the scan() method is very expensive because it defaults to searching all the items in a table. You will want to design your tables in such a way that you minimize the need for scans! To help filter your scan, we learned how to use filter expressions. We also learned how to do pagination using scanPage().

Guided Project

Resources