Module 2: Metrics

Learning Objectives

  • Describe the lifecycle of a metric value from emission to expiration
  • Implement a method that emits values for a metric
  • Implement a method that emits zeros for a given count metric to enable the calculation of statistics
  • Locate a metric in CloudWatch emitted to a defined metric namespace by a specific service
  • Examine which aggregation period to use in CloudWatch for a given scenario
  • Examine which statistic out of min, max, sum, avg, n to use in CloudWatch to answer a given question
  • Manually identify the p-th percentile value of a small dataset
  • Use CloudWatch to identify the p-th percentile value of a large dataset
  • Design and implement a custom metric to satisfy a given business requirement
  • Explain a way in which metrics are used in software development
  • Explain the CloudWatch concept: namespace
  • Explain the CloudWatch concept: metric
  • Explain the CloudWatch concept: dimension
  • Explain the CloudWatch concept: statistics
  • Explain the CloudWatch concept: period
  • Explain the CloudWatch concept: latency
  • Explain the CloudWatch concept: alarm
  • Explain the CloudWatch concept: aggregation period
  • Explain the metrics percentile concept: percentiles
  • Explain the metrics percentile concepts: p50, p90, p99
  • Participate in a metric design process

Key Topics

  • AWS CloudWatch service and its features
  • Standard vs. custom metrics
  • Implementing metrics collection in Java applications
  • Setting up CloudWatch dashboards
  • Creating and managing CloudWatch alarms
  • Correlation between metrics and application performance

Introduction to Metrics and CloudWatch

Understanding CloudWatch Concepts

AWS CloudWatch is a monitoring and observability service that provides data and actionable insights for your applications. Key CloudWatch concepts include:

  • Namespace: A container for metrics that share a common purpose or source
  • Metric: A time-ordered set of data points representing values being measured
  • Dimension: A name/value pair that uniquely identifies a metric
  • Statistics: Aggregated data points for a specified period (sum, average, min, max, count)
  • Period: The time interval over which statistics are applied
  • Alarm: A resource that watches a single metric and performs actions based on the metric's value

Metrics

Your code has been written, reviewed, deployed, and tested. It is now running in production. Your job is done, right? Not quite. Once code is moved to production, there is still work to be done monitoring the application as well as reacting to busy and slow periods. There is also a need to collect data to help debug issues along with using data to help make business and operations decisions. Logs are one way to track what is going on in production, but is there an easier, more visual way to track how your code is performing? Yes! This is where metrics come in.

In programming, metrics are just measurement values that are collected over time. A business might track usage of a particular resource. At other times, it may want to track the performance of specific pieces of code. Or there may be something completely custom to track that can help the business make better decisions. At Amazon, Amazon Web Services (AWS) CloudWatch is used to collect and view metrics. This reading will introduce you to CloudWatch and metrics concepts so that you can use them in your own projects.

AWS CloudWatch

As mentioned in the introduction, AWS CloudWatch is a metrics collection and monitoring service for use with applications running on AWS. This section outlines the components and concepts needed to understand how to use CloudWatch.

CloudWatch Components

There are three main components for accessing and using CloudWatch.

Amazon CloudWatch Console

The CloudWatch Console is a browser-based interface for viewing, searching, and managing data in CloudWatch. The CloudWatch Console can be customized to focus in on data you consider most important and can even be used to configure alarms that notify you if thresholds are breached.

CloudWatch API

The CloudWatch API allows developers to publish, monitor, and manage metrics in CloudWatch. The API is available for a variety of languages and frameworks including Java, JavaScript, PHP, Python, Ruby, and .NET.

AWS Command Line Interface

In addition to its many other functions, the AWS Command Line Interface (CLI) also includes the ability to access CloudWatch. The CLI can be used to push metrics to CloudWatch and pull metrics data from CloudWatch.
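For example, you might push a single data point and then pull aggregated statistics back out with commands along these lines (a sketch only; the namespace, metric name, values, and times here are placeholders):

aws cloudwatch put-metric-data --namespace "MyApp" --metric-name PageViews --value 1 --unit Count

aws cloudwatch get-metric-statistics --namespace "MyApp" --metric-name PageViews \
    --start-time 2023-05-01T00:00:00Z --end-time 2023-05-02T00:00:00Z \
    --period 3600 --statistics Sum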

Built-in AWS Metrics

Many Amazon Services automatically publish metrics to CloudWatch. They show up in CloudWatch with namespaces that begin with "AWS/" and end with the name of the service. For example, Amazon DynamoDB metrics are stored under the namespace "AWS/DynamoDB". CloudWatch namespaces are covered later in this reading. For now, just think of it as a way to organize the data coming into CloudWatch from a variety of sources.

Custom Metrics and the CloudWatch API

In addition to built-in metrics, you can publish custom metrics to CloudWatch. These can track whatever you need to track in your project. There are some differences in how long CloudWatch holds on to custom metric data, but for the most part custom metrics are treated the same as any other metric data in CloudWatch. Most often, you'll use the CloudWatch API in your code to publish your custom metrics.

CloudWatch Concepts

To understand CloudWatch and how to use it effectively, it's important to understand the terminology and concepts used within CloudWatch.

Namespace

Namespaces can be thought of as containers for metric data in CloudWatch. The namespace is usually just the name of a service or application. As mentioned earlier in this reading, built-in AWS namespaces all begin with "AWS/" and end with a unique service name, such as "AWS/EC2" for Amazon EC2. When you create custom metrics, avoid beginning your namespace names with "AWS/". The only other restrictions are that a namespace must be fewer than 256 characters and may contain numbers, letters (uppercase and lowercase), and the special characters period (.), hyphen (-), underscore (_), forward slash (/), hash (#), and colon (:).

Metric

A metric is the fundamental data structure in CloudWatch. A single metric data point represents one measurement made at one point in time; by collecting data points over a time period, comparisons can be made about the state of an application over time. There are two types of data in a single metric. The first is identifying information. A metric is identified by a namespace, zero to ten dimensions (these are explained below), and a name. A metric name identifies what is being measured, for example, ErrorCount or Latency. The second type of data in a metric is the actual measurement data. The data within the metric includes a timestamp, a value, and an optional unit of measure. The unit is used to give context to the metric. For example, if measuring response times, are the units of the metric seconds, milliseconds, or microseconds? CloudWatch provides a number of allowed values for a unit. If none of them fit the metric, the default value is "None". While CloudWatch does not use the unit itself, the unit does provide additional context to any person or application reading the data.

Metrics are locked to the region they are created in and cannot be deleted. However, they expire after 15 months. In addition, metric data may be published with timestamps up to two weeks in the past and up to two hours in the future.
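To make the two types of data concrete, a single (hypothetical) metric data point might look like this:

Namespace:   EXAMPLE/ALEXA_TRANSLATOR
Metric name: TRANSLATION_TIME
Dimensions:  LANGUAGE_FROM=en, LANGUAGE_TO=es
Timestamp:   2023-05-01T12:00:00Z
Value:       212.0
Unit:        Milliseconds

The namespace, metric name, and dimensions identify the metric; the timestamp, value, and unit are the measurement itself.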

Dimension

A dimension is just a name and value pair. Dimensions are used to further identify metric data within a namespace and metric name. For example, if you have a service running on multiple servers, you may want to keep track of which server your metric data is coming from. In this example, you could create a dimension named "server" with the server name as the value, such as "Test" or "Production". Common dimensions used at Amazon include stage (to indicate which deployment stage the data is coming from), country (the country of origin for the data), and operationName (to separate metrics measuring the same thing across different operations in a service).

Below is an example of some error metrics with different dimensions that a service could have in CloudWatch. Here we have a dimension named Stage with values of Gamma and Production, which we use to represent metrics for the different deployment stages of our service. We also have a dimension named Marketplace with values of US, CA, and MX to represent the metrics for the US, Canada, and Mexico marketplaces. Because we have defined separate dimensions for Stage and Marketplace, you can see that we have separate Gamma and Production metrics for each of the different marketplaces.

Figure 1

Figure 1: Examples of dimensions for metrics on CloudWatch. Here the Kindle Publishing Service has the Stage and Marketplace dimensions to distinguish metrics on the various deployment stages and country marketplaces.

Statistics

CloudWatch aggregates data over a time period into statistics. These statistics describe the data including the minimum, maximum, sum, average, sample count, and percentiles. Percentile data can be retrieved with accuracy up to two decimal places (for example, pNN.NN). See the table below for a brief description of each of the available statistics. Note that the set of values refers to all the values over a given time period.

Statistic Definition
Minimum The lowest value in a set of values.
Maximum The highest value in a set of values.
Sum The total of all the values added together.
Sample Count The number of values in a set.
Average The sum divided by the sample count.
Percentile (pNN.NN) The value below which a given percentage of the values in the set fall.

Instead of adding individual metric values to CloudWatch, you may also publish pre-calculated statistics using a sample count, minimum, maximum, and sum.
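As a minimal sketch of what that looks like with the Java CloudWatch API (classes from the com.amazonaws.services.cloudwatch.model package; the namespace, metric name, and values here are placeholders), a StatisticSet carries the pre-calculated statistics in place of a single value:

// Pre-aggregated statistics for one period (hypothetical values)
StatisticSet stats = new StatisticSet()
    .withSampleCount(250.0)
    .withMinimum(12.0)
    .withMaximum(480.0)
    .withSum(31500.0);

MetricDatum datum = new MetricDatum()
    .withMetricName("Latency")
    .withUnit(StandardUnit.Milliseconds)
    .withStatisticValues(stats);

PutMetricDataRequest request = new PutMetricDataRequest()
    .withNamespace("EXAMPLE/MY_SERVICE")
    .withMetricData(datum);

cw.putMetricData(request);   // cw is an AmazonCloudWatch client, as shown later in this module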

Statistics can be retrieved from CloudWatch instead of raw metric data. When retrieving statistics, define a period, a start time, and an end time. When doing this, be aware that built-in metrics behave differently than custom metric data. For some built-in AWS services, CloudWatch aggregates data into statistics across different dimensions. However, CloudWatch does not do this for custom metrics. For custom data, statistics can only be retrieved using exact matches for the dimensions.
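For instance, a retrieval request built with the Java CloudWatch API might look roughly like this (a sketch; the namespace, metric name, and dimension are placeholders, and Date is java.util.Date):

GetMetricStatisticsRequest request = new GetMetricStatisticsRequest()
    .withNamespace("EXAMPLE/MY_SERVICE")
    .withMetricName("Latency")
    .withDimensions(new Dimension().withName("Stage").withValue("Production"))
    .withStartTime(new Date(System.currentTimeMillis() - 3 * 60 * 60 * 1000))  // 3 hours ago
    .withEndTime(new Date())
    .withPeriod(300)                        // 5-minute aggregation period
    .withStatistics("Average", "Maximum");

GetMetricStatisticsResult result = cw.getMetricStatistics(request);
for (Datapoint dp : result.getDatapoints()) {
    System.out.println(dp.getTimestamp() + " avg=" + dp.getAverage() + " max=" + dp.getMaximum());
}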

Period

A period in CloudWatch is the length of time associated with a statistic. In CloudWatch, you can view data over a time window and display that data grouped into aggregation periods. The time window for your data tells CloudWatch how much data to show. For example, you can choose to view one hour of data, one day of data, or even one month of data. Aggregation periods indicate how small or large a time period each data point represents. Think about measuring how much traffic a service receives by counting requests. You may want to look at the number of requests per hour over the last day. In this case, you would set a time window of 1 day and an aggregation period of 1 hour. When requesting data from CloudWatch, be sure to choose an appropriate time window along with an appropriate aggregation period.

Below is an example metric graph on CloudWatch measuring a service's request count with a time window of 1 day and an aggregation period of 1 hour. Each dot on the graph represents 1 hour of data. You can see that this service gets up to 1.6 million requests per hour. The time window is configured at the top of the graph (the 1d in blue represents 1 day). The aggregation period is configured at the bottom, under the Period column.

Figure 2

Figure 2: A metric graph on CloudWatch measuring a service's request count with a time window of 1 day and an aggregation period of 1 hour.

Percentiles

A percentile shows where a value is relative to the rest of the values in a set. A common use of percentiles is to demonstrate how students perform compared to their peers. For example, suppose a student scored 680 points on a standardized test and is told that they are at the 85th percentile in their state. This means their score of 680 is higher than 85 percent of the students in their state. It also means that 15 percent of the students in the state scored 680 or higher. CloudWatch can return percentiles for metrics with two decimal places of accuracy (pNN.NN). There are three percentiles that tend to be used more often than others though. These are p50, p90, and p99.

Uses of p50

A percentile of p50 is the median of a set of values. A median is the middle value of an ordered set of values. Note that this is not the same as the average, which is computed by adding all the values together and dividing by the number of values in the set. The median only shows that half of the values are below it and half are at or above it. Refer to Figure 3 below, which shows how to calculate the median compared to the average for a set of data.

Figure 3

Figure 3: A comparison of p50 and average for a small set of values.
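As a quick worked example with made-up numbers: for the values 2, 3, 5, 8, and 100, the middle value of the sorted set is 5, so p50 = 5, while the average is (2 + 3 + 5 + 8 + 100) / 5 = 23.6. The single large value pulls the average far above what a typical value looks like, but barely moves the median.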

Reducing skewed stats with p90 and p99

The other two commonly used percentiles, p90 and p99, are used to see just how far away from the median the highest values are. Metrics often measure things like response times and how many requests come in at one time. Data sets like this often include a small number of extremely high values, outliers or noise, that can skew the statistics. When talking about these kinds of situations, it is helpful to describe what the typical case looks like, so you want to remove the extremely high values from consideration since they represent the extraordinary cases. Percentiles like p90 and p99 do just that. The p90 is the value that 90 percent of all values fall under. If you want to make sure an even higher percentage is under a given threshold, p99 will show what value 99 percent of the values are under. The remaining 1 percent could be caused by external factors, noise, or even incorrect measurements.
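To build intuition, here is a small, self-contained sketch that computes percentiles with the nearest-rank method on made-up latency data. CloudWatch computes percentiles internally, so this is only an illustration of the concept, not how CloudWatch itself does it:

import java.util.Arrays;

public class PercentileExample {
    // Nearest-rank percentile: the smallest value such that at least p percent
    // of the values are less than or equal to it.
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil((p / 100.0) * sorted.length);  // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical latencies in milliseconds, including one outlier
        double[] latencies = {102, 110, 98, 120, 105, 99, 101, 2500, 107, 104};
        System.out.println("p50 = " + percentile(latencies, 50));  // 104.0, a typical request
        System.out.println("p90 = " + percentile(latencies, 90));  // 120.0, ignores the outlier
        System.out.println("p99 = " + percentile(latencies, 99));  // 2500.0, includes the outlier
    }
}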

Latency

When monitoring website performance, you usually want to know how long each request takes to generate a response. This is known as latency. However, latency can be affected by more than just your service. Network issues occasionally make some latency measurements fall outside of the normal range of response times. To account for this when looking at maximum latency times, it's helpful to compare them to the p90 and p99 values. The p90 value for latency over a time period tells you that 90 percent of your requests over that period fell under that time. The p99 says the same for 99 percent of your requests. If your p99 is well within your predefined limits, then you may be able to attribute the remaining 1 percent to external issues, such as the network. If your p99 is over the limit, there may be something problematic happening that needs investigation. If your p90 value is over the limit, then even more requests are seeing slow times, which could indicate more widespread problems that need to be resolved.

Alarms

While not covered here, CloudWatch includes alarms that can take actions or alert people if thresholds are exceeded. Alarms are just one way that the metrics can be put to use by taking advantage of all the capabilities built into CloudWatch.
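Purely as a preview (alarms are not covered in detail in this module), an alarm can be created with the same CloudWatch API used for metrics. Everything here, the alarm name, namespace, metric, and threshold, is a placeholder:

PutMetricAlarmRequest alarmRequest = new PutMetricAlarmRequest()
    .withAlarmName("ErrorCount-too-high")
    .withNamespace("EXAMPLE/MY_SERVICE")
    .withMetricName("ErrorCount")
    .withStatistic(Statistic.Sum)
    .withPeriod(300)                        // evaluate 5-minute aggregation periods
    .withEvaluationPeriods(1)
    .withThreshold(50.0)                    // alarm if more than 50 errors in a period
    .withComparisonOperator(ComparisonOperator.GreaterThanThreshold);

cw.putMetricAlarm(alarmRequest);            // cw is an AmazonCloudWatch client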

Metric lifecycle

Like everything else in software development, metrics have a lifecycle, as depicted in Figure 4. Knowing the lifecycle can help determine how to collect and store metrics, how often to collect metrics, and what metrics may already be available for a given use case.

Figure 4

Figure 4: The CloudWatch metric lifecycle

Stages

Built-in metrics are emitted from AWS Services as soon as those services are used. Custom metrics must have publishing code implemented and deployed before metrics will begin showing up in CloudWatch. A metric's lifecycle begins in design. Usually business or operations need to track some aspect of an application. A developer writes code to emit the metric and commits it to the code repository. However, this code must be deployed before it can start emitting metric data.

The deployment pipeline takes in the developer's changes, builds the application along with other developers' updates, and deploys the code to a test environment. At this point, data can be emitted. A metric typically has a dimension specifying whether the data is coming from a test environment or from production. This keeps the data separate but still allows for testing to verify that the metric is being emitted properly. It also allows the application to be tested against limits that the metrics may be measuring.

CloudWatch stores metrics for a predetermined amount of time before aggregating them. Over time, the minimum period available for querying that aggregated data increases until the data is 15 months old. At that point, CloudWatch clears the expired data; anything newer than 15 months remains available.

Aggregation Periods

CloudWatch aggregates data when it is published. As time passes, the minimum available aggregation period grows. For example, data with an aggregation period under 1 minute is available for 3 hours. After that, it can be retrieved as aggregated data with a period as small as 60 seconds. Once the data is 15 days old, it can only be retrieved with an aggregation period of at least 300 seconds (5 minutes). Refer to the table below, which shows how long the various aggregation periods are available.

Aggregation Period Availability
< 60 seconds 3 hours
60 seconds (1 minute) 15 days
300 seconds (5 minutes) 63 days
3600 seconds (1 hour) 455 days (15 months)

Built-in vs custom metric lifecycles

The lifecycles of built-in and custom metrics have a few differences. Some have been mentioned above, but let's review them again. First, only some built-in metrics can be aggregated across dimensions. For custom metrics, the dimensions in a search must match the metric data exactly. Second, only custom metric data with a storage resolution of 1 second supports periods under a minute. Third, custom metrics must be coded and deployed before they start emitting to CloudWatch. Built-in metrics will publish data to CloudWatch when the corresponding AWS Service is used.
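For reference, the storage resolution mentioned above is set on each metric datum when publishing. A minimal sketch, with a placeholder metric name and value:

MetricDatum highResDatum = new MetricDatum()
    .withMetricName("Latency")
    .withUnit(StandardUnit.Milliseconds)
    .withValue(42.0)
    .withStorageResolution(1);   // 1 = high resolution, supporting sub-minute periods;
                                 // the default of 60 is standard resolution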

Summary

This reading covered how metrics allow you to monitor performance of your code and AWS Services. You learned that CloudWatch is Amazon's metric management service. You also read about terminology and concepts used in CloudWatch, including metrics, aggregations, statistics, periods, and percentiles. In future readings, you will learn about the CloudWatch console and how to create your own metrics in Java.

The AWS CloudWatch Console

What is the CloudWatch console?

The Amazon Web Services (AWS) CloudWatch console is a web interface to view and manage the data in CloudWatch. Using the console, you can create and view dashboards with important metric data. You can also search and filter the metric data. While not covered here, the CloudWatch console also allows you to create alarms to react when metrics exceed thresholds.

The console homepage

You can access the CloudWatch console by first logging into an AWS account through the AWS console. Then select CloudWatch from the Services menu to be dropped at the CloudWatch console homepage. The homepage displays some basic snapshot information for services to which you have access. An example of the homepage is shown below in Figure 1. Note that your homepage may look different based on your customizations, what services you are accessing, and the version of the console you are using.

Figure 1

Figure 1: The AWS CloudWatch Console Homepage

AWS Services

The first thing presented on the homepage is a listing of AWS Services and the alarms that have been triggered by them. For this lesson we will ignore the alarms. You can click on the service names that appear on the homepage to jump to a dashboard for that service.

Metrics

In the toolbar on the left are links to specific types of data available in CloudWatch. One of the most important for searching for specific data is the Metrics link. This will take you to an empty graph with a search and filter interface. From here, you can search for metrics and display them on the graph.

Finding a metric by name

This example walks through finding a specific metric for a server on the console.

  1. Log into CloudWatch

     Figure 2.1

     Figure 2.1: Step 1, log into CloudWatch

  2. Click on Metrics

     Figure 2.2

     Figure 2.2: Step 2, click on Metrics in the toolbar

  3. Select the Namespace for your service

     Figure 2.3

     Figure 2.3: Step 3, select the namespace for the service

  4. Select the dimension to search by

     Figure 2.4

     Figure 2.4: Step 4, select the dimension to search by

  5. Check the metric name to view its data

     Figure 2.5

     Figure 2.5: Step 5, check the metric you want to view

  6. View the data and adjust the graph range, if needed

     Figure 2.6

     Figure 2.6: Step 6, view data and adjust graph range

Accessing Statistical Data

This example walks through retrieving specific percentile statistics for a large metric data set and changing the aggregation period.

  1. Log into CloudWatch
  2. Click on Metrics
  3. Select the Namespace for your service
  4. Select the dimension to search by
  5. Select the metric name to view data for. For this example, we are looking at the latency for AmazonReturnService. The graph below is currently showing the AmazonReturnService's p90 latency metric over the last 3 hours with a 5 minute aggregation period.
Figure 2.7

Figure 2.7: Viewing p90 latency metric for AmazonReturnService.

We decide we want to see latency metric data aggregated by the minute instead of 5 minutes, so we update the period to be 1 minute.

Figure 2.8

Figure 2.8: Changing the aggregation period to 1 minute.

Below is the updated graph with 1 minute metrics. You'll notice that there are more points and spikes in the graph; because we're now looking at metrics aggregated by 1 minute, there are more data points in this 3 hour time window than when the metrics were aggregated over 5 minute periods.

Figure 2.9

Figure 2.9: Updated graph with 1 minute aggregation period.

Now we decide that we want to see the p99 latency instead of p90, so we update the statistic of the metric to p99.

Figure 2.10

Figure 2.10: Changing the statistic to p99.

Here is the updated graph with p99 metrics. Note that the values are higher than the p90 metrics, which makes sense since this represents the latency values that 99% of the requests fall under instead of 90%.

Figure 2.11

Figure 2.11: Updated graph with p99 metric.

Summary

This reading covered how to log into the CloudWatch console and access metrics data. It also walked through finding data for a specific service and retrieving statistical data for a metric over a specific aggregation period. The next reading will cover publishing custom metrics in Java.

Creating Custom Metrics

Publishing custom vs. built-in metrics

While metric data for built-in AWS Services is automatically published to CloudWatch just by using a service, you must manually publish custom metrics if you want to track anything not covered by the AWS Services running your applications. This reading walks you through how to use the AWS CloudWatch API for Java to publish metric data to CloudWatch.

How to emit a custom metric in Java

As we learn about emitting metrics in Java, let's follow an example through from requirement to metric collection in CloudWatch.

New business requirement

Mary is a developer on the Amazon Alexa team. She just finished working on a project to create a new Amazon Alexa skill that will translate sentences into another language. Now that the application is running in production, the business owner told the team that he would like weekly reports of how long the service takes to translate. Mary has been tasked with designing and implementing the code to solve this request.

Metric design process

The first thing Mary does is write up the request from the business owner with her proposed solution. Having just worked on the application, Mary knows there is a class SpeechTranslator with a translateSentence method that acts as an entry point for the application. In her proposal, she suggests adding a method to this class to publish a metric to CloudWatch. All she has to do is add code at the beginning and end of the translateSentence method to measure and report the time it takes the service to process a translation. The application already has a namespace assigned to use in CloudWatch (EXAMPLE/ALEXA_TRANSLATOR). So, Mary just needs a name for this metric. She proposes a metric named "TRANSLATION_TIME" to hold this information.

Next, Mary schedules a design review for her proposal with the rest of the team. The business owner says it would be helpful to know the languages being translated and see metrics for each separate language pairing. Upon hearing this, Mary suggests using two dimensions to help separate the data, LANGUAGE_TO and LANGUAGE_FROM. This allows the same metric name to be used while keeping the different translations separate. The team and the business owner approve Mary's design.

Once it is approved, Mary gets to work writing the code, running a code review, and committing her code changes for beta testing. In beta testing, QA verifies that Mary's code sends the TRANSLATION_TIME metric each time the application translates a request. They also test to ensure that this is the only situation in which the application reports this metric.

Implement the metric

With her newly approved design, Mary sets out to add the necessary code to the SpeechTranslator class.

Import the CloudWatch classes

First, she adds imports for the CloudWatch classes she needs. These all live under the com.amazonaws.services.cloudwatch package.

import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.MetricDatum;
import com.amazonaws.services.cloudwatch.model.PutMetricDataRequest;
import com.amazonaws.services.cloudwatch.model.PutMetricDataResult;
import com.amazonaws.services.cloudwatch.model.StandardUnit;
Initialize the CloudWatch client

An AmazonCloudWatch object must be initialized before metrics can be sent to CloudWatch. Building these objects can be resource intensive. With this in mind, Mary sets this up to initialize once when the class is initialized. Then the class can reuse the same object whenever publishing metric data.

final AmazonCloudWatch cw =
  AmazonCloudWatchClientBuilder.defaultClient();
Build the request

Mary also adds a method to the SpeechTranslator class that builds and sends the metric request to CloudWatch. It uses the previously mentioned namespace, metric name, and dimensions, and sets the unit to milliseconds. This will add some context to the metric in CloudWatch for anyone looking at the numbers and wondering what the units are. Once the metric request is ready, it can be sent using the AmazonCloudWatch object. Note that a response is returned. Using this PutMetricDataResult object, metadata like the HTTP response code can be used to verify that the metric data was successfully sent to CloudWatch.

public void reportTranslationTime(double timeInMilliseconds, String languageFrom, String languageTo) {
    Dimension dimension1 = new Dimension()
      .withName("LANGUAGE_TO")
      .withValue(languageTo);

    Dimension dimension2 = new Dimension()
      .withName("LANGUAGE_FROM")
      .withValue(languageFrom);

    MetricDatum datum = new MetricDatum()
      .withMetricName("TRANSLATION_TIME")
      .withUnit(StandardUnit.Milliseconds)
      .withValue(timeInMilliseconds)
      .withDimensions(dimension1, dimension2);

    PutMetricDataRequest request = new PutMetricDataRequest()
      .withNamespace("EXAMPLE/ALEXA_TRANSLATOR")
      .withMetricData(datum);

    PutMetricDataResult response = cw.putMetricData(request);
}
Measuring the data

One last thing Mary has to do is measure the time and add a call to her new reportTranslationTime method. She does this by recording the time at the start of the translateSentence method and then, just before the return, computing the elapsed time and calling reportTranslationTime.

public String translateSentence(String sentence, String languageFrom, String languageTo) {
  long startTime = System.currentTimeMillis();
  String translatedSentence;

  // Omitted implementation

  long endTime = System.currentTimeMillis();
  double timeInMilliseconds = endTime - startTime;
  reportTranslationTime(timeInMilliseconds, languageFrom, languageTo);

  return translatedSentence;
}

Deploy the metric emitting code

Now that Mary is finished implementing and has committed her code, she can sit back and admire her work, right? Not quite. The metric emitting code is in the codebase, but that codebase must go through the deployment process before it will start sending data to CloudWatch. However, Mary's code passes code review and is deployed through the test environments without incident. The day finally comes where this new metric code is deployed to production. Mary and the business sponsor can log into CloudWatch and verify data has been emitted and collected. Now that the code is working in production and the business sponsor is satisfied, Mary can finally give herself a pat on the back for a job well done. She was able to take a request from the business, turn that into a new metric design, implement that metric, and shepherd the code through the deployment process all the way up to a production release.

Measuring counts

The example above emitted a latency metric for the translateSentence operation. Another common metric to measure is an error count measuring the number of times an operation fails. Let's log an error metric for every time translateSentence encounters a TranslationException.

The team has already built a MetricsPublisher class which sets up a CloudWatch client and has an addMetric method that builds the common metric data for us, so we'll use this to log the count metric. Here's its implementation:

/**
 * Publishes the given metric to CloudWatch.
 *
 * @param metricName name of metric
 * @param value value of metric
 * @param unit unit of metric.
 */
public void addMetric(final String metricName, final double value, final StandardUnit unit) {

    final MetricDatum datum = new MetricDatum()
        .withMetricName(metricName)
        .withUnit(unit)
        .withValue(value)
        .withDimensions(service, marketplace);

    final PutMetricDataRequest request = new PutMetricDataRequest()
        .withNamespace("EXAMPLE/ALEXA_TRANSLATOR")
        .withMetricData(datum); 

    cloudWatch.putMetricData(request);
}

Now, we'll call the addMetric method above to log the error count metric called "TranslateErrorCount" in translateSentence:

public String translateSentence(String sentence, String languageFrom, String languageTo) {
  long startTime = System.currentTimeMillis();
  String translatedSentence = null;
  try {
      // Omitted implementation
  } catch (TranslationException e) {
     metricsPublisher.addMetric("TranslateErrorCount", 1, StandardUnit.Count);
     // Omitted handling of exception
  }
  // Omitted translation time metric implementation
  return translatedSentence;
}

In the code snippet above, we added the metricsPublisher.addMetric call in the catch block to log an error metric whenever translateSentence encounters a TranslationException. For the addMetric call, we passed in "TranslateErrorCount" for the metric name, 1 as the metric value (since the error count should increase by 1 every time we get the TranslationException), and StandardUnit.Count as the unit. If we deploy this code to production, we would be able to see the number of translateSentence errors by viewing our service's "TranslateErrorCount" metric on CloudWatch, using the "Sum" statistic.

Emitting zeros

With the above code change, we can now keep track of the number of errors from translateSentence. However, we commonly need to know the error rate: the percentage of calls that result in an error. Our current implementation of the "TranslateErrorCount" metric only logs an error metric when we see an exception in translateSentence. To measure the error rate, we also need to know when translateSentence doesn't throw an exception, i.e., when the operation completes successfully.

We track this using the same "TranslateErrorCount" metric as above, but we emit a zero when translateSentence returns successfully as well as emitting a one when it fails.

public String translateSentence(String sentence, String languageFrom, String languageTo) {
  long startTime = System.currentTimeMillis();
  String translatedSentence = null;
  try {
      // Omitted implementation
  } catch (TranslationException e) {
     metricsPublisher.addMetric("TranslateErrorCount", 1, StandardUnit.Count);
     // Omitted handling of exception (assumed to rethrow or return early, so the
     // zero-value metric below is only emitted when the translation succeeds)
  }
  metricsPublisher.addMetric("TranslateErrorCount", 0, StandardUnit.Count);
  // Omitted translation time metric implementation
  return translatedSentence;
}

In the code snippet above, you can see that we added a 0 count metric before returning to keep track of when translateSentence completes successfully. Now, when we deploy this code to production, we can see the error rate by viewing the "TranslateErrorCount" metric in CloudWatch with the "Average" statistic set. As mentioned in the first reading, the Average statistic sums all the metric values and then divides by the sample count. Since we're logging a 1 count metric for every failed call and a 0 count metric for every successful call, the Average statistic would show translateSentence's error rate. Note that this metric still also tracks the number of errors using the same "Sum" statistic as before.

By simply logging the extra zero metric, we were able to get more useful metric data for our "TranslateErrorCount" metric! Logging both one and zero value metrics like we did above is a common way to measure data such as error rates. Other than the value of the metric, emitting a zero value metric is no different than logging any other metric.
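As a quick check on the arithmetic with hypothetical numbers: if translateSentence is called 1,000 times in a period and 20 of those calls throw a TranslationException, the metric receives 20 values of 1 and 980 values of 0. The Sum statistic is 20 (the error count), the Sample Count is 1,000, and the Average is 20 / 1,000 = 0.02, a 2 percent error rate.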

Here's a basic example of creating a custom metric in Java using the AWS SDK:

// Create CloudWatch client
AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.standard()
        .withRegion(Regions.US_WEST_2)
        .build();

// Define metric data
MetricDatum datum = new MetricDatum()
    .withMetricName("RequestLatency")
    .withUnit(StandardUnit.Milliseconds)
    .withValue(42.0)
    .withDimensions(new Dimension()
        .withName("ServiceName")
        .withValue("LoginService"));

// Create request and publish metric
PutMetricDataRequest request = new PutMetricDataRequest()
    .withNamespace("MyApplication")
    .withMetricData(datum);

PutMetricDataResult response = cloudWatch.putMetricData(request);

When implementing metrics, always consider:

  • Using meaningful namespaces and dimensions for proper organization
  • Selecting appropriate units for your measurements
  • Emitting zeros for count metrics to enable accurate statistics calculation
  • Choosing the appropriate aggregation period based on your monitoring needs

Working with Percentiles

Percentiles are powerful statistics that help understand the distribution of your metric values:

  • p50 (median): 50% of the data points are below this value
  • p90: 90% of the data points are below this value
  • p99: 99% of the data points are below this value

Percentiles are particularly useful for monitoring latency and response times, as they highlight the user experience better than averages.

Summary

In this reading, you followed Mary as she took a new requirement and updated an existing codebase to emit a new metric to meet the requirement. You saw how to use the CloudWatch API to implement emitting a metric in Java. You also learned about zero-value metrics and where you might use them. With this knowledge, you should be ready to implement metrics in your own code.

Guided Project

Resources