Monitoring SageMaker & Bedrock in AWS

Use these guidelines to cover your team from A to Z.

Last week we kicked off our second series on monitoring AI (this time in AWS) and taking on management of SageMaker and Bedrock. Once again, I’m going to illustrate how it’s not THAT different from what your IT team is doing today.

This week we’ll get more specific, just like when we covered what you monitor for Cloud Desktops, why - and what to do about it.

So let’s get into it - here are my top 5 things that you’re monitoring in AWS AI environments!

1. Latency

Why: Latency is the number one (non-cost) killer for workloads in any environment. We’re using AI to make inferences/decisions and take actions on your behalf, so the sooner the model is able to act, the better. If your use case is external, the user experience is also impacted more by latency than anything else.

What to do about it: Step 1 - look to Model Latency and Endpoint Latency in SageMaker and/or Response Latency in Bedrock. This will identify where you’re bottlenecked and whether you’re seeing throttling issues via competition for resources. Step 2 - set as many CloudWatch alarms as your heart desires! If you see something go wrong once, what do IT pros do? Do their best to make sure it never happens again.
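
To make that concrete, here’s a minimal sketch of a latency alarm - the endpoint name, threshold and SNS topic are placeholders you’d swap for your own:

```python
# Minimal sketch: alarm when average ModelLatency on a SageMaker endpoint stays high.
# Endpoint name, variant, threshold, and SNS topic ARN are all placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="sagemaker-model-latency-high",            # hypothetical alarm name
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",                           # reported in microseconds
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},   # placeholder endpoint
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=300,                      # evaluate in 5-minute windows...
    EvaluationPeriods=3,             # ...breached for 3 windows in a row
    Threshold=500_000,               # 500 ms expressed in microseconds; tune to your use case
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder SNS topic
)
```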

Why else is this important: handling latency without increasing costs proves that you’re managing the resources effectively. This is the equivalent of not simply doubling VM sizes to try to resolve runaway memory issues.

The pace of a poorly built model…

2. Throughput

Why: Throughput is literally output, aka outcomes. Data dork alert: we’re going to talk storage performance…

Everyone focuses on IOPS, but the real metric is throughput. Throughput ceilings are what turn into latency (part of the reason latency is #1 above). With SageMaker and Bedrock, it’s actually easier to understand than classic storage conundrums - your throughput is how fast your model delivers insights/results.

What to do about it: Step 1 - again, look to your CloudWatch resources. This will identify where you’re bottlenecked, but WHEN you face throughput challenges is something that you can plan for more than you can with latency. Looking to Samples Processed and dividing by Job Duration will get you your realistic performance data.
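
A rough sketch of that math, assuming a completed training job and a sample count you already know from your own pipeline (the job name is a placeholder):

```python
# Minimal sketch: realistic training throughput = samples processed / job duration.
# Assumes the job has completed; the job name and sample count are placeholders.
import boto3

sagemaker = boto3.client("sagemaker")

job = sagemaker.describe_training_job(TrainingJobName="my-training-job")  # placeholder name
elapsed_s = (job["TrainingEndTime"] - job["TrainingStartTime"]).total_seconds()

samples_processed = 1_200_000  # from your own data pipeline, not a CloudWatch metric
print(f"{samples_processed / elapsed_s:,.1f} samples/sec over {elapsed_s:,.0f}s")
```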

Monitoring Invocations will show you how many inferences (AI model decisions) are processed per second, and when. Logging API calls in CloudTrail will help you identify usage patterns and trends that point to root causes.
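
Here’s a sketch of both lookups for a real-time SageMaker endpoint - the endpoint name and time window are placeholders:

```python
# Minimal sketch: invocations per second for an endpoint, plus recent SageMaker API
# activity from CloudTrail. Endpoint name and time window are placeholders.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudtrail = boto3.client("cloudtrail")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# Sum of Invocations per 5-minute period, converted to requests/sec.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},   # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=start,
    EndTime=end,
    Period=300,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Sum"] / 300:.2f} invocations/sec')

# Recent API calls against SageMaker, for spotting usage patterns and trends.
events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventSource", "AttributeValue": "sagemaker.amazonaws.com"}],
    StartTime=start,
    EndTime=end,
    MaxResults=50,
)
for event in events["Events"]:
    print(event["EventTime"], event["EventName"])
```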

Why else is this important: again, handling demand without increasing costs proves that you’re managing the resources effectively. This is the equivalent of tolerating Disk Queue Length piling up rather than buying more expensive storage.

Make sure you have enough “lanes” to keep your model flowing smoothly!

3. Error Rate

Why: It doesn’t matter how fast you move if you’re doing things incorrectly. Failed requests or tasks mean user frustration or inaccurate predictions/returns.

What to do about it: Step 1 - again, look to your CloudWatch resources for SageMaker results like Failed or Stopped within TrainingJobStatus. On the Bedrock side they’re a little more blunt in that you’re looking for named error types like Model Errors, Throttling Errors and/or Client Errors. This will highlight where resource constraints or data/script errors caused the model to stop or fail. Either way, the user (be it an individual human, a customer or an internal team) definitely isn’t seeing the result they expected.
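
If your model is serving a real-time endpoint, the same idea shows up as invocation error metrics; here’s a sketch of turning them into an actual error rate (the endpoint name and window are placeholders):

```python
# Minimal sketch: error rate for a SageMaker endpoint = (4XX + 5XX errors) / invocations.
# Endpoint name and time window are placeholders.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)
dims = [
    {"Name": "EndpointName", "Value": "my-endpoint"},   # placeholder
    {"Name": "VariantName", "Value": "AllTraffic"},
]

def metric_sum(name: str) -> float:
    """Total of one metric over the window (0.0 if there were no datapoints)."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=name,
        Dimensions=dims,
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(p["Sum"] for p in resp["Datapoints"])

invocations = metric_sum("Invocations")
errors = metric_sum("Invocation4XXErrors") + metric_sum("Invocation5XXErrors")
if invocations:
    print(f"Error rate over the last 24h: {errors / invocations:.2%}")
```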

The real key is to keep tabs on this over time so that you understand if your model is going in the right or wrong direction.

Why else is this important: If something isn’t working right, you don’t want your external stakeholders to be the ones to find out first. Note: just like last month, this is my single biggest gripe with Copilot in Excel - it’s ALWAYS erroring out and it can’t tell me anything about why or what I should be doing about it. If your use case is internal, increased error rates indicate that the team managing the underlying model needs to make some changes. You’re not only keeping things running smoothly; you’re now the AI watchdog your executives reward.

Namely, by setting up alerts and monitoring so that it doesn’t happen again!
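Something like this, for instance - a sketch with placeholder names that fires on any 5XX error from an endpoint:

```python
# Minimal sketch: alert whenever a SageMaker endpoint starts returning 5XX errors.
# Alarm name, endpoint, and SNS topic are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="sagemaker-5xx-errors",                     # hypothetical alarm name
    Namespace="AWS/SageMaker",
    MetricName="Invocation5XXErrors",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},    # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",            # any 5XX in a 5-minute window
    TreatMissingData="notBreaching",                      # no traffic = no alarm
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder SNS topic
)
```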

4. TPU/GPU/CPU/RAM Utilization

Why: Sometimes, raw processing power is necessary. It doesn’t matter how well you manage SageMaker or Bedrock if you’re asking a marathon runner to win a powerlifting contest.

What to do about it: This is the part you’re going to be most familiar with - resource consumption. Once again, CloudWatch is your friend! Here’s the breakdown…

TPU: If you’re using AWS Trainium, this is where you’ll be looking in CloudWatch for SageMaker. On the Bedrock side, resource consumption is AWS-managed. The lower the consumption/the more idle it is, the more you’re overpaying.

GPU/vRAM: This is the heavy-lifting, heavy-rendering resource. This is effectively allowing your SageMaker workload to do as much as possible at the same time. Once again, the lower the consumption, the more you’re overpaying. On the Bedrock side, again it’s all managed on the AWS side.

Pro tip re: Bedrock - instead of monitoring as you typically would, use Invocations Per Second and/or AWS Cost Explorer as a quasi-tracking mechanism. Invocation volume and spend will trend with surges in underlying resource consumption.
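
Here’s a sketch of that quasi-tracking. I’m assuming Bedrock’s runtime metrics land in the AWS/Bedrock namespace and that it shows up in Cost Explorer under the service name “Amazon Bedrock”; the model ID and dates are placeholders:

```python
# Minimal sketch: track Bedrock usage indirectly via invocation counts and daily spend.
# Model ID, dates, and the Cost Explorer service name are assumptions/placeholders.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
ce = boto3.client("ce")   # Cost Explorer

end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

# Hourly invocation counts for one model (assumed AWS/Bedrock namespace + ModelId dimension).
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="Invocations",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=3600,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]), "invocations")

# Daily Bedrock spend over the same window (Cost Explorer takes date strings).
costs = ce.get_cost_and_usage(
    TimePeriod={"Start": start.strftime("%Y-%m-%d"), "End": end.strftime("%Y-%m-%d")},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Bedrock"]}},  # assumed service name
)
for day in costs["ResultsByTime"]:
    print(day["TimePeriod"]["Start"], day["Total"]["UnblendedCost"]["Amount"], "USD")
```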

CPU/RAM: This is still required - some things never change! There’s no mystery here.
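
A sketch of pulling those utilization numbers for an endpoint - I’m assuming the /aws/sagemaker/Endpoints namespace here (training jobs publish similar metrics under /aws/sagemaker/TrainingJobs), and the endpoint name is a placeholder:

```python
# Minimal sketch: average CPU, memory, and GPU utilization for a SageMaker endpoint.
# Assumes the /aws/sagemaker/Endpoints namespace; endpoint/variant names are placeholders.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

for metric in ["CPUUtilization", "MemoryUtilization", "GPUUtilization", "GPUMemoryUtilization"]:
    resp = cloudwatch.get_metric_statistics(
        Namespace="/aws/sagemaker/Endpoints",
        MetricName=metric,
        Dimensions=[
            {"Name": "EndpointName", "Value": "my-endpoint"},   # placeholder
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    if points:
        avg = sum(p["Average"] for p in points) / len(points)
        print(f"{metric}: {avg:.1f}% average over the last 6 hours")  # consistently low = overpaying
```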

Why else is this important: At the end of the day, businesses are all about making money. Driving the outcomes you need is important, but autoscaling models can and will scale up and up indefinitely if allowed. Executives are insisting that organizations use AI to drive their business forward, but cost governance is new and poorly understood. Simply staying on top of this makes you the IT Legend that evolves as quickly as the landscape does.

Don’t be in “head in the sand” mode re: performance!

5. Duration

Why: How long something takes matters. If your use case is an external chatbot, then having a response take an hour is obviously unacceptable. If you’re an internal team, an 8+ hour delay is going to mean anything you do that day won’t be available to you until tomorrow. Odds are that’s not the pace at which you want to get results!

What to do about it: Look to CloudWatch for the Training Job Status in SageMaker, and then look to the status/timestamps. If you’re automating this, you’d subtract the Start Time from the Completed time, and (if necessary) turn to CloudWatch Logs for details like “out of memory” errors. If you’re using hyperparameter tuning, looking to Hyperparameter Tuning Job Duration will help you fine-tune the model and improve this, but if your Data Science team has built something custom then they’re going to have a larger exercise on their hands. Anything over 500ms is the realistic threshold for an end user starting to notice performance degradation. If it’s a complex prompt or ask, 2 seconds is a better threshold.
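
If you’re automating that subtraction across recent jobs, a sketch might look like this (the duration threshold is a placeholder, and CreationTime-to-TrainingEndTime is wall-clock time including provisioning, not pure training time):

```python
# Minimal sketch: flag recent SageMaker training jobs that ran unusually long or failed.
# The threshold is a placeholder; CreationTime -> TrainingEndTime is wall-clock duration.
import boto3

sagemaker = boto3.client("sagemaker")
MAX_HOURS = 8   # placeholder threshold for "too long"

jobs = sagemaker.list_training_jobs(SortBy="CreationTime", SortOrder="Descending", MaxResults=20)
for job in jobs["TrainingJobSummaries"]:
    status = job["TrainingJobStatus"]
    if status in ("Failed", "Stopped"):
        print(f'{job["TrainingJobName"]}: {status} - check CloudWatch Logs for the reason')
    elif status == "Completed" and "TrainingEndTime" in job:
        hours = (job["TrainingEndTime"] - job["CreationTime"]).total_seconds() / 3600
        if hours > MAX_HOURS:
            print(f'{job["TrainingJobName"]}: took {hours:.1f}h (over the {MAX_HOURS}h target)')
```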

For Bedrock, you can benefit from tracking Prompt Processing Time. The longer the model has to think, the more the user feels it. If Response Latency is also high in CloudWatch, that’s a correlated issue. Bedrock really is more of a PaaS service, so you’re more tracking API and token consumption elements than specific underlying elements.
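
For the CloudWatch side of that, here’s a sketch assuming the AWS/Bedrock namespace exposes an InvocationLatency metric per model (the model ID and window are placeholders):

```python
# Minimal sketch: hourly average invocation latency for one Bedrock model.
# Assumes the AWS/Bedrock namespace and InvocationLatency metric; model ID is a placeholder.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InvocationLatency",   # reported per invocation, in milliseconds
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=3600,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.0f} ms average')
```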

Why else is this important: A forecasting app that takes more than a day to generate a report will always leave your SVP of Sales frustrated by wait times. An end user facing an issue and forced to wait an hour for a response will spend most of that hour looking up how quickly your competition can get them the same answer.

Your users can’t take it anymore!

At this point you should have a much better understanding of how SageMaker and Bedrock AI workloads are something that your team can take on in AWS. There’s no time like the present to get started, and just like last week there’s less of a mystery than ever about what you’ll need to do once you get your hands on it!
