Monitoring VertexAI in GCP
Not really this, though…
Last week we started our series on GCP - specifically taking on management of Vertex AI and how it’s not THAT different from what your IT team is doing today.
This week we’ll get more specific, just like when we covered what you monitor for Cloud Desktops, why - and what to do about it.
So let’s get into it - here are my top 5 things that you should be monitoring in Vertex AI environments!
1. Latency
Why: Latency is the number one (non-cost) killer for Vertex AI workloads. You’re using AI to make inferences/decisions and take actions on your behalf, so the sooner the model can act, the better. If your use case is external, latency also impacts the user experience more than anything else.
What to do about it: Step 1 - look to your Cloud Logging resources. This will identify where you’re bottlenecked and where you’re facing resource contention (multiple teams or end users hitting the same thing at the same time). Step 2 - see where you can drop the complexity. IF (and this is a big IF) you can reduce the layers in your model, this delivers a big impact. Step 3 is more approachable - reduce your batch/sample size or the amount of data being processed at once.
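If you want to pull those latency numbers programmatically rather than eyeballing the console, here’s a minimal sketch using the Cloud Monitoring Python client. The project ID is a placeholder, and the metric type is Vertex AI’s published online-prediction latency metric as best I recall it - verify the exact name in Metrics Explorer before wiring this into anything.

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/your-project-id"  # placeholder

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},  # last hour
    }
)

results = client.list_time_series(
    name=project_name,
    filter='metric.type="aiplatform.googleapis.com/prediction/online/prediction_latencies"',
    interval=interval,
    view=monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
)

for series in results:
    # prediction_latencies is a distribution metric; the mean of the newest
    # point gives a quick read on where each deployed model sits.
    newest = series.points[0]
    print(dict(series.resource.labels), newest.value.distribution_value.mean)
```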
Why else is this important: handling latency without increasing costs proves that you’re managing the resources effectively. This is the equivalent of not simply doubling VM sizes to try to resolve runaway memory issues.
Don’t let your model slow to a crawl…
2. Throughput
Why: Throughput is literally output, aka outcomes. In storage performance, IOPS only translates into real throughput when latency isn’t getting in the way (the reason latency is #1 above). With Vertex AI, it’s actually easier to understand - your throughput is how fast your model delivers insights/results.
What to do about it: Step 1 - again, look to your Cloud Logging resources. This will identify where you’re bottlenecked, but WHEN you face throughput challenges is something you can plan for more readily than you can with latency. Are you always facing issues at a certain time of day? If your use case is internal and not customer-facing, you can consider asynchronous inferencing (basically, forming a queue of things to be processed rather than delivering real-time results to everyone using your model). This is a design decision - another option is to reduce your batch/sample size or the amount of data being processed at once.
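For that asynchronous route, Vertex AI’s batch prediction jobs are the built-in way to form the queue. Here’s a hedged sketch using the google-cloud-aiplatform SDK - the model ID, bucket paths, and region are all placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

model = aiplatform.Model(
    "projects/your-project-id/locations/us-central1/models/your-model-id"
)

# The job runs offline and writes results to GCS when it finishes, so callers
# pick up output on their own schedule instead of waiting on a live response.
batch_job = model.batch_predict(
    job_display_name="nightly-scoring",
    gcs_source="gs://your-bucket/input/requests.jsonl",
    gcs_destination_prefix="gs://your-bucket/output/",
)
# Blocks until the job completes by default; results land under the output prefix.
```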
Why else is this important: again, handling demand without increasing costs proves that you’re managing the resources effectively. This is the equivalent of working through a growing Disk Queue Length rather than simply buying more expensive storage.
How fast can your model go? Make sure you have enough lanes to keep things flowing smoothly!
3. Error Rate
Why: It doesn’t matter how fast you move if you’re doing things incorrectly. Failed requests or tasks mean user frustration or inaccurate predictions/returns.
What to do about it: Step 1 - again, look to your Cloud Logging resources for results like “invalid” or “internal error” or “exhausted”. Tracking this over time will allow you to understand if your model is going in the right or wrong direction.
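Here’s a minimal sketch of that log scan using the Cloud Logging Python client. The resource type in the filter is my assumption about how your endpoint logs are labeled - check what the Logs Explorer shows for your project and adjust.

```python
from google.cloud import logging

client = logging.Client(project="your-project-id")  # placeholder project

# Error-severity entries for Vertex AI endpoints (resource type is an assumption).
log_filter = (
    'resource.type="aiplatform.googleapis.com/Endpoint" AND severity>=ERROR'
)

for entry in client.list_entries(filter_=log_filter, max_results=50):
    # Bucket these by message ("invalid", "internal error", "exhausted")
    # to see which failure mode is trending over time.
    print(entry.timestamp, entry.severity, entry.payload)
```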
Why else is this important: If something isn’t working right, you don’t want your external stakeholders to be the ones to find out first. Note: this is my single biggest gripe with Copilot in Excel - it’s ALWAYS erroring out. If your use case is internal, increased error rates indicate that the team managing the underlying model needs to make some changes. You’re not only keeping things running smoothly; you’re now the AI watchdog your executives reward.
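If you want to be that watchdog without staring at dashboards all day, a Cloud Monitoring alert policy can page you when errors climb. A hedged sketch - the metric type is the Vertex AI error count metric as best I recall it, and the threshold and window are purely illustrative:

```python
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="Vertex AI error-rate watchdog",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Prediction error count spiking",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type="aiplatform.googleapis.com/prediction/online/error_count" '
                    'AND resource.type="aiplatform.googleapis.com/Endpoint"'
                ),
                # error_count is a delta metric, so align it to a rate first.
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period={"seconds": 300},
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
                    )
                ],
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=1.0,  # errors/sec - tune to your traffic
                duration={"seconds": 300},
            ),
        )
    ],
)

created = client.create_alert_policy(
    name="projects/your-project-id", alert_policy=policy
)
print(created.name)
```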
Model team, thy name is 404…
4. TPU/GPU/CPU/RAM Utilization
Why: Sometimes, raw processing power is necessary. It doesn’t matter how well you manage Vertex AI if you’re trying to win a NASCAR race on a bicycle.
What to do about it: This is the part you’re going to be most familiar with - resource consumption. Cloud Monitoring is your friend! Here’s the breakdown (with a query sketch after the list)…
TPU: This is the ML-heavy, data processing component. The lower the consumption/the more idle it is, the more you’re overpaying.
GPU/vRAM: This is the heavy-lifting, heavy-rendering resource. This is effectively allowing your Vertex AI workload to do as much as possible at the same time. The lower the consumption, the more you’re overpaying.
CPU/RAM: This is still required - some things never change! There’s no mystery here.
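Here’s the promised query sketch for pulling accelerator and CPU utilization from Cloud Monitoring. The two metric types are the Vertex AI utilization metrics as best I recall them - confirm the exact names in Metrics Explorer, since idle accelerators are exactly where the overpaying happens.

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/your-project-id"  # placeholder

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

# Metric names are assumptions based on Vertex AI's published metrics list.
metric_types = [
    "aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle",  # GPU/TPU busy fraction
    "aiplatform.googleapis.com/prediction/online/cpu/utilization",         # CPU busy fraction
]

for metric_type in metric_types:
    results = client.list_time_series(
        name=project_name,
        filter=f'metric.type="{metric_type}"',
        interval=interval,
        view=monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    )
    for series in results:
        latest = series.points[0].value.double_value  # newest point first
        print(metric_type, dict(series.resource.labels), latest)
```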
Why else is this important: At the end of the day, businesses are all about making money. Driving the outcomes you need is important, but autoscaling models can and will scale up and up indefinitely if allowed. Executives are insisting that organizations use AI to drive their business forward, but cost governance is new and poorly understood. Simply staying on top of this makes you the IT Legend that evolves as quickly as the landscape does.
Don’t be this person!
5. Duration
Why: How long something takes matters. If your use case is an external chatbot, then having a response take an hour is obviously unacceptable. If you’re an internal team, an 8+ hour delay means anything you run today won’t be available until tomorrow. Odds are that’s not the pace at which you want to get results!
What to do about it: Look to Cloud Monitoring for the duration data points and (if necessary) to Cloud Logging for log details like “out of memory” errors. Using default hyperparameter tuning jobs can help fine-tune the model and improve this, but if your Data Science team has built something custom, then they’re going to have a larger exercise on their hands.
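For the hyperparameter tuning route, here’s a hedged sketch using the google-cloud-aiplatform SDK. The training image, bucket, and parameter names are placeholders - your training container defines what actually gets tuned, and it has to report the metric being optimized.

```python
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(
    project="your-project-id",
    location="us-central1",
    staging_bucket="gs://your-bucket/staging",
)

# The trial worker: your training container (placeholder image).
custom_job = aiplatform.CustomJob(
    display_name="forecast-trainer",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/your-project/trainer:latest"},
    }],
)

tuning_job = aiplatform.HyperparameterTuningJob(
    display_name="forecast-tuning",
    custom_job=custom_job,
    metric_spec={"loss": "minimize"},  # trainer must report "loss"
    parameter_spec={
        "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
        "batch_size": hpt.DiscreteParameterSpec(values=[32, 64, 128], scale="linear"),
    },
    max_trial_count=20,
    parallel_trial_count=4,
)
tuning_job.run()  # blocks until all trials finish
```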
Why else is this important: A forecasting app that takes more than a day to generate a report will always leave your SVP of Sales frustrated by wait times. An end user facing an issue and forced to wait an hour for a response will spend most of that hour looking up how quickly your competition can get them the same answer.
I can’t take it anymore!
At this point you should have a much better understanding of how Vertex AI workloads are something that your team can take on. There’s no time like the present to get started with Vertex AI, and there’s less of a mystery than ever about what you’ll need to do once you get your hands on it!