Monitoring Vertex AI in GCP - Part 2
A deeper dive…
Last week we continued our series on GCP, covering what you’ll be monitoring and, more importantly, why. We also covered the most important thing - what you’re going to do about any of the alerts you see!
So let’s get into it - here’s the second set of 5 things you should be monitoring in Vertex AI environments!
6. Model Drift
Why: Similar to error rate, this results in a poor experience for both internal and external users. Instead of straight-up errors, you’ll see inaccurate responses and predictions.
What to do about it: Look to the Model Monitoring capabilities in the Vertex AI UI. If your internal Application (or Product) team insists the foundation is reliable and there’s no need to retrain the model, then it’s time to reach out to your GCP contacts or open a ticket to connect with GCP subject matter experts.
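If you’d rather set up drift detection in code than click through the UI, here’s a minimal sketch using the google-cloud-aiplatform SDK’s model monitoring helpers. The project, endpoint ID, feature name, threshold, and email address below are all placeholder assumptions - swap in your own, and double-check the config options against the current SDK docs.

```python
from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

# Placeholders: use your own project, region, and endpoint ID.
aiplatform.init(project="your-project", location="us-central1")
endpoint = aiplatform.Endpoint("1234567890")

# Sample a share of live prediction requests for analysis.
sampling = model_monitoring.RandomSampleConfig(sample_rate=0.8)

# How often (in hours) the job checks for drift.
schedule = model_monitoring.ScheduleConfig(monitor_interval=1)

# Who gets notified when a drift threshold is crossed.
alerting = model_monitoring.EmailAlertConfig(user_emails=["ml-team@example.com"])

# Flag drift when an input feature's distribution shifts beyond a
# threshold; "customer_segment" is a hypothetical feature name.
objective = model_monitoring.ObjectiveConfig(
    drift_detection_config=model_monitoring.DriftDetectionConfig(
        drift_thresholds={"customer_segment": 0.05}
    )
)

job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="drift-watch",
    endpoint=endpoint,
    logging_sampling_strategy=sampling,
    schedule_config=schedule,
    alert_config=alerting,
    objective_configs=objective,
)
print(job.resource_name)
```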
Why else is this important: This will illustrate what your previous responses to similar queries were, and you’ll be able to tell if your model is starting to return responses your stakeholders don’t want to see.
Pictured: your model, drifting and spinning out…
7. Model URL (Endpoint) Downtime
Why: In plain terms, this is the URL your models are deployed behind - effectively the address your applications call to get predictions. If it’s down, any applications your various teams have built won’t be able to reach the models you’ve built.
What to do about it: Check the Cloud Monitoring metrics. If you’re seeing that it’s down, a DevOps (or ProdOps or similar) team can redeploy the endpoint, and Application (or Product) teams will need to quantify the impact and determine whether any changes in the application are needed to support the new endpoint. Important - if it’s not your issue, open a ticket with GCP IMMEDIATELY. Either way, don’t forget to dig into the logs to discover the root cause of the issue.
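To see whether the endpoint is actually failing (and how badly), here’s a quick sketch that pulls its online prediction error counts from Cloud Monitoring with the google-cloud-monitoring library. The project and endpoint IDs are placeholders, and it’s worth confirming the exact metric and label names in your own Metrics Explorer first.

```python
import time

from google.cloud import monitoring_v3

PROJECT_ID = "your-project"   # placeholder
ENDPOINT_ID = "1234567890"    # placeholder Vertex AI endpoint ID

client = monitoring_v3.MetricServiceClient()

# Look at the last hour of data.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# Count of failed online predictions against the endpoint.
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "aiplatform.googleapis.com/prediction/online/error_count" '
            f'AND resource.labels.endpoint_id = "{ENDPOINT_ID}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)
```

If the error counts line up with the downtime reports, that’s your cue to redeploy - and to start digging through the logs for the root cause.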
Why else is this important: If the foundation of your applications isn’t working, then none of the applications built on that foundation are working properly.
Take care of the building blocks, or else…
8. Log Anomalies (Surges in Logging)
Why: If you’re seeing surges in logs, something is happening. It could be a deployment issue - or, if you haven’t shipped an update (or similar) in a while, something else is causing a real spike in activity and you need to investigate.
What to do about it: Like we did with so many of the top 5 last week, look to your Cloud Logging resources. This time, we’re looking for one of two things: error loops, where one error kicks off one or several more, or an actual surge in problematic usage. Next, look to Security Command Center - a native toolset that categorizes and classifies risk and compares usage against standard controls.
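As a starting point, here’s a rough sketch using the google-cloud-logging library to count the last hour of error-level entries and surface the most-repeated messages (a common signature of an error loop). The project ID is a placeholder, and the resource filter is an assumption you’d adjust to match how your Vertex AI logs are labeled.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="your-project")  # placeholder

# Pull the last hour of ERROR-and-above entries. The resource.type
# shown here assumes Vertex AI endpoint logs; adjust as needed.
cutoff = (datetime.now(timezone.utc) - timedelta(hours=1)).isoformat()
log_filter = (
    'resource.type = "aiplatform.googleapis.com/Endpoint" '
    f'AND severity >= ERROR AND timestamp >= "{cutoff}"'
)
entries = list(client.list_entries(filter_=log_filter))
print(f"{len(entries)} error entries in the last hour")

# Crude loop detector: the same message repeating hundreds of times
# in a short window often means one failure is triggering others.
for message, count in Counter(
    str(entry.payload)[:120] for entry in entries
).most_common(3):
    print(count, message)
```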
Why else is this important: At best, this is an indicator your model is getting used a LOT more all of a sudden. But don’t high-five yourself too early - this could also be an indicator that your model is being spammed. In a world where a chatbot is allowed to train itself to better support end users, a surge could be automated traffic that trains your customer service resources in exactly the wrong direction.
Those logs are piling up there…
9. API Anomalies (Surges in Usage)
Why: Similar to number 8 above, a surge in usage is an indicator that something is happening and you need to investigate.
What to do about it: Use Chronicle (Google’s SIEM toolset) and/or Security Command Center to review the surge in usage. Chronicle will stockpile the logs and cross-reference them to look for causes, while Security Command Center (again) handles risk categorization and comparison against standard controls.
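If you want an automated tripwire rather than a manual review, here’s a sketch of a Cloud Monitoring alert policy on the Vertex AI API request count, using the google-cloud-monitoring library. The project ID and threshold are placeholders - you’d tune the threshold to your own baseline and attach notification channels so someone actually hears about it.

```python
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

# Fire when calls to the Vertex AI API sustain an abnormal rate
# for five minutes straight.
policy = monitoring_v3.AlertPolicy(
    display_name="Vertex AI API request surge",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="aiplatform.googleapis.com request rate",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type = "serviceruntime.googleapis.com/api/request_count" '
                    'AND resource.type = "consumed_api" '
                    'AND resource.labels.service = "aiplatform.googleapis.com"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=50,  # requests/second; tune to your baseline
                duration={"seconds": 300},
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period={"seconds": 300},
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
                    )
                ],
            ),
        )
    ],
)

created = client.create_alert_policy(
    name="projects/your-project",  # placeholder
    alert_policy=policy,
)
print(created.name)
```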
Why else is this important: As with number 8, at best this means your model is suddenly getting used a LOT more - and at worst it means your model is being spammed. In this case, though, it’s less about operational errors and more about other issues, such as a tremendous number of failed logins.
And when log surges and API surges show up together, the issues multiply and compound.
Uh oh, runaway API!
10. Cost Surges
Why: Let’s be honest about something… stakeholders and shareholders demand results. As we covered in number 4, AI tools and models deliver results, but if the costs outweigh the outcomes, that’s a problem. Every project has a budget, and staying on top of that budget makes you a responsible IT leader.
What to do about it: GCP’s Cloud Billing capabilities will show you if you’re spending more than the budget controls you’ve set, and will help you pinpoint what’s causing your costs to spike - letting you either rein things in or advocate for expanded budgets for your end users or internal teams.
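Those budget controls can be created in code, too. Here’s a minimal sketch using the google-cloud-billing-budgets library - the billing account, project ID, and dollar amount are placeholders, and the 50/90/100% thresholds are just a sensible starting point.

```python
from google.cloud.billing import budgets_v1
from google.type import money_pb2

client = budgets_v1.BudgetServiceClient()

budget = budgets_v1.Budget(
    display_name="vertex-ai-monthly-budget",
    # Scope the budget to the project hosting your Vertex AI workloads.
    budget_filter=budgets_v1.Filter(projects=["projects/your-project"]),
    amount=budgets_v1.BudgetAmount(
        specified_amount=money_pb2.Money(currency_code="USD", units=5000)
    ),
    # Send alerts at 50%, 90%, and 100% of budgeted spend.
    threshold_rules=[
        budgets_v1.ThresholdRule(threshold_percent=pct)
        for pct in (0.5, 0.9, 1.0)
    ],
)

created = client.create_budget(
    request={
        "parent": "billingAccounts/000000-000000-000000",  # placeholder
        "budget": budget,
    }
)
print(created.name)
```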
Why else is this important: If that expanded budget isn’t an option, then for better or worse you can expect to have some tough conversations with Product and/or Application teams. If Finance doesn’t budge at first, ally with your stakeholders to quantify the impact. If you charge for your solution, has there been an increase in revenue? Has there been a leap forward in any way that can be attributed to internal usage? Attaching results to these initiatives will allow you to collectively go back to the Finance team with a revised request. This way, you’re an IT leader associated with driving outcomes as opposed to one that’s watching over science projects.
At this point you should have an even better understanding of how Vertex AI workloads are something your team can take on. There’s no time like the present to get started with Vertex AI, and for the second week in a row there’s less of a mystery than ever about what you’ll need to do once you get your hands on it!
Next week we’ll dig into another intriguing concept - how we go about charging for AI products built out in Vertex AI!
Cheers to what’s next!