Monitoring Month - Secondary Metrics

Details, details, details…

Welcome back to the second post of Monitoring Month! Hopefully you picked up some practical tips from last week’s fundamentals. Also, if you didn’t know what User Input Delay and Round Trip Time were, you’re welcome! If they’re still fuzzy, stop here, go back and dig deep - those two really are your best friends. So, what’s next?

If you’ve looked at resource consumption on a VM and the end user experience metrics and are still coming up flummoxed, there are more things you can go look at! The “Frames Skipped” counters are a trio of can’t-miss metrics that help point to secondary offenders, and then there’s the ALWAYS-forgotten storage layer to consider. Hint: there’s a reason seasoned admins pay for Premium SSD on their session hosts, especially when those hosts run in multi-user mode!

Frames Skipped – Insufficient Network Resources

This indicates the average number of frames skipped due to a lack of network resources for the period indicated. 

Sample alerting thresholds:

  • Warning: 5+ frames for 5 consecutive minutes

  • Critical: 10+ frames for 5 consecutive minutes

A high value indicates that the network (for example, the VNET in Azure) does not have enough throughput to handle the load placed on it, which is why skipped frames show up in this metric.

Resolving High Frames Skipped - Insufficient Network Resources: First, consider adjusting any time interval function available to change the duration of the data displayed. This can give you a sense of when the issue originally began and if there is a recurring pattern. If the issue is consistent, then consider increasing the bandwidth available to the network.
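As a rough illustration of how a “5+ frames for 5 consecutive minutes” rule could be evaluated, here is a minimal Python sketch. It assumes you already have one averaged frames-skipped value per minute from whatever monitoring tool you use. The function name `sustained_level` and its parameters are hypothetical and not part of any product.

```python
from typing import Sequence


def sustained_level(
    per_minute: Sequence[float],
    warning: float = 5,
    critical: float = 10,
    minutes: int = 5,
) -> str:
    """Classify the most recent `minutes` samples against sustained thresholds.

    `per_minute` is assumed to hold one averaged value per minute, oldest first.
    """
    if len(per_minute) < minutes:
        return "ok"  # not enough history to call anything "sustained"

    window = per_minute[-minutes:]
    # Every sample in the window must meet the threshold, not just the average.
    if all(value >= critical for value in window):
        return "critical"
    if all(value >= warning for value in window):
        return "warning"
    return "ok"
```

The same function can be reused for the other two Frames Skipped metrics, since the sample thresholds are identical. A single dip below the threshold resets the streak, which is what “consecutive” implies here.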

Frames Skipped – Insufficient Client Resources

This indicates the number of frames skipped due to lack of resources available on the user’s local device for the period indicated. 

Sample alerting thresholds:

  • Warning: 5+ frames for 5 consecutive minutes

  • Critical: 10+ frames for 5 consecutive minutes

A high value indicates that the user’s virtual desktop may see a performance impact due to high CPU and RAM consumption on the user’s local device. This may result in “it feels slow” complaints or “I can’t work” reports in extreme scenarios.

Resolving High Frames Skipped – Insufficient Client Resources: Yes, this can be a thing! First, consider adjusting any time interval function available to change the duration of the data displayed. This can give you a sense of when the issue originally began and if there is a recurring pattern. Next, check what monitoring elements are available on the local device (another monitoring service, Task Manager, etc.) to see what is causing the excessive resource consumption.

If the issue is consistent, then consider a more holistic approach to performance monitoring on the end user’s device. Troubleshooting and reducing CPU/RAM consumption locally should resolve this session host-level alert for their user session. While this is a worst-case scenario, the user may need a new, upgraded device to connect from. This could be especially true if their device is older, refurbished, heavily modified or all of the above.

Frames Skipped – Insufficient Server Resources

This indicates the number of frames skipped due to insufficient server resources for the period indicated. 

Sample alerting thresholds:

  • Warning: 5+ frames for 5 consecutive minutes

  • Critical: 10+ frames for 5 consecutive minutes

This comes third out of the three, in my opinion, because the problem should already have been visible between CPU, RAM and User Input Delay. A high value indicates that resource consumption on the VM itself is resulting in reduced performance in user sessions. This should be easy to confirm, as CPU, RAM or both should indicate high consumption as well. This may result in “it feels slow” complaints or “I can’t work” reports in extreme scenarios.

Resolving High Frames Skipped – Insufficient Server Resources: First, consider adjusting any time interval function available to change the duration of the data displayed. This can give you a sense of when the issue originally began and if there is a recurring pattern. Troubleshooting what is consuming excessive CPU/RAM should reduce the values displayed here. If the issue is consistent, then consider adding additional CPU/RAM to relieve this bottleneck.

Storage Usage %:

This indicates the average % consumed of the available disk space for the period indicated. While a straightforward metric, don’t miss it - excessive storage consumption can lead to c r a w l i n g slow user sessions without firing any CPU/RAM alerts.

Sample alerting thresholds:

  • Warning: 75+% (but less than 90%) for 2 consecutive hours

  • Critical: 90+% for 2 consecutive hours

A high value indicates that the data stored on the disk is nearing the total amount of disk space available. Best practices suggest staying under 90% of the available disk to avoid performance impacts, with the impact growing as you get closer to 100% consumption. Some backup programs can report errors when attempting to back up a disk that is more than 85% full.

Resolving High Storage Usage %: There are two options for resolving scenarios where the Managed Disk storage consumed is nearing the amount of storage provisioned for a Managed Disk.

  • Attempt to clean out data that is no longer used/relevant - this represents the zero-cost option. Examples include clearing out cache or temp data, clearing out the recycle bin, removing installers for applications that are no longer needed, etc.

  • Increase the size of the Managed Disk – while this represents a larger cost, it leaves room for growth and provides additional storage space/performance.
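If you want a quick spot check outside of your monitoring tool, the percentage itself is easy to compute. The sketch below uses Python’s standard `shutil.disk_usage` and applies the sample 75%/90% thresholds from this post. The helper names are hypothetical, and it only takes a single reading, so it does not cover the “for 2 consecutive hours” part of the rule.

```python
import shutil


def usage_percent(path: str) -> float:
    """Return the percentage of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return 100.0 * usage.used / usage.total


def storage_level(percent: float, warning: float = 75.0, critical: float = 90.0) -> str:
    """Map a usage percentage onto the sample warning/critical thresholds."""
    if percent >= critical:
        return "critical"
    if percent >= warning:
        return "warning"
    return "ok"
```

On a Windows session host you would pass something like `"C:\\"` as the path, for example `storage_level(usage_percent("C:\\"))`.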

Disk Queue Length:

This indicates the average number of IO actions waiting for the disk for the period indicated.  This is an example of where low storage consumption could still yield a storage-related performance issue.

Sample alerting thresholds:

  • Critical: Greater than 5 for 5 consecutive minutes

  • Warning: Greater than 2 (up to and including 5) for 5 consecutive minutes

A high value indicates that a large number of IO requests are being made against the storage system and are waiting their turn.

Resolving High Disk Queue Length: When disk queue length is a frequent bottleneck, consider increasing the performance tier of your managed disk. If you are already using Premium SSD, you can review the System event log on the machine to see if there are any errors indicating problems with the disk or the storage subsystem, and potentially open a ticket with Microsoft to resolve anything you find there.

OS Disk Reads/second

This indicates the average number of disk read operations for the period indicated. 

Sample alerting thresholds:

  • Critical: Not Set

  • Warning: Not Set

  • As this metric is effectively IOPS, this metric is often not alerted on – IOPS are a highly relative data point that is largely used as a reference for other, related metrics

A high value indicates that there is a lot of read activity on the disk.

Resolving High OS Disk Reads/second: First, consider adjusting any time interval function available to change the duration of the data displayed. This can give you a sense of when the issue originally began and if there is a recurring pattern. If the issue is consistent and/or maxing out what is available at your current tier, then consider upgrading the performance tier of your storage.

OS Disk Writes/second

This indicates the average number of disk write operations for the period indicated. 

Sample alerting thresholds:

  • Critical: Not Set

  • Warning: Not Set

  • As this metric is effectively IOPS, this metric is often not alerted on – IOPS are a highly relative data point that is largely used as a reference for other, related metrics

A high value indicates that there is a lot of write activity on the disk.

Resolving High OS Disk Writes/second: First, consider adjusting any time interval function available to change the duration of the data displayed. This can give you a sense of when the issue originally began and if there is a recurring pattern. If the issue is consistent and/or maxing out what is available at your current tier, then consider upgrading the performance tier of your storage.
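One way to put a number on “maxing out” is to compare the combined read and write operations per second with the IOPS provisioned for the disk. The sketch below is only that comparison. The function name is hypothetical, and `provisioned_iops` is a value you would look up for your own disk size and tier; no specific limits are assumed here.

```python
def iops_utilization(
    reads_per_sec: float,
    writes_per_sec: float,
    provisioned_iops: float,
) -> float:
    """Return combined read + write IOPS as a percentage of the provisioned limit."""
    if provisioned_iops <= 0:
        raise ValueError("provisioned_iops must be a positive number")
    # Assumption: reads and writes count against one shared IOPS limit.
    return 100.0 * (reads_per_sec + writes_per_sec) / provisioned_iops
```

A result that sits near 100% during the periods when users complain is a reasonable hint that the tier, rather than the workload, is the bottleneck.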

OS Disk Read Bytes/second

This indicates the average amount of data read from the disk per second for the period indicated.

Sample alerting thresholds:

  • Critical: Not Set

  • Warning: Not Set

  • As this metric is effectively disk throughput, it is often not alerted on - like IOPS, throughput is a highly relative data point that is largely used as a reference for other, related metrics

A high value indicates that a lot of data is being read from the disk.

Resolving High OS Disk Read Bytes/second: First, consider adjusting any time interval function available to change the duration of the data displayed. This can give you a sense of when the issue originally began and if there is a recurring pattern. If the issue is consistent and/or maxing out what is available at your current tier, then consider upgrading the performance tier of your storage.

OS Disk Write Bytes/second

This indicates the average amount of data written to the disk per second for the period indicated.

Sample alerting thresholds:

  • Critical: Not Set

  • Warning: Not Set

  • As this metric is effectively disk throughput, it is often not alerted on - like IOPS, throughput is a highly relative data point that is largely used as a reference for other, related metrics

A high value indicates that a lot of data is being written to the disk.

Resolving High OS Disk Write Bytes/second: First, consider adjusting any time interval function available to change the duration of the data displayed. This can give you a sense of when the issue originally began and if there is a recurring pattern. If the issue is consistent and/or maxing out what is available at your current tier, then consider upgrading the performance tier of your storage.
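The bytes/second and operations/second metrics are most useful when read together: dividing one by the other gives the average IO size. That tells you whether the disk is busy with many small operations (more likely to hit an IOPS limit) or fewer large ones (more likely to hit a throughput limit). Here is a minimal sketch of that arithmetic, with a hypothetical function name:

```python
def average_io_size_kib(bytes_per_sec: float, ops_per_sec: float) -> float:
    """Return the average size of one IO operation in KiB.

    Pair read bytes with reads/second, or write bytes with writes/second,
    taken over the same time period.
    """
    if ops_per_sec <= 0:
        return 0.0  # no operations in the period, so no meaningful average
    return bytes_per_sec / ops_per_sec / 1024
```

For example, 4,194,304 bytes/second at 512 operations/second works out to an average of 8 KiB per operation.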

This is a LOT to work through - there’s no disguising that. However, the benefits of having a layered monitoring approach are numerous - you can go FAR deeper than basic Task Manager work, which is invaluable when you’re helping end users who’ve had it up to HERE!

You hear me, corporate issued refurb? It’ll be straight to the moon with you!

So, you can happily say okay… NOW GET BACK TO WORK! ;)
