This is the first hands-on scenario post in our series on using IT Analytics to analyze the performance of our Private Cloud at Bay Dynamics. My goal with these posts is for you to see what I see as I "eat my own cooking" and use our IT Analytics cubes to gain insight into what's happening with our virtual infrastructure.
Where do we begin? The first thing I want to look for is patterns in the SCOM Alerts, to see if there's anything I should be concerned about. I don't really know what I'm looking for; this is more of a discovery exercise to explore the Alerts cube and see where that takes us.
I start by opening the SCOM Alerts cube and dragging and dropping some measures and attributes, and before long I end up with the view below:
Note that I added the Entity Count and Alert Count measures, which show me the total number of entities generating alerts that match the selected criteria, as well as the total number of Alerts. I also added the Alert Severity on Columns, then filtered down to Alerts that happened in 2011 only and added the Month name to the rows.
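For readers who think in code, the view described above amounts to grouping alert rows by month and severity, then counting alerts and distinct entities in each group. Here is a minimal Python sketch of that idea. It is not how the cube is implemented, and the row layout, the `pivot` helper, and the sample values are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical alert rows: (month, severity, entity).
# These values are invented for illustration, not real data.
alerts = [
    ("March", "Critical", "vm-01"),
    ("April", "Critical", "vm-01"),
    ("April", "Critical", "vm-02"),
    ("April", "Warning", "vm-02"),
    ("April", "Critical", "vm-01"),
]

def pivot(rows):
    """Return {(month, severity): (entity_count, alert_count)}."""
    alert_count = defaultdict(int)
    entities = defaultdict(set)
    for month, severity, entity in rows:
        key = (month, severity)
        alert_count[key] += 1        # every row is one alert
        entities[key].add(entity)    # the set keeps each entity once
    return {key: (len(entities[key]), alert_count[key]) for key in alert_count}

summary = pivot(alerts)
```

Entity Count is a distinct count while Alert Count is a plain row count, which is why the two measures can grow at very different rates.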
The first thing I note is that the number of entities creating critical alerts grew substantially from March to April, with an associated growth in alerts. But from April to May, the number of entities generating critical alerts grew by only 50%, yet about six times as many Alerts were generated. That's an interesting discovery. I wonder why? Now I want to know which Management Pack is generating all of these alerts to see if that gives me any additional insight, so I drag in MP Name from my field list and sort the Alert Count descending.
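The April-to-May jump is easier to appreciate as alerts per entity. The small Python sketch below works through the arithmetic using only the approximate figures quoted above (entities up by 50%, alerts up about sixfold). The function name is hypothetical.

```python
def per_entity_growth(entity_growth, alert_growth):
    """Growth factor in alerts per entity, given both growth factors."""
    return alert_growth / entity_growth

# Entities grew by 50% (a factor of 1.5); alerts grew about sixfold.
ratio = per_entity_growth(1.5, 6.0)
print(ratio)  # 4.0
```

In other words, each alerting entity produced roughly four times as many critical alerts in May as in April, which points to something noisier than simply having more machines.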
I can see that the bulk of these Alerts came from the nworks VMware MP from our friends at Veeam. Now that I've found something I want to dig into, I'd like to break down the alerts from that nworks MP further, so I right-click on that MP name, choose Filter Selection, and drag that into the filter fields next to my year. I then drag the MP Object Name onto the Rows area to break down that nworks MP one level deeper.
Very interesting: most are VM Balloon Memory issues. When did those happen? I drag in the Alert Date:
So now I can see the days on which they occurred: a few days at the beginning of May, then nothing after May 10th. Is that right? Why the abrupt stop? Either we remediated the issues or data isn't flowing properly. Now I'm worried that I haven't seen data since May 10th. After I remove the MP Object Name:
OK, I'm still getting Alerts from that MP up to today. I feel better about the health of my SCOM Agents picking up data from the nworks MP, so now I can go back to worrying about the memory issue. Are any particular machines responsible? I add the Object Name back in, filter on the VM Balloon Memory object, and filter to show only critical alerts. After some dragging and dropping I arrive at:
Lots of VMs, clearly not isolated to just a few. I'd like to visualize this a bit better and see just my top alert generators in a chart:
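The chart is essentially a top-N ranking of alert counts per VM. As a rough illustration of that ranking step only (the cube viewer does this for you), here is a minimal Python sketch. The VM names, the sample list, and the `top_generators` helper are hypothetical.

```python
from collections import Counter

# Hypothetical list with one entry per critical balloon-memory alert,
# naming the VM that raised it. Invented values for illustration.
critical_balloon_alerts = ["vm-03", "vm-01", "vm-03", "vm-02", "vm-03", "vm-01"]

def top_generators(entities, n):
    """Return the n entities with the most alerts as (name, count) pairs."""
    return Counter(entities).most_common(n)

top = top_generators(critical_balloon_alerts, 2)
print(top)  # [('vm-03', 3), ('vm-01', 2)]
```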
I want to save that and share it with my VMware Admin. Click save:
Now he can open the cube and see what I see, and we can talk through it together. I'd also like to keep this view handy in Excel, so I flip back to a Table and click the Export to Excel button:
That opens the same view of the cube in Excel, and I can continue my exploration from here and save the results as an Excel worksheet on my laptop to review in more detail later.
That wraps up this adventure. We covered a lot of ground in a short time, all using ad-hoc cube analysis without writing a single SQL query. I started out with only the question "how am I doing?" in mind, and as I explored the cube I uncovered valuable insight into my virtual machine behavior over the last month. This type of discovery is a core value of leveraging IT Analytics Cubes as part of a well-managed Private Cloud.