Getting slow IT systems up to speed - with data correlation, data analysis and virtualisation

All indicators are green, and it still doesn't work? This article shows how you can get to the bottom of performance problems even then.

The Bank's IT Landscape

The bank's Windows infrastructure includes

  • Active Directory
  • DNS
  • File services
  • SharePoint
  • SQL Server
  • Exchange

The hardware environment consists mainly of HPE components: HPE c7000 blade enclosures with BL460c G7 server blades, SAN connections of 8 Gbit/s and network (LAN) connections of 10 Gbit/s. The storage system is an HPE 3PAR 7450.

The bank operates the infrastructure in two data centres that are more than 90 km apart.

Everything works - apparently

E-mail is one of the bank's mission-critical applications. However, employees complained about poor to very poor response times in the MS Outlook e-mail client. In addition, connections between Outlook and the Exchange server sometimes dropped.

The IT department could not find the cause: all monitoring systems for VMware, Windows Server, Exchange, network and storage showed "green".

Analyse the problem and find the real cause

To fix the problem, management set up a task force and contracted our company for the project. We installed the SightLine® analysis and correlation system to collect data from

  • VMware
  • Windows
  • Exchange
  • Storage
  • Network

The result: there were two causes for the poor performance and the connection failures, and they reinforced each other.

  1. Delay due to suboptimal allocation
    The first cause was the way VMware scheduled the virtual (guest) systems. A guest was only allowed to continue running when as many physical CPUs (pCPUs) were free at the same time as the guest had vCPUs. The virtual Exchange server had been allocated 12 vCPUs and the host system had 12 pCPUs, a ratio of 1:1. Since VMware itself also needs CPU resources for its management activities, 12 pCPUs were rarely free at the same time, so the virtual Exchange server spent long periods waiting. This was also reflected in the VMware "CPU-Ready(ms)" metric, which at times reached 25,000 ms.
  2. Slow storage accesses
    Because of these long waits, the Exchange server appeared inactive for long periods from the storage system's point of view, so the storage system moved the Exchange server's storage area to slow disks. What was an optimisation from the storage system's point of view, however, prolonged the Exchange server's storage accesses. This in turn caused VMware to "put the Exchange server to sleep" again: a spiral of slowdown.
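To put the "CPU-Ready(ms)" figure into perspective, the commonly documented conversion from a ready-time summation in milliseconds to a percentage of the sampling interval can be sketched as follows. The article does not state which chart interval the 25,000 ms reading came from, so the intervals below are the usual vSphere chart defaults and are an assumption here. The function name and the per-vCPU averaging are illustrative.

```python
# Usual vSphere performance-chart sampling intervals in seconds (assumption:
# the article does not say which interval the 25,000 ms reading belongs to).
CHART_INTERVALS_S = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
}


def cpu_ready_percent(ready_ms: float, interval_s: float, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (ms) into a percentage of the interval.

    With vcpus > 1 the result is the average per vCPU, assuming the
    summation value is the total across all vCPUs of the VM.
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return ready_ms / (interval_s * 1000.0) / vcpus * 100.0


if __name__ == "__main__":
    # 25,000 ms and 12 vCPUs are the figures quoted in the article.
    for name, interval in CHART_INTERVALS_S.items():
        total = cpu_ready_percent(25_000, interval)
        per_vcpu = cpu_ready_percent(25_000, interval, vcpus=12)
        print(f"{name:>10}: {total:7.2f}% total, {per_vcpu:6.2f}% per vCPU")
```

If the reading came from the real-time chart (20-second samples), 25,000 ms would correspond to 125% in total, or roughly 10% per vCPU. With a longer interval the percentage is correspondingly lower, which is why the interval matters when judging the raw millisecond value.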

Take targeted measures

The bank reduced the number of vCPUs of the virtual Exchange server from 12 to 8. Subsequent long-term measurements showed a peak utilisation of no more than 60%. As a further measure, the storage area of the Exchange server was excluded from the storage system's automatic "optimisation" and pinned to very fast disks.
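The sizing rule behind the first measure can be expressed as a simple check: a guest should not need as many vCPUs as the host has pCPUs, because the hypervisor needs some headroom for itself. The following is a minimal sketch. The class, the function and the reserve of 2 pCPUs are hypothetical and chosen for illustration only; they are not a VMware recommendation or part of the project described here.

```python
from dataclasses import dataclass


@dataclass
class VmSizing:
    """Hypothetical record describing one guest and its host."""
    name: str
    vcpus: int
    host_pcpus: int


def is_oversized(vm: VmSizing, reserved_pcpus: int = 2) -> bool:
    """Return True if the guest needs more vCPUs than the host can offer
    after setting aside `reserved_pcpus` for the hypervisor.

    The default reserve of 2 is an assumed, illustrative value.
    """
    return vm.vcpus > vm.host_pcpus - reserved_pcpus


# Figures from the article: 12 pCPUs on the host, 12 vCPUs before, 8 after.
before = VmSizing("exchange", vcpus=12, host_pcpus=12)
after = VmSizing("exchange", vcpus=8, host_pcpus=12)

print(is_oversized(before))  # True: 12 vCPUs exceed the 10 pCPUs left over
print(is_oversized(after))   # False: 8 vCPUs fit into the 10 pCPUs left over
```

A check like this only flags the 1:1 allocation pattern. Whether the reduced size is sufficient still has to be confirmed by measurement, as the bank did with its long-term utilisation data.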

The great benefit for the company:

Thanks to the analysis and correlation functions across the entire IT landscape, the performance of all systems, and especially of the virtual Exchange servers, could be restored. The employees can once again exchange information and documents by e-mail quickly and reliably. Overall, the data throughput and thus the work performance of the entire company increased.

VMware also provides a best-practice guide on the topic that is worth reading: "Performance Best Practices for VMware vSphere® 6.0"
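The idea of correlating metrics from different monitoring silos can be illustrated with a small sketch: two time series are aligned on their common timestamps and their Pearson correlation coefficient is computed. This is a generic illustration, not a description of how SightLine works internally. The function names are hypothetical, and the sample values are invented for the example; they are not measurements from the bank.

```python
from math import sqrt


def align(a: dict, b: dict):
    """Return the values of two {timestamp: value} series at shared timestamps."""
    common = sorted(a.keys() & b.keys())
    return [a[t] for t in common], [b[t] for t in common]


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented sample data, keyed by seconds since start (illustrative only).
cpu_ready_ms = {0: 200, 20: 900, 40: 15000, 60: 400}
storage_latency_ms = {0: 2, 20: 4, 40: 30, 60: 3, 80: 2}

ready, latency = align(cpu_ready_ms, storage_latency_ms)
print(f"correlation: {pearson(ready, latency):.3f}")
```

A coefficient close to 1 indicates that the two metrics rise and fall together, which is a hint to look for a common cause even when each metric, viewed in its own tool, stays within its thresholds. It does not by itself show which effect causes the other.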

IT managers will find articles, tips and practical examples on the topics of

  • Monitoring heterogeneous system landscapes
  • Analysis and correlation of all types of monitoring data
  • End-to-end monitoring
  • Industry 4.0
  • Compliance

I look forward to lively participation and discussion!
Yours, Metin Özdiyar-Steffen
- Business Development Manager, Intelligent Solutions GmbH