The following is a set of sample test scenarios to run against a monitoring system in order to evaluate its ability to ensure service levels, reduce MTTR, identify root causes, and prevent outages.
Long asynchronous transactions: Submit 1,000 long asynchronous transactions (with a pause time of several minutes) into the system. Show the path of each one of 20 individual transaction instances as well as their end-to-end response time, the elapsed time they spent on each tier, the time they spent between the tiers, and the amount of CPU they consumed on each tier.
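As a rough illustration of the figures this scenario asks for, the sketch below derives end-to-end time, per-tier elapsed time, and time between tiers from one transaction's trace. The `TierSpan` record, the tier names, and the timings are hypothetical, and the sketch assumes each tier is visited once, in sequence; a real product would expose its own trace format.

```python
from dataclasses import dataclass


@dataclass
class TierSpan:
    tier: str           # hypothetical tier name, e.g. "web"
    start: float        # seconds since the transaction was submitted
    end: float
    cpu_seconds: float  # CPU consumed on this tier


def summarize(spans):
    """Return end-to-end time, per-tier elapsed time, inter-tier gaps, and CPU."""
    ordered = sorted(spans, key=lambda s: s.start)
    return {
        "end_to_end": ordered[-1].end - ordered[0].start,
        "per_tier": {s.tier: s.end - s.start for s in ordered},
        # Gap between leaving one tier and entering the next.
        "between_tiers": {
            f"{a.tier}->{b.tier}": b.start - a.end
            for a, b in zip(ordered, ordered[1:])
        },
        "cpu": {s.tier: s.cpu_seconds for s in ordered},
    }


# Made-up trace: a three-minute pause on the queue tier.
spans = [
    TierSpan("web", 0.0, 0.5, 0.02),
    TierSpan("queue", 1.0, 181.0, 0.01),
    TierSpan("db", 181.5, 183.5, 0.30),
]
report = summarize(spans)
print(report)
```

A product that passes this test should be able to produce an equivalent breakdown for each of the 20 selected instances without manual correlation of logs.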
Lost information due to sampling: A transaction is found to be slow due to contention over shared resources. Use the product to show everything else (e.g. all other transactions, batch activities) that ran at the time of the problem, and figure out which other transactions were disrupting the flow of the slow transaction by using shared components.
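The query behind this scenario can be sketched as follows: given a record of every activity (not a sample), find the ones that overlapped the slow transaction in time and touched at least one of the same components. The `Activity` record, component names, and timings are hypothetical. If the product only samples, some of the entries in `others` would simply be missing, which is the failure this test is designed to expose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    start: float            # seconds on a common clock
    end: float
    components: frozenset   # hypothetical names of shared components used


def contenders(slow, others):
    """Activities that overlapped `slow` in time and used a shared component."""
    result = []
    for other in others:
        overlaps = other.start < slow.end and slow.start < other.end
        shared = slow.components & other.components
        if overlaps and shared:
            result.append((other.name, sorted(shared)))
    return result


slow = Activity("checkout", 100.0, 130.0, frozenset({"orders_db", "app1"}))
others = [
    Activity("nightly_batch", 90.0, 120.0, frozenset({"orders_db"})),  # overlaps, shares
    Activity("login", 105.0, 106.0, frozenset({"auth_db"})),           # overlaps, no share
    Activity("report", 140.0, 150.0, frozenset({"orders_db"})),        # shares, no overlap
]
found = contenders(slow, others)
print(found)
```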
New transaction auto-discovery: Add a new business process transaction class (i.e., not just a Java class but rather a new end-to-end business activity) to the application. Determine if the transaction class is automatically discovered and monitored, or whether it must be manually defined in the BTM product before it is included in SLA monitoring.
Change and capacity planning: Show a breakdown of CPU consumption by transaction across multiple tiers in a shared environment. Make sure that collecting CPU information does not push the product's overhead beyond an acceptable level. Then use the product in conjunction with your existing capacity planning processes to identify the capacity requirements (such as additional CPUs) for a growth scenario. For example, answer a question such as “what capacity is needed across the board to support a 60% increase in the number of login transactions?”
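To make the growth question concrete, here is a back-of-the-envelope sketch of the arithmetic once per-transaction CPU figures are available for each tier. All numbers, tier names, and the `extra_cores` helper are hypothetical. The sketch assumes CPU cost scales linearly with transaction volume and ignores any spare headroom already in place, so treat the output as a starting point for a real capacity plan rather than an answer.

```python
import math


def extra_cores(cpu_ms_per_txn, txn_per_sec, growth, target_utilization):
    """Additional cores per tier needed to absorb `growth` in one transaction class.

    cpu_ms_per_txn: CPU milliseconds one transaction consumes on each tier.
    growth: fractional increase in volume, e.g. 0.6 for 60%.
    target_utilization: utilization ceiling per core, e.g. 0.7 for 70%.
    """
    needed = {}
    for tier, cpu_ms in cpu_ms_per_txn.items():
        # Extra CPU-seconds consumed per wall-clock second = extra busy cores.
        added_cores = cpu_ms / 1000 * txn_per_sec * growth
        needed[tier] = math.ceil(added_cores / target_utilization)
    return needed


# Made-up measurements for the login transaction at 100 logins per second.
login_cpu_ms = {"web": 20, "app": 50, "db": 30}
plan = extra_cores(login_cpu_ms, txn_per_sec=100, growth=0.6, target_utilization=0.7)
print(plan)
```

The point of the test is whether the product can supply the per-transaction, per-tier CPU inputs; the arithmetic itself is simple once those figures exist.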
UAT/release planning: In a user acceptance test (UAT) scenario, compare the detailed performance profile of transactions in the current version of the application with their profile in a new version of the application. How easy is it to do this? Is the information conveniently available in a report or dashboard? Use this information to determine if performance of the new release is acceptable before rolling it out to production.
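A comparison of this kind can be sketched in a few lines if both releases' profiles can be exported. The example below assumes a hypothetical export of mean response time in milliseconds per transaction class and flags anything that slowed by more than a chosen tolerance; the metric, the 10% tolerance, and the function name are illustrative choices, not part of any product.

```python
def regressions(baseline_ms, candidate_ms, tolerance=0.10):
    """Transactions whose candidate time exceeds baseline by more than `tolerance`."""
    flagged = {}
    for txn, old in baseline_ms.items():
        new = candidate_ms.get(txn)
        if new is None:
            continue  # transaction was not exercised in the new release
        change = (new - old) / old
        if change > tolerance:
            flagged[txn] = round(change, 3)
    return flagged


# Made-up mean response times (ms) from the current and new releases.
baseline = {"login": 200, "search": 400, "checkout": 800}
candidate = {"login": 210, "search": 520, "checkout": 790}
flagged = regressions(baseline, candidate)
print(flagged)
```

If producing the two input tables requires significant manual effort, that in itself answers the "how easy is it" question in this scenario.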
Problem isolation #1: Create a scenario where a certain subnet is (mistakenly) excluded from the load balancer configuration file and as a result, traffic from this branch or location always hits the first web server instead of being distributed to the entire web server cluster. As a result, the first web server will be more loaded than the other web servers, potentially causing slowdowns for the transactions that hit this node. What will the product be able to do beyond alerting to the SLA breach? Will it provide a false lead by indicating that this may be a WAN issue? Does it lead you to believe that since the problem is specific to one subnet, it is within the network? Or will it be able to immediately and correctly identify that it is a web server and load balancer issue?
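One signal that points to the load balancer rather than the WAN is that every request from the affected subnet lands on the same web server while other subnets are spread across the cluster. The sketch below checks for that pattern over a list of (subnet, server) observations; the subnet values, server names, and the `min_requests` cutoff are hypothetical.

```python
from collections import Counter, defaultdict


def pinned_subnets(requests, min_requests=10):
    """Subnets whose requests all landed on a single web server."""
    by_subnet = defaultdict(Counter)
    for subnet, server in requests:
        by_subnet[subnet][server] += 1

    pinned = {}
    for subnet, counts in by_subnet.items():
        # Ignore subnets with too little traffic to draw a conclusion.
        if sum(counts.values()) >= min_requests and len(counts) == 1:
            pinned[subnet] = next(iter(counts))
    return pinned


# Made-up traffic: one subnet is balanced across three servers, one is not.
requests = (
    [("10.1.0.0/16", f"web{i % 3 + 1}") for i in range(30)]
    + [("10.2.0.0/16", "web1") for _ in range(30)]
)
result = pinned_subnets(requests)
print(result)
```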
Problem isolation #2: A faulty router is slowing down transactions that go to a certain remote database. When you try to isolate the issue, does the product lead you down the wrong path by indicating that the transactions spent a lot of time on the Java tier (since they are waiting for the database to respond)? Does it lead you down the wrong path by showing that database sessions are exceptionally long, indicating that it is a database issue? Or is the product able to correctly isolate the problem to the internal network between the Java tier and the database tier by showing the excessive inter-tier time?
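The reasoning behind inter-tier time can be sketched as simple subtraction: the time the Java tier waited on a database call, minus the time the database actually spent working on it, is the time spent in between. The sketch assumes the product can report database execution time separately from session duration (session duration would include the network wait and so would be misleading). The function names, the 50% threshold, and the figures are hypothetical.

```python
def inter_tier_ms(caller_wait_ms, callee_exec_ms):
    """Time a call spent between tiers: caller's wait minus callee's actual work."""
    return caller_wait_ms - callee_exec_ms


def likely_culprit(caller_wait_ms, callee_exec_ms, threshold=0.5):
    """Point at the network when most of the caller's wait was spent between tiers."""
    if caller_wait_ms <= 0:
        raise ValueError("caller_wait_ms must be positive")
    gap = inter_tier_ms(caller_wait_ms, callee_exec_ms)
    if gap / caller_wait_ms > threshold:
        return "network between tiers"
    return "callee tier"


# Made-up figures: Java waited 900 ms, the database executed for only 50 ms.
faulty_router = likely_culprit(caller_wait_ms=900, callee_exec_ms=50)
# Contrast: Java waited 900 ms and the database was busy for 850 ms of it.
slow_database = likely_culprit(caller_wait_ms=900, callee_exec_ms=850)
print(faulty_router, "|", slow_database)
```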