At Corvil, we have just recently made the Corvil App Agent generally available for our customers. I took the opportunity to test-drive it and would like to share some of the results.
First of all, why did we release the App Agent? Its main purpose is to equip application developers with a lightweight and minimally invasive means of generating timestamps in their applications, and of logging these to an external system. Our original design goal in creating it was to complement the visibility that is provided by capturing and timestamping network packets. This timestamped network data forms the raw fuel which our platform transforms into streaming machine-time intelligence. Network data has an abundance of advantages as the preferred source of data, and two key ones are that it can be gathered with no impact on the system being instrumented, and in a very scalable fashion. It has no impact on applications and systems because the network can tap and replicate data in parallel in hardware; it is highly scalable because all critical communication goes across the network and can be captured at a single layer.
However, there are occasions where important events happen away from the network and we must take a different approach to capturing and timestamping such events. In particular, many interesting events occur inside software applications; for example, the core functionality of an application rarely deals directly with network hardware and interacts with the outside world through multiple layers of protocol stacks. A web-based application can only start to service a user-request once that request has been handled by the web-server and the URL has been decoded into the actual application action required. An exchange line-handler may forward a trade instruction, but that instruction needs to be rendered into a binary protocol structure which is handed off to the TCP stack before it ever gets out of the machine.
When these extra layers of stack are well-tuned, then your efforts to control and improve the performance of your systems are best spent on either optimizing the application code or the network architecture and its capacity. However, it's easy to overlook the stack and forget that problems can arise there, and this is a great use-case for the App Agent.
So what exactly is the App Agent? It is a small library of code that a developer can add to her C++ or Java application which enables her to log events of interest and, without any further assistance from her, have them published out on the network. Each call to log an event just records the current time and pushes a record of the event onto an internal queue. Another pool of threads takes care of all of the housework of encoding the timestamps and event-records, bundling them into packets, and transmitting them onto the network. The goal is to keep the CPU time required in the thread being instrumented to the minimum possible. The C++ implementation also provides an ultra-lightweight call to just retrieve a timestamp, so that even the work involved in constructing the event record can be deferred to non-critical sections of code.
I won't go into much detail on the API itself, as I want to focus on the results it gave us for a demo application which we used it to instrument. However, for the curious, this is the C++ code I wrote to invoke the API:
auto now = m_agent.get_raw_time(); auto event = m_agent.get_event ( EventType::DATA); event.set_int64_field(TS1, now) .set_text_field(TEXT1, proto) .set_text_field(TEXT2, msg_type) .set_text_field(TEXT3, uuid); m_agent.dispatch(event);
There are just four steps: get a timestamp, ask the API to provide an event record, fill that record with the details of the event as a set of key-value pairs, and finally ask the API to deal with it.
While we have had several customers test-drive pre-GA versions of the App Agent and get valuable insight from it, we're not at liberty to discuss those use-cases. Instead, I'd like to share some concrete results we obtained for a very simple line-handler application. The main work of this application is to convert trade instructions from an internal message-bus into OUCH 4.1, and acknowledgements and executions back again. It must also keep track of the state of each order, and check for various error conditions along the way, handling them gracefully.
A key performance metric for this application is how quickly it can handle these functions, and for this purpose we'll use a CNE (Corvil Network Engine) to capture both sides of the line-handler at the network layer and measure the latency back and forth across it. We see these measurements in a time-series form in Figure 1:
With latencies of around 130us at the 99th percentile, this application could do with some tuning. One easy step to take is to turn on application logging - syslog-style recording of human-readable text with timestamps that can be scanned for obvious errors, or examined in more detail to see where the application is spending its time. Figure 2. shows what happens when we do this:
Ooops. Not only have we totally destroyed the performance profile of the application, which is now about 10 times slower at handling transactions, but more fatally from a forensic point of view, we've completely changed the behavior that we want to understand. This is where the App Agent can help: by using the lightweight timestamp capture mechanisms, we can have our cake and eat it too, in the form of precision visibility into the performance of the application without disturbing its behaviour. I would show the overall performance profile again, except that it’s indistinguishable from the results shown in Figure 1.
When given the challenge of how to reduce the time spent by the application in forwarding the instructions and responses, my instincts as an application developer are to start with the things under my control - namely the code that implements the application logic. There we can do things like looking at reducing the amount of code to be executed, being careful about how we use memory to keep our data resident in-cache, etc.
However, our first step should be to understand where the time is being spent and, to this end, I started with measuring how much of the time is spent in the application logic versus how much time is spent in the network and protocol stacks on either side of the system. Conceptually there are just two instrumentation points: one immediately after receipt of a message, and another just before transmission, so that we tap the perimeter of the application. In practice it takes a few App Agent calls for both protocol APIs, one in the transmit method and one in each of the message-receipt callbacks.
The results are striking: Figure 3. shows a multi-hop view from the Corvil web-based UI, with a breakdown of the time spent at several different layers of the system. The first hop labelled "Bus-Stack" shows the time taken to receive the buy-order from the network into application space, and also later the time taken to transmit the acknowledgement for that order back out to the network. The next-hop labelled "Application" shows the time spent processing the order, checking and updating its state. The third hop shows the time taken to construct the OUCH message and transmit the binary representation of it onto the network. For completeness, and also to close the loop visually, we show the time taken for the exchange to acknowledge the order.
This multi-hop view, which marshals network data and App Agent data into a coherent picture of the various processing stages makes it immediately obvious that any time spent in tuning the performance of the application logic would be largely wasted. The 25us the application spent handling the order is dwarfed by the 110us spent in the OUCH stack, and the 200us spent in the internal stack.
Of course, the next question is how much of the time spent in the stacks on either side is spent in userspace handling the application protocols, and how much in the networking stack and underlying hardware. That is a question we have to delve into another time.
In this very brief tour of the App Agent, we've seen why we developed it, and have seen striking illustrations of two key design goals of the agent, namely:
Complementing network-based timestamping with in-application timestamping to give visibility into all aspects of an application’s behavior and its interactions with the environment.