In part 1 of this post, I went through how systems have become more complex, pulling in more and more components and frameworks from third parties. The disadvantage is a loss of insight into what's happening behind all these closed doors, which makes issues harder to identify. In this post, I'll go through how you can solve that problem.

By tracing transactions as they traverse your stack, you can identify issues much more quickly. This is achieved by passing a unique identifier from component to component through the stack and logging on each major event. Using an identifier such as a user's SessionID, every request within the stack is logged with that SessionID included, linking the request/transaction back to the user. It's then possible to go to Logentries, create an uber-log (an aggregated view of your selected logs), and then filter the logs on the unique SessionID… easy, right?
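
To make the idea concrete, here is a minimal sketch of such a logging helper in JavaScript. The names (`logWithSession`, `shipToLogentries`) are illustrative, not a real Logentries API; the point is simply that every entry gets stamped with the same session identifier before it is shipped off.

```javascript
// Stand-in for however you actually forward logs (agent, syslog, HTTP...).
function shipToLogentries(line) {
  console.log(line);
}

// Stamp every log entry with the session identifier plus an action name.
function logWithSession(sessionId, action, fields = {}) {
  const pairs = Object.entries({ ACTION: action, SessionID: sessionId, ...fields })
    .map(([key, value]) => `${key}=${JSON.stringify(value)}`)
    .join(' ');
  shipToLogentries(`${new Date().toISOString()} ${pairs}`);
}

// Every component logs through the same helper, so each entry carries
// the SessionID and the whole transaction can be filtered out later.
logWithSession('123456', 'SendMessage', { MessageTO: 'user123', Message: 'Hey Tom, love the photo.' });
```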

Example of Tracking a SessionID

To illustrate, I’ll run through a small example. Imagine that I have a small social network application with the following setup:

  • a DB server
  • an App server
  • a Web server
  • a Web client

The web client is a JS single-page app that uses a RESTful interface to transfer data to and from the server.
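
On a client like that, one way to make the identifier travel with every REST call is a small wrapper around `fetch` that adds it as a request header. The header name `X-Unique-Client-ID` is an assumption for illustration; the server side just has to read the same header and include it in its own log lines.

```javascript
const UNIQUE_CLIENT_ID = '123456';

// Wrapper around fetch that attaches the identifier to every request.
function apiFetch(url, options = {}) {
  const headers = { ...(options.headers || {}), 'X-Unique-Client-ID': UNIQUE_CLIENT_ID };
  return fetch(url, { ...options, headers });
}

// Client code always calls apiFetch instead of fetch directly, so no
// request leaves the browser without the identifier attached.
apiFetch('/api/messages', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ MessageTO: 'user123', Message: 'Hey Tom, love the photo.' }),
});
```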

But I'm having issues, as I've noticed **204 No Content** responses being sent back in the logs to Logentries from our users' browsers. Luckily, I had set up Tags and Alerts on Logentries for these types of errors, so I received an email letting me know about the issue. Usually this kind of issue would be very hard to identify or track, as the error is being generated in the front end. But by placing a unique identifier in the log message and passing it down with every call, we can quickly track the transaction throughout the stack.

First off, we look at the log entry that was sent in from the browser within Logentries:

5-12-2014 12:15:111 IP=212.212.212.213 ACTION=ResponseSendMessage REQUEST=SendMessage UniqueClientID=123456 PAYLOAD={'MessageTO':'user123', 'Message':'Hey Tom, love the photo.'} RESPONSE='204 No Content'

From that message we can see that the initial request was to send a message to "user123" but, for some reason, we got a "204 No Content" back. This gives me my starting point, as I would have expected a "200 OK" response. This request would need to run through the app server and down to the DB server for the transaction to complete. So I go to Logentries to track down the issue.
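
As an aside, here is roughly what the app-server side of that chain could look like, assuming Express (which this post doesn't prescribe); the route, header, and table names are illustrative. The point is that the identifier read off the incoming request is repeated on every log line, including around the DB call.

```javascript
const express = require('express');
const app = express();
app.use(express.json());

app.post('/api/messages', (req, res) => {
  // Read the identifier the client attached and reuse it on every log line.
  const clientId = req.get('X-Unique-Client-ID');
  console.log(`ACTION=ReceiveRequestSendMessage UniqueClientID=${clientId}`);

  const sql = "INSERT INTO Messages VALUES ($1, $2, $3)";
  console.log(`ACTION=AddMessageToDB UniqueClientID=${clientId} SQLQuery="${sql}"`);
  // db.query(sql, ...) would run here; the DB layer logs the same UniqueClientID.

  res.status(200).end();
});

app.listen(3000);
```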

Step 1: Collate the logs that I believe might be involved into one big log

In this case I believe it to be on one of the web servers, the app server, or the DB server, so I use the Aggregated view feature to combine their logs into one view.

[Screenshot: creating an aggregated view in Logentries]

This will take all the logs that I have selected and merge their entries, ordered by the time each entry was received by Logentries, into a single uber-log. This is great because we now have all the relevant logs in a single view, without any of the irrelevant logs.

But these logs contain millions of entries between them, so we need to narrow them down.

Step 2: Refine the log entries down to only the ones relating to this user

This is done by simply searching on the unique identifier, which in this case is the "UniqueClientID."
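
In this example, the search can be as simple as the literal key=value pair exactly as it appears in the log lines:

```
UniqueClientID=123456
```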

[Screenshot: searching the aggregated log for the UniqueClientID]

This now gives us a list of log entries that relate to this user/client only, so we are getting closer. But now we want to focus on the time around the 204 error, so as to reduce the results again.

Step 3: Zoom in to the relevant time range

We can use the zoom feature on the graph to go to the relevant time range (12:15pm).

[Screenshot: the aggregated log zoomed in to the time of the error]

Now we have a view of the transactions on all the relevant servers connected with this event, all in sequential order. As you can imagine, this makes identifying the issue a lot easier.

EXAMPLE LOGS

>> 5-12-2014 12:14:787 JS_LOG IP=212.212.212.213 ACTION=InitialSendMessage REQUEST=SendMessage UniqueClientID=123456 PAYLOAD={'MessageTO':'user123', 'Message':'Hey Tom, love the photo.'}
>> 5-12-2014 12:14:889 APP1_LOG IP=212.212.212.213 ACTION=ReceiveRequestSendMessage UniqueClientID=123456 PAYLOAD={'MessageTO':'user123', 'Message':'Hey Tom, love the photo.'}
>> 5-12-2014 12:14:947 APP1_LOG ACTION=AddMessageToDB UniqueClientID=123456 SQLQuery="INSERT INTO Messages VALUES ('User123', 'UserABC', 'Hey Tom, love the photo');"
>> 5-12-2014 12:14:968 DB_CUSTOMLOG ACTION=AddMessageToDB UniqueClientID=123456 SQLQuery="INSERT INTO Messages VALUES ('User123', 'UserABC', 'Hey Tom, love the photo');"
>> 5-12-2014 12:15:012 DB_CUSTOMLOG **ERROR - no results found for query "SELECT * FROM Users WHERE userID='User123'"**
>> 5-12-2014 12:15:044 APP1_LOG ACTION=ResponseAddMessageToDB UniqueClientID=123456 SQLQuery="INSERT INTO Messages VALUES ('User123', 'UserABC', 'Hey Tom, love the photo');" RESPONSE=**"Error - no results found"**
>> 5-12-2014 12:15:067 APP1_LOG ACTION=RespondToRequestSendMessage UniqueClientID=123456 PAYLOAD={'MessageTO':'user123', 'Message':'Hey Tom, love the photo.'} RESPONSE=**{""}**
>> 5-12-2014 12:15:111 JS_LOG IP=212.212.212.213 ACTION=ResponseSendMessage REQUEST=SendMessage UniqueClientID=123456 PAYLOAD={'MessageTO':'user123', 'Message':'Hey Tom, love the photo.'} RESPONSE=**'204 No Content'**

So as you read through the logs (I've bolded the important sections), because they are now in a structured format where you can follow the transaction through the relevant components, you can quickly identify the issue. In this example we can see that the recipient does not exist in the database, and that the APP1 server does not know how to handle this case, so it sent back an empty response. Some further investigation found that when users deleted their accounts, the deletion was not reflected in the UI, so users could still send messages to non-existent users.
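
A sketch of the kind of fix this points to, reusing the hypothetical Express handler from earlier: check that the recipient still exists before inserting, and return an explicit error rather than an empty response. `userExists` is a hypothetical lookup helper, not part of any real API here.

```javascript
app.post('/api/messages', async (req, res) => {
  const clientId = req.get('X-Unique-Client-ID');

  // Guard against sending to an account that has since been deleted.
  const recipientExists = await userExists(req.body.MessageTO); // hypothetical DB lookup
  if (!recipientExists) {
    console.log(`ACTION=RejectSendMessage UniqueClientID=${clientId} REASON="recipient deleted"`);
    return res.status(404).json({ error: 'Recipient no longer exists' });
  }

  // ...insert the message and respond 200 OK as before.
  res.status(200).end();
});
```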

So, by creating this aggregated log and then filtering on the unique ID and the relevant time of the event, I had near-instant visibility into what happened as the request traversed the stack, letting me identify the issue in seconds.

The moral of the story is that by having every client or component that uses your system pass in a unique identifier, and then passing it down the stack with every subsequent call, you gain the ability to very quickly track transactions through your stack. And, as a result, you can track down causes in one part of your system that surface as issues in another.