Application monitoring

Application monitoring is a widely used technique to predict and report issues at runtime. The number of running applications is so large that it is impossible for a human to monitor them continuously, or at least it would be too costly, so monitoring is usually automated. And that’s good, because even though monitoring is very important, it’s also quite boring to stare at the same screen all day.


Today’s monitoring and management systems (see, e.g., the List of systems management systems) regularly gather the state of networks and applications to give us an overview of how the system works now, and also of how it worked in the past. This historical view is very helpful for seeing trends and predicting how the system will behave in the near future. Monitoring systems can also raise an alert or notification when a watched parameter reaches its limit or threshold.

Maximum limits, alerts and thresholds

Every resource has its limit, and this must be considered in the system architecture, in the Application configuration and in the monitoring settings. For example, a database sets its limit for maximum open connections to 1000. It’s unlikely that your Application will be the only one using this database. At the very least, a monitoring system will also be connected, because you have to monitor the database itself too. So you cannot use 100% of this resource. The total number of connections the Application is allowed to create will be somewhat less, let’s say 900. In the Application configuration, this number is divided by the maximum allowed number of running instances.
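
As a minimal sketch of that arithmetic, the snippet below splits the connection budget between instances. The 1000 and 900 come from the example above; the reserve of 100 is simply their difference, and the maximum of 4 instances and the function name `per_instance_limit` are assumptions made only for illustration.

```python
def per_instance_limit(resource_max: int, reserved: int, max_instances: int) -> int:
    """Return how many connections a single Application instance may open.

    resource_max  - limit defined by the resource (e.g. the database)
    reserved      - connections left for monitoring and other applications
    max_instances - maximum number of Application instances (assumed value)
    """
    if max_instances <= 0:
        raise ValueError("max_instances must be positive")
    application_limit = resource_max - reserved
    if application_limit <= 0:
        raise ValueError("nothing is left for the Application after the reserve")
    # Floor division: the sum over all instances never exceeds the application limit.
    return application_limit // max_instances


# 1000 connections on the database, 100 kept free, at most 4 instances (assumed).
print(per_instance_limit(resource_max=1000, reserved=100, max_instances=4))  # 225
```

Rounding down is a deliberate choice here: when the budget does not divide evenly, a few connections stay unused rather than the application limit being exceeded.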

There are three levels of limits:

  1. Instance limits – the maximum number for a single Application instance, which cannot be exceeded. The monitoring system should watch this limit per instance. Reaching this limit means that the given Application instance is overloaded and can fail, or at least that its performance is significantly degraded. If application limits are set correctly, the resource itself should be safe. Usually the value of this limit is the same as the value set in the Application configuration for the given resource.
  2. Application limits – applicable if the Application runs in a cluster. It is the maximum number for all Application instances together, which cannot be exceeded. Reaching this limit means that both the Application and the resource could have a problem. For example, new connections can be refused and the Application will not be able to continue, performance can degrade on both sides, other applications using the same resource can report problems,… Usually it is easy to calculate as the instance limit multiplied by the maximum number of instances in the cluster. For shared resources this can be insufficient and tricky to set up. Knowledge of the configuration of other applications would be helpful. If you don’t have such information, performance tests are the only way to get close to the limits.
  3. Resource (system) limits – the maximum number defined by the resource itself, which cannot be exceeded by all applications in the system together. Reaching this limit means that the resource is overloaded and all applications using it can have problems. The resource itself can fail or have performance issues. Connected applications can fail or suffer degraded performance, depending on their failover implementation. The resource limit set in the monitoring and management system should copy the maximum value defined by the resource.
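
The three levels can be sketched as one small check. This is only an illustration, not how any particular monitoring system works: the names `Limits` and `breached_levels`, the instance names and the sample numbers (225 per instance, 900 per Application, 1000 on the resource) are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Limits:
    instance: int     # maximum for one Application instance
    application: int  # maximum for all instances of the Application together
    resource: int     # maximum defined by the resource itself


def breached_levels(per_instance: Dict[str, int], other_apps: int, limits: Limits) -> List[str]:
    """Return the limit levels reached by the current usage (hypothetical helper).

    per_instance - connections currently used by each Application instance
    other_apps   - connections used by everything else on the same resource
    """
    breached = []
    for name, used in sorted(per_instance.items()):
        if used >= limits.instance:
            breached.append(f"instance:{name}")
    app_total = sum(per_instance.values())
    if app_total >= limits.application:
        breached.append("application")
    if app_total + other_apps >= limits.resource:
        breached.append("resource")
    return breached


limits = Limits(instance=225, application=900, resource=1000)
# One overloaded instance, while the Application and the resource are still safe.
print(breached_levels({"app-1": 225, "app-2": 120}, other_apps=60, limits=limits))
# ['instance:app-1']
```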

There are various ways to avoid hitting the limits, for example automatic scaling, load throttling or specific manual interventions. Automatic scaling is great when an instance limit is being reached. It is widely used by container platforms like Kubernetes or OpenShift. At the top, scaling is bounded by the application and resource limits. Load throttling is the last thing you want to do, and it means that the health of the resource is more important than your Application. For well-sized systems this is not a common situation, and usually it is used in the case of some type of DoS attack.

Monitoring and management systems are able to raise an alert when something happens. That is actually the reason for their existence. Alerting after the limit is reached is usually too late to react, so thresholds are defined below the maximum limits. The value of a threshold should be higher than the average or expected values for normal operation, so that it does not produce frequent false alerts. On the other hand, it should not be too close to the limit, because there could be too little time to react. Setting the correct value usually requires a bit of experimentation. Performance tests can be very helpful in finding the correct settings. It is also a good habit to review threshold settings after significant changes in the system or the Application.
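
The relation between expected values, threshold and limit can be written down as a simple rule. The sketch below is an assumption-laden illustration: the function names and the sample numbers (expected peak 150, threshold 180, limit 225) are invented, and real values should come from performance tests.

```python
def validate_threshold(expected_peak: int, threshold: int, limit: int) -> None:
    """Check that the threshold lies above normal operation and below the limit."""
    if not expected_peak < threshold < limit:
        raise ValueError(
            f"threshold {threshold} must be above the expected peak {expected_peak} "
            f"and below the limit {limit}"
        )


def evaluate(value: int, threshold: int, limit: int) -> str:
    """Classify one measured value against the threshold and the limit."""
    if value >= limit:
        return "limit reached"
    if value >= threshold:
        return "threshold exceeded"
    return "ok"


validate_threshold(expected_peak=150, threshold=180, limit=225)
for measured in (140, 190, 225):
    print(measured, evaluate(measured, threshold=180, limit=225))
```

How far below the limit the threshold should sit depends on how fast the value can grow and how quickly somebody can react, which is why the sketch only validates the ordering and does not compute the threshold itself.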

Alerts can have various levels, such as info, warning and error. It is crucial that you include all relevant information in the description. Moreover, the description should be in human language; tips for next steps or even a solution are highly recommended. Imagine the people receiving the alert sometime during the midnight shift: without knowing your Application in detail, they must be able to recognize the level of risk and choose the correct action to prevent reaching the limits. That’s impossible if they don’t know what was measured, what the value means or what the maximum allowed value is.
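
One possible shape of such a description is sketched below. The `Alert` class, its fields and the sample text are hypothetical and not tied to any monitoring product; the point is only that the message states what was measured, the value, the threshold, the maximum and a suggested next step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    level: str      # e.g. "info", "warning" or "error"
    metric: str     # what was measured, in human language
    value: int      # measured value
    threshold: int  # value at which the alert fires
    limit: int      # maximum allowed value
    next_step: str  # tip for whoever receives the alert

    def describe(self) -> str:
        """Build a description that is readable without knowing the Application."""
        used = 100 * self.value / self.limit
        return (
            f"[{self.level.upper()}] {self.metric}: {self.value} of {self.limit} "
            f"({used:.0f}% of the maximum, alert threshold {self.threshold}). "
            f"Next step: {self.next_step}"
        )


alert = Alert(
    level="warning",
    metric="open database connections of instance app-1",
    value=190,
    threshold=180,
    limit=225,
    next_step="check for long-running queries; add an instance if load keeps growing",
)
print(alert.describe())
```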

Application support

Application support should start from the very beginning, during architecture design. Deployment diagrams together with sizing calculations encourage you to also define a list of used resources to monitor. They are also good starting points for defining limits and thresholds.

Implementation alone is quite easy. You don’t have to implement everything on your own. There are plenty of production-ready libraries and components out there (e.g., Spring Actuator). They come with ready-made monitoring gauges for common types of resources (e.g., datasource, JMS,…) and support output in the most used formats. As a bonus, they provide standard probes for container platforms.

Default implementations are a good starting point, but the job is not done yet. Depending on the complexity of your application, there can be types of resources which are not supported by default, or for which the default implementation is somehow limited (typically to only one resource of a type). Account for a little bit of coding in your estimates. Don’t forget about monitoring when planning tasks that accompany adding a new external resource to the Application. Such a task cannot be considered done without implemented output for monitoring. If you use an agile methodology for project development, add monitoring to your Definition of Done template.

Low-code platforms represent a special type of implementation. In various forms (e.g., BPM engines, ESB, workflow engines,…) they have been with us for over 20 years and have grown into powerful tools. Many of them let you define custom persistent connectors or pools, which should also be watched. Even if you touch some resource directly, the number of parallel connections should be monitored and limited. If your platform doesn’t support monitoring of the resources used in custom-made flows, try to find a plugin or change the platform. Adding and maintaining monitoring support directly in each flow is usually not effective and makes the flows more complicated.
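
When nothing ready-made is available, limiting and measuring parallel connections can be as small as the sketch below. It is a generic illustration built on Python’s standard `threading.BoundedSemaphore`, not the API of any low-code platform; the class name `ConnectionGate`, its methods and the limit of 2 are assumptions.

```python
import threading
from contextlib import contextmanager


class ConnectionGate:
    """Limits parallel use of a resource and exposes the current usage as a gauge."""

    def __init__(self, max_parallel: int) -> None:
        self.max_parallel = max_parallel
        self._slots = threading.BoundedSemaphore(max_parallel)
        self._lock = threading.Lock()
        self._in_use = 0

    @contextmanager
    def slot(self, timeout: float = 1.0):
        """Hold one connection slot for the duration of the with-block."""
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no free connection slot")
        with self._lock:
            self._in_use += 1
        try:
            yield
        finally:
            with self._lock:
                self._in_use -= 1
            self._slots.release()

    def in_use(self) -> int:
        """Gauge value that a monitoring system could read periodically."""
        with self._lock:
            return self._in_use


gate = ConnectionGate(max_parallel=2)
with gate.slot():
    print(gate.in_use(), "of", gate.max_parallel)  # 1 of 2
print(gate.in_use(), "of", gate.max_parallel)      # 0 of 2
```

Keeping the limit and the gauge in one place means every flow that uses the gate is monitored automatically, instead of each flow carrying its own monitoring code.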

Last but not least, use the same monitoring in test environments as in production. Maybe not all environments are the same size as production, so you will not be able to set up the same limits and thresholds. But at least the configuration of the monitoring system will keep pace with the Application, and the descriptions of alerts get a chance to improve before an alert first appears in production. It also has a huge effect on the people in the team. They get a broader view of the architecture and of solving issues in the production environment.

Think like a…

SW architects think like a System administrator
SW Engineers think like a System administrator
SW architects think like a SW Engineer
  

Give feedback

SW Engineer

Give feedback to the SW architect and the Analyst if you use an external resource which was not mentioned in the analysis or system architecture. First of all, all interactions of the application with other systems should be documented. Second, it will help in the future to create more precise analyses and architectures, and also more accurate estimates.