Monitoring

Monitoring is a vital component in all operations and security work. The focus differs slightly between the two, but many of the tools and ideas are the same.

In the operations world, we have Site Reliability Engineering (SRE). This is a Google concept, and they have dedicated a site to it; the books are available for free. The underlying ideas about how to approach operations are mostly identical in the security domain. Personally, I find the main difference to lie in the concept of the "malicious adversary".

Monitoring basics

There are three basic patterns used in monitoring:

  • Red/green checker boards - monitoring of instantaneous values and whether or not they exceed an engineering limit, e.g. whether the available space on a partition is less than 1 GB.
  • Graphs - time series data for some metric, such as disk space usage.
  • Logs - the lines of text that all operating systems and services generate. They are indispensable for debugging.

Previously, these would have been three separate systems, but they have since converged and are often found in the same bundle.

Red/green checker board

The idea is simple: make a matrix where a cell is green if there is no issue, and red if something is detected. This makes for a very quick way of maintaining an overview of a system.

The goal is to have it all green, and even non-technical people will know that if something is red, something must be done.

Normally a limit will be set on something measurable, and crossing it will trigger an alert. The board could also show the state of an alarm, where e.g. grey means that the alert has been acknowledged and that someone is working on it.

It is important that there are never red cells that may be ignored, since this will erode trust in the monitoring system. If you have such alerts - don't include them in the overview.
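
As a minimal sketch, the check behind such a cell could look like the following, using the 1 GB free-space limit from the earlier example. The state names, the path, and the "acknowledged" flag are illustrative, not tied to any particular monitoring product:

    import shutil

    LIMIT_BYTES = 1 * 1024**3  # 1 GB engineering limit from the example above

    def disk_check(path="/", acknowledged=False):
        # Green if free space is above the limit, grey if someone has
        # acknowledged the alert and is working on it, red otherwise.
        free = shutil.disk_usage(path).free
        if free >= LIMIT_BYTES:
            return "green"
        return "grey" if acknowledged else "red"

    print(disk_check("/"))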

Graphs

Time series data showing one or more metrics for a given period. This is very useful for understanding the system and for investigating incidents. Humans are really good at understanding graphs and their shapes, and at recognizing what an anomaly looks like.

Graphs are the place to start when trying to understand a system. Based on that experience, engineering limits or alerts may be defined. These are the alerts that the red/green checker board is based on.
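
A hypothetical collector for such a metric can be very simple, as in the sketch below, which appends one disk-usage sample per minute to a CSV file that any graphing tool can plot. The file name and interval are assumptions; in practice a collector such as collectd or Prometheus would do the sampling:

    import shutil
    import time

    def sample_disk_usage(path="/", outfile="disk_usage.csv"):
        # Append one "timestamp,bytes used" sample to the CSV file.
        used = shutil.disk_usage(path).used
        with open(outfile, "a") as f:
            f.write(f"{int(time.time())},{used}\n")

    while True:  # run as a simple sampling daemon
        sample_disk_usage()
        time.sleep(60)  # one sample per minute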

Logs

Logs have been around for a long time. All Unix systems have /var/log or similar, where programs and the OS store lines of text about what they are doing.

The syslog protocol is a standard protocol for transmitting logs across networks to a log collector, so the logs may be viewed and processed together. This is a capability that almost all network equipment has. Sending the logs over the network has the added benefit that they no longer live only on the originating device, which makes it much more difficult for an attacker to cover their tracks - even with administrative privileges.
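
In Python, for example, the standard library can ship logs to such a collector over the syslog protocol. In this sketch, "loghost" and port 514 are placeholders for your own collector:

    import logging
    import logging.handlers

    logger = logging.getLogger("myservice")
    logger.setLevel(logging.INFO)

    # Send log records over the syslog protocol (UDP by default) to a
    # central collector; "loghost" is a placeholder for your collector.
    handler = logging.handlers.SysLogHandler(address=("loghost", 514))
    handler.setFormatter(logging.Formatter("myservice: %(levelname)s %(message)s"))
    logger.addHandler(handler)

    logger.info("service started")  # now also stored off-device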

Log entries may be stored in a database, either parsed or as plain text. The latter is simple and requires no knowledge of the data; the former makes searching much less intensive for the server.
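
As an illustration of the parsed variant, a classic syslog line can be split into fields before storage, so the database can index timestamp, host, and program separately. The pattern below is a simplified example, not a complete syslog grammar:

    import re

    # Simplified pattern for a classic "MMM DD hh:mm:ss host program[pid]: msg" line.
    PATTERN = re.compile(
        r"(?P<timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
        r"(?P<host>\S+) (?P<program>[\w\-/]+)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)"
    )

    line = "Jan  5 10:15:32 web01 sshd[4122]: Failed password for root"
    match = PATTERN.match(line)
    if match:
        print(match.groupdict())  # fields ready for insertion into a database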

Designing dashboards

In general, dashboards are a form of communication, and are governed by the same rules:

  • There is an audience
  • There is a purpose
  • There is an interface. This is normally a UI with graphs, tables, or other widgets.

If you try to talk to a broad audience, the content must be superficial and easy to understand. For a very specific audience, perhaps one or two people, the content may be very complex and technical with little to no explanation.

When a dashboard tries to show multiple things, the message becomes fuzzy and harder for the audience to follow. In that case, consider making more dashboards - either for multiple sub-audiences or by subdividing the content.

Lastly, as any UI person would insist, the look and feel must be fairly consistent. The user will have a reasonable expectation of what buttons and sliders do, and what different colors mean. This includes granularity, time ranges, and which datasets/information pools the dashboard shows.

There will naturally be different levels of dashboards, ranging from a simple checker board for the large monitor in the office to the complete reference dashboard that shows all metrics. Choose the audience and content of the dashboard to match the purpose.

There are many studies of visualizations. A starting point could be Kennedy Elliott's talk "39 studies about human perception in 30 minutes".

Don't forget to hide sensitive data. It is common to have sensitive data in logs, and disclosing it would be a GDPR violation. There are ways of scrubbing the data (API/SSH/other keys, usernames, passwords, SSNs, and the like) either at the output step or, even better, at the ingress step.
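
A minimal scrubbing step at ingress could look like the sketch below. The patterns are examples only; a real deployment needs a list tuned to its own data:

    import re

    # Example scrub patterns - real deployments need their own, tuned list.
    SCRUBBERS = [
        (re.compile(r"password=\S+"), "password=<redacted>"),
        (re.compile(r"\b\d{6}-\d{4}\b"), "<ssn>"),       # Danish CPR-style SSN
        (re.compile(r"AKIA[0-9A-Z]{16}"), "<aws-key>"),  # AWS access key id
    ]

    def scrub(line):
        for pattern, replacement in SCRUBBERS:
            line = pattern.sub(replacement, line)
        return line

    print(scrub("login ok user=alice password=hunter2"))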

Storing and processing

The main problem is limiting the amount of data collected and stored. Having lots of data is fine as long as you are able to process it and extract the relevant information from it. An important concept is "retention time", which is how far back in time data is kept. For some data, a couple of days might be sufficient; for other data, compliance regulations will mandate years.
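
Enforcing a retention time can be as simple as the sweep sketched below, which deletes files older than the window. The directory and the seven-day window are assumptions for the example; real systems often delegate this to logrotate or to the database's own expiry mechanism:

    import os
    import time

    RETENTION_SECONDS = 7 * 24 * 3600  # example: keep seven days of data

    def enforce_retention(directory="/var/log/archive"):
        # Delete files whose last modification is older than the window.
        cutoff = time.time() - RETENTION_SECONDS
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                os.remove(path)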

All of this ties into the broader discussion of big data and AI. As with all things IT, the saying "garbage in, garbage out" applies: any later analysis is only as good as the data collected.

In security, we have Security Information and Event Management (SIEM) systems, which import events - often from logs - to maintain an overview of what happens and to prioritize response.