Tell Me How You Work and I'll Monitor You!


Today, I read an interesting story in Datanews, a Belgian IT newspaper. To summarize briefly: “Company A”, the customer, is in dispute with “Company B”, the telecom operator, which installed a telephone exchange on the customer’s premises. During a weekend, hackers took control of the system and used it to place calls to very-high-rate countries. The result was a huge bill of more than 7000 EUR! The customer refused to pay the bill and complained that the telecom operator did no proactive monitoring of network usage. On the other side, the telco also had a killer argument: one of the customer’s employees had configured a very simple password on his voice mail (“1111”). Read the full article in French here (English translation).

Granted, this user password was not strong enough (but is a PIN code ever strong enough? A four-digit PIN has only 10^4 = 10,000 possible combinations). Still, the customer raised an interesting question: why didn’t the telecom company perform “behavioral monitoring”?

Let’s review some types of monitoring… The first type, which I call “dumb monitoring”, just compares a key indicator against a predefined value or an interval of values. A good example is the Nagios monitoring tool. Each key indicator is compared to two values: a warning threshold and a critical threshold. Such indicators may be the CPU usage of a server or the used percentage of a file system:

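// Percentage of the file system holding $path that is currently in use.
function get_disk_usage($path) {
   $total = disk_total_space($path);
   return 100 * ($total - disk_free_space($path)) / $total;
}

$disk_usage = get_disk_usage("/var");
$warning = 80;   // warning threshold (%), must be lower than the critical one
$critical = 90;  // critical threshold (%)
if ($disk_usage < $warning)
   print "Disk usage OK";
elseif ($disk_usage < $critical)
   print "Disk usage WARNING";
else
   print "Disk usage CRITICAL";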

The main issue with dumb monitoring is false alarms: some jobs executed during the night or weekend can fill the file system up to 98% for a few minutes, then clean up their temporary files. If the monitoring tool polls the file system usage while the job is running, a false alarm will be generated, followed by a recovery. Another phenomenon is "flapping", when the CPU or disk usage keeps swinging between low and high values (example: a job executed every hour which consumes 100% of the CPU).
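One common way to soften such false alarms is to alert only after several consecutive bad polls (Nagios offers a comparable retry setting, max_check_attempts). Here is a minimal sketch of the idea; the class name, the default of three attempts and the sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ThresholdCheck:
    """Raise an alert only after `max_attempts` consecutive breaches."""
    critical: float        # threshold in percent
    max_attempts: int = 3  # hypothetical default, tune per indicator
    breaches: int = 0      # consecutive polls at or above the threshold

    def poll(self, value: float) -> bool:
        """Record one polled value; return True if an alert should fire."""
        if value >= self.critical:
            self.breaches += 1
        else:
            self.breaches = 0  # any healthy poll resets the counter
        return self.breaches >= self.max_attempts


check = ThresholdCheck(critical=90)
# A nightly job fills the disk for two polls, then cleans up: no alert.
spike = [check.poll(v) for v in [40, 98, 97, 45]]
# A disk that stays full for three polls in a row does raise an alert.
stuck = [check.poll(v) for v in [95, 96, 97]]
```

This hides short spikes, but it is still dumb monitoring: it delays the alert without knowing whether the spike was expected.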

The second type of monitoring is called trending or forecast monitoring. Based on algorithms such as Holt-Winters, we try to estimate when the key indicator will be in trouble. Applied to our first example (file system usage), we should be able to eliminate false positive alerts (when the file system is temporarily almost full) but also perform forecasting. As in the first type of monitoring, the file system usage is polled, but instead of checking the returned value directly, we pass it to the algorithm, which estimates when the file system will be full and draws nice graphs. Check out the example below:

Note that trending is a very interesting tool for managers. The generated graphs are excellent evidence: it is easier to defend a purchase order for extra disks in the company SAN, or for more bandwidth, with graphs in hand.
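To give an idea of how such a forecast can be computed, here is a rough sketch using Holt's linear (double exponential) smoothing, that is, Holt-Winters without its seasonal component. It is a simplification, not the full algorithm; the function names and the alpha and beta smoothing factors are assumptions chosen for the example.

```python
def holt_forecast(samples, alpha=0.5, beta=0.3):
    """Return (level, trend) after Holt's linear smoothing of `samples`.

    `samples` needs at least two values; alpha and beta are assumed defaults.
    """
    level = samples[0]
    trend = samples[1] - samples[0]
    for value in samples[1:]:
        previous_level = level
        # Blend the new observation with the previous one-step forecast.
        level = alpha * value + (1 - alpha) * (level + trend)
        # Blend the latest change in level with the previous trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend


def polls_until_full(samples, capacity=100.0, **kwargs):
    """Estimate how many polling intervals remain before `capacity` is hit.

    Returns None when usage is flat or shrinking (nothing to forecast).
    """
    level, trend = holt_forecast(samples, **kwargs)
    if trend <= 0:
        return None
    return max(0.0, (capacity - level) / trend)


# Disk usage in percent, one sample per day, growing by 2 points a day.
days_left = polls_until_full([50, 52, 54, 56, 58])
```

With one sample per day, the returned value reads as "days until the disk is full", the kind of number that ends up on the graph shown to management.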

Finally, the state of the art: behavioral monitoring. The oldest gamers will remember the Lemmings game. It was based on the supposed behavior of real lemmings, which are popularly believed to act en masse at the same time, even when the action is dangerous for the whole community. To set up behavioral monitoring, we must first define an activity profile for our key indicator. As with the first two methods, polling is performed, but the received values are compared to the defined profile and also to previous alerts! Back to our file system usage: a behavioral monitoring solution will be able to detect whether a disk-full event is "normal" or not. Example: every Sunday night, the disk is full due to a backup procedure, but it must be cleaned up a few hours later. A disk still full on Monday morning will generate an alert!

It is also possible to mix multiple sources of data to build the policy. A nice example is the correlation of a user's access badge with his session on the network. If the user swipes his badge to enter building "A", he must open a session on a computer located in the same building; otherwise, his badge might have been stolen or his account compromised. Detecting unknown protocols on your network, or known ones outside business hours, is also part of behavioral monitoring. Example: is it normal to detect VoIP traffic during the weekend? Yes, if a valid user used his access card to enter the building and logged on to the network! To sum up, behavioral monitoring tells normal activity from abnormal activity.
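The two examples above (the Sunday backup and the badge/session correlation) can be sketched in a few lines. This is only an illustration: the profile format, the time windows and the function names are hypothetical, and a real product would learn the profile rather than hard-code it.

```python
from datetime import datetime

# Hypothetical activity profile: (weekday, first_hour, last_hour) windows
# during which a full disk is expected. Monday is 0, Sunday is 6.
EXPECTED_FULL_WINDOWS = [(6, 22, 23), (0, 0, 3)]  # Sunday 22:00 to Monday 03:59


def disk_full_is_abnormal(when: datetime, windows=EXPECTED_FULL_WINDOWS) -> bool:
    """True when a disk-full event falls outside every expected window."""
    return not any(
        when.weekday() == day and first <= when.hour <= last
        for day, first, last in windows
    )


def session_is_suspicious(badge_building, session_building) -> bool:
    """True when a session opens in a building the user did not badge into.

    `badge_building` is None when the user has not badged in anywhere.
    """
    return badge_building != session_building


sunday_night = datetime(2024, 1, 7, 23, 0)   # a Sunday, during the backup
monday_morning = datetime(2024, 1, 8, 9, 0)  # the next morning
backup_alert = disk_full_is_abnormal(sunday_night)
leftover_alert = disk_full_is_abnormal(monday_morning)
```

The same disk-full event is ignored on Sunday night and reported on Monday morning, which is exactly what a fixed threshold cannot do.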

Behavioral monitoring is already used in many businesses. Your bank monitors your credit card activity. If you just paid for something in a restaurant in France and, a few minutes later, the same card is used in Japan, something is wrong! (or you have some nice teleportation facilities ;-). The same applies to your mobile phone SIM card. If your mobile is connected to your local operator and, at the same time, a connection request is forwarded from a foreign operator (roaming), it is the same issue: your SIM card might have been stolen or cloned.
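The credit card case boils down to an "impossible travel" test: could the card holder physically have moved from the first place to the second in the elapsed time? Below is a simple sketch using the haversine great-circle distance. The speed ceiling and the approximate coordinates are assumptions for the example; this is not how any particular bank does it.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 1000.0  # assumed ceiling, roughly airliner cruise speed


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def impossible_travel(place_a, place_b, minutes_apart, max_speed=MAX_SPEED_KMH):
    """True when the trip would require moving faster than `max_speed`."""
    reachable_km = max_speed * minutes_apart / 60
    return distance_km(*place_a, *place_b) > reachable_km


# Approximate (latitude, longitude) pairs, for illustration only.
PARIS = (48.86, 2.35)
LYON = (45.76, 4.84)
TOKYO = (35.68, 139.69)

teleported = impossible_travel(PARIS, TOKYO, minutes_apart=10)
plausible = impossible_travel(PARIS, LYON, minutes_apart=120)
```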

Now, back to the story described above. I fully agree with the "customer". The telecom operator has all the data required to put behavioral monitoring in place! One caveat in this case: it is dangerous to stop the service when suspicious activity is detected (the telephone can be used to call 911 in case of emergency!). But why not notify the customer BEFORE sending the bill?
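What such a notification could look like, very roughly: compare the call charges of the current weekend with those of previous weekends and warn the customer when the amount is far out of profile, without cutting the line. Everything here (the function names, the three-sigma rule, the 10 EUR floor and the sample figures) is an assumption for the sake of the example, not the operator's actual system.

```python
from statistics import mean, pstdev


def spend_is_abnormal(history, current, sigmas=3.0, floor=10.0):
    """True when `current` spend is far above what `history` makes normal.

    `history` holds the spend of comparable past periods (e.g. weekends);
    `floor` is a minimum amount in EUR below which nothing is reported.
    """
    baseline = mean(history)
    spread = pstdev(history)
    return current > max(floor, baseline + sigmas * spread)


def review_usage(history, current, notify):
    """Warn the customer instead of cutting the line (emergency calls still work)."""
    if spend_is_abnormal(history, current):
        notify(f"Unusual call charges: {current:.2f} EUR "
               f"(usual: {mean(history):.2f} EUR)")
        return True
    return False


# Hypothetical weekend charges in EUR for an office that is closed on weekends.
past_weekends = [0, 2.5, 1, 0, 3]
messages = []
quiet = review_usage(past_weekends, 4.0, messages.append)
hacked = review_usage(past_weekends, 7000.0, messages.append)
```

In a real deployment the check would run during the weekend, as charges accumulate, so the warning arrives while the fraud is still small.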

Tell Me How You Work and I’ll Monitor You!
