We (and I’m fully part of it) deploy and use plenty of security monitoring tools daily. As our beloved data is often spread across complex infrastructures, or simply across multiple physical locations, we have to collect the interesting information and bring it to a central place for further analysis. That’s called “log management“. Based on the collected events, you can generate alerts and build reports. Nice! But… if systems and applications generate [hundreds|thousands|millions] of events, those events are processed by the same kind of hardware running some piece of software. Hardware may fail (network outage, power outage, disk crash) and software has bugs (plenty of them).
This morning, I received several alerts like this:
** Alert 1336642415.2196887: mail - ossec, 2012 May 10 11:33:35 xxxxxxxx->ossec-monitord Rule: 504 (level 10) -> 'Ossec agent disconnected.' Src IP: (none) User: (none) ossec: Agent disconnected: 'xxxxxxxx-10.0.0.1'.
This message warns that an OSSEC agent is not alive, which is very suspicious. A few minutes later, the same message arrived for another server, and so on, one by one… After a quick check, all servers and network connections were fine. The problem was on the OSSEC server itself: a typo in a new rule had put some processes in a fuzzy state. Killing the processes and properly restarting the OSSEC server solved the problem. This example based on OSSEC is just an introduction to the topic of this quick blog post: when you deploy security monitoring solutions, be sure to monitor them too!
In parallel to the security checks performed by your log management solution, extra verifications must be performed to control the flow of events and, when required, trigger other types of alerts. A classic situation is when events are pushed to the log management platform: it waits passively for incoming events. This can be summed up as “No event received? Everything ok! Let’s have some sleep…“. Examples of suspicious situations:
- You did not receive any new Syslog events from a specific host for x minutes.
→ The Syslog daemon might be down, or a network outage may prevent UDP packets from reaching the Syslog concentrator (a minimal detection sketch for this case follows the list).
- You did not process any new lines from an Apache log file.
→ Apache might be in trouble, the file system might be full, or the log file might be unreadable (wrong permissions).
- You did not receive any new alerts for x hours.
→ Your log management system might be overloaded, a process might have been killed, or a file system might be full.
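To illustrate the first case, here is a minimal sketch in Python, assuming your Syslog concentrator writes events into one directory per host. The /var/log/hosts path, the 15-minute threshold and the way the alert is raised are assumptions to adapt to your own environment:

#!/usr/bin/env python
# Minimal sketch: alert when a host stopped sending Syslog events.
# Assumption: the concentrator stores one directory per host under LOG_ROOT
# and the mtime of the newest file reflects the last event received.
import os
import sys
import time

LOG_ROOT = "/var/log/hosts"   # adapt to your layout (assumption)
MAX_SILENCE = 15 * 60         # x minutes, in seconds

def last_event(host_dir):
    """Return the most recent mtime among a host's log files."""
    newest = 0
    for root, _dirs, files in os.walk(host_dir):
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            newest = max(newest, mtime)
    return newest

def main():
    now = time.time()
    silent = []
    for host in sorted(os.listdir(LOG_ROOT)):
        host_dir = os.path.join(LOG_ROOT, host)
        if not os.path.isdir(host_dir):
            continue
        age = now - last_event(host_dir)
        if age > MAX_SILENCE:
            silent.append("%s (no events for %d minutes)" % (host, age // 60))
    if silent:
        # Replace the print with a mail, an SNMP trap or whatever fits your alerting chain.
        print("WARNING: silent hosts: " + ", ".join(silent))
        sys.exit(1)

if __name__ == "__main__":
    main()

Run it from cron every few minutes; the non-zero exit code makes it easy to plug into an existing Nagios/Zabbix style check.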
There are plenty of nightmare examples like those. How to prevent them?
- Like any other information system, keep an eye on the system health (CPU, memory, storage, processes). Disk space is critical and directly depends on your volume of data and your retention policies (see the first sketch after this list).
- Send keep-alives to your remote [pollers|sensors|agents] (whatever you name them).
- Watch for any deviation from your regular flow of events, compared to a baseline for a defined period (hourly, daily, etc). For example, is it normal not to see any login events from your Active Directory on a Monday morning? (See the second sketch after this list.)
- Implement queuing mechanisms to prevent events from being lost (when they are automatically pushed to the central system).
- When possible, collect events using pull technologies. If the log management platform has trouble, events won’t be lost and will wait until they are retrieved later.
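About the first point, disk space is the easiest thing to watch and the most painful to forget. A minimal sketch, assuming your events and alerts live under /var/log and /var/ossec (the partitions and the 85% threshold are assumptions, adjust them to your retention policy):

#!/usr/bin/env python
# Minimal sketch: warn before the partitions holding your events fill up.
# Partitions and threshold are assumptions; adapt them to your retention policy.
import os
import sys

PARTITIONS = ["/var/log", "/var/ossec"]  # where your events and alerts live
THRESHOLD = 85                           # percent used

problems = []
for path in PARTITIONS:
    st = os.statvfs(path)
    used_pct = 100 - (st.f_bavail * 100.0 / st.f_blocks)
    if used_pct >= THRESHOLD:
        problems.append("%s is %d%% full" % (path, used_pct))

if problems:
    # Replace the print with a mail, an SNMP trap or a syslog message.
    print("WARNING: " + "; ".join(problems))
    sys.exit(1)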
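And here is a sketch of the baseline idea: compare the number of events received during the current hour with the same hour on the same weekday over the last weeks. get_event_count() is a hypothetical stub to implement against your own event store (SQL database, Elasticsearch, flat files, …):

#!/usr/bin/env python
# Minimal sketch: detect a drop in the event flow compared to a baseline.
# get_event_count() is a hypothetical helper: plug in a query against your
# own log management backend.
import datetime
import sys

WEEKS_OF_HISTORY = 4     # baseline = same hour / same weekday, last 4 weeks
MIN_RATIO = 0.25         # alert if we see less than 25% of the usual volume

def get_event_count(start, end):
    """Return the number of events received between two datetimes.
    Hypothetical stub: implement it against your own event store."""
    raise NotImplementedError

def check_current_hour():
    now = datetime.datetime.now().replace(minute=0, second=0, microsecond=0)
    one_hour = datetime.timedelta(hours=1)
    current = get_event_count(now, now + one_hour)

    history = []
    for week in range(1, WEEKS_OF_HISTORY + 1):
        start = now - datetime.timedelta(weeks=week)
        history.append(get_event_count(start, start + one_hour))
    baseline = sum(history) / float(len(history))

    if baseline > 0 and current < baseline * MIN_RATIO:
        print("WARNING: only %d events this hour, baseline is %.0f" % (current, baseline))
        sys.exit(1)

if __name__ == "__main__":
    check_current_hour()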
Don’t forget: Log management solutions are your best friends when you need to investigate a security incident. There is nothing more frustrating than gaps in your events timeline!
Hi Martin,
That is what I need to do in my infrastructure, especially the point where you say “You did not receive any new Syslog events from a specific host for x minutes”.
What tools do you recommend to monitor this type of problem?
Thanks.
Hi, I wrote an article last year about monitoring your monitoring tools (MySQL inside):
http://www.mysqlplus.net/2011/12/02/monitoring-monitoring-tools-monyog-inside/