Monitoring within your SIEM

Pastebin Cat


For those who (still) don’t know, it’s  a website mainly for developers. Its purpose is very simple: You can “paste” text on the website to share it with other developers, friends, etc. You paste it, optionally define an expiration date, if it’s public or private data and your are good. But for a while, this on-line service is more and more used to post “sensitive” information like passwords or emails lists. By “sensitive“, I mean “stolen” or “leaked” data. Indeed, allows anybody to use their services without any authentication, it’s easy to remain completely anonymous (if you submit data via proxy chains, Tor or any other tool which takes care of your privacy)

In big organizations, marketing departments or agencies learned how to use social networks for a long time. They can follow what has been said about their products and marketing campaigns. In my opinion, it is equally important to follow what’s posted about your organization on! Many people are looking for interesting data on from an offensive point of view. Let’s see how this can also benefit to the defensive side.

For me, became an important source of information and I keep an eye on it every day. But, due to the huge amount of information posted every minute, it is impossible to process it manually. Of course, you can search for some keywords but it’s totally inefficient. In a first time, I grabbed and processed some HTML content using the classic UNIX tools. Later, I found a nice Python script developed by Xavier Garcia: It checks continuously for data leaks on using regular expressions. I kept it running for a while on a Linux box and it did a quite good job but I needed more! Xavier’s script send the found “pasties” on the console. It is possible to dump the detected pasties by sending a signal to the process. Not always easy. That’s why I decided to go a step further and write my own script! The principle remains the same as the script in Python (why re-invent the wheel?) but I added two features that I found interesting:

  • It must run as a daemon (fully detached from the console) and started at boot time.
  • It must write its finding in a log file.

The next step sounds logical: If you have a log file, why not process it automatically: Let’s monitor within your SIEM! If you find information posted on, it could be very interesting to be notified (a great added-value for your DLP processes). My script generates Syslog messages and (optionally) CEF (“Common Event Format“) events which can be processed directly by an ArcSight infrastructure. Syslog messages can be processed by any SIEM or log management solution like OSSEC (see below). It is now possible to completely automate the process of detecting potentially sensitive leaked data and to generate alerts on specific conditions.

First install the script on a Linux machine. Requirements are light: a Perl interpreter with a few modules are required (normally all of them are already installed on recent distribution) and a web connectivity to If you are behind a proxy, you can define the following environment variable, it will be used by the script:

  # export HTTP_PROXY=

The script can be started with some useful options:

  Usage: ./ --regex=filepath [--facility=daemon ] [--ignore-case][--debug] [--help]
                       [--cef-destination=fqdn|ip] [--cef-port=<1-65535>] [--cef-severity=<1-10>]
  --cef-destination : Send CEF events to the specified destination (ArcSight)
  --cef-port        : UDP port used by the CEF receiver (default: 514)
  --cef-severity    : Generate CEF events with the very easy to process and can be specified priority 
                      (default: 3)
  --debug           : Enable debug mode (verbose - do not detach)
  --facility        : Syslog facility to send events to (default: daemon)
  --help            : What you're reading now.
  --ignore-case     : Perform case insensitive search
  --regex           : Configuration file with regular expressions (send SIGUSR1 to reload)

Once running, the script scans for newly uploaded pasties and search for interesting content using regular expressions. There is no limitation on the number of regular expressions (defined in a text file). To not disturb webmasters, the script waits a random number of seconds between each GET requests (between 1 and 5 seconds). There is only one mandatory parameter ‘–regex‘ which gives the text files with all the regular expressions to use (one per line). If one of the regular expressions matches, the following information will be sent to the local Syslog daemon:

  Jan 16 14:43:24 lab1[29947]: Sending CEF events to (severity 10)
  Jan 16 14:43:24 lab1[29947]: Loaded 17 regular expressions from /data/src/pastemon/pastemon.conf
  Jan 16 14:43:24 lab1[29947]: Running with PID 29948
  <time flies>
  Jan 16 15:57:48 lab1[29948]: Found in : CREATE TABLE (9 times) -- phpMyAdmin SQL Dump (1 times)

All matching regular expressions are listed with their number of occurrences. This can be easily processed by OSSEC using the following decoder:

  <decoder name="pastemon">

  <decoder name="pastemon-alert">
    <regex>Found in\.+ : (\.+) \(</regex>

The first regular expression is stored in the OSSEC “data” variable to be used as  conditions in rules. Here is an example: The rule #100203 will trigger an alert if some email addresses are leaked in (Note: This regular expression must be defined in the script configuration file!)

  <rule id="100203" level="0">
    <description>Data found on</description>

  <rule id="100204" level="7">
    <description>Detected email addresses on!</description>

If you have an ArcSight infrastructure, you can enable the CEF events support. The same event as above will be sent to the configured CEF destination and port:

<29>Jan 16 15:57:48 CEF:0|||v1.0|regex-match|One or more regex matched|10|request= msg=Interesting data has been found on
cs0=CREATE TABLE cs0Label=Regex0Name cn0=9 cn0Label=Regex0Count cs1=-- phpMyAdmin SQL Dump cs1Label=Regex1Name cn1=1 cn1Label=Regex1Count

To process the CEF events on ArcSight’s side, configure a new SmartConnector, a new UDP CEF receiver and the events should be correctly parsed:

Parsed events

(Click to enlarge)

That looks great! But the next question is: “What to look for on“. Well, it depends on you… Based on your organization or business, there are things that you can’t miss. Here is a list of useful regular expressions that I often use:

RegEx                                                                  Purpose
---------------------------------------------------------------------  -----------------------------------
company\.com                                                           Your company domain name
@company\.com                                                          Corporate e-mail addresses
CompanyName                                                            Company name
MyFirstName MyLastName                                                 Your full name
@xme                                                                   Twitter account
192.168.[1-3].[0-255]                                                  IP addresses ranges
anonbelgium                                                            Hackers groups
#lulz                                                                  Trending Twitter hashtags
-----BEGIN RSA PRIVATE KEY-----                                        Interesting data!
-- MySQL dump                                                          Interesting dumps!
belgium                                                                My country
city                                                                   My city
((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}      Credit cards

If you have interesting regular expressions or ideas, feel free to share!

Source is available here. As usual, this is provided “as is” without any warranty. Happy monitoring!

Post Navigation