Monitoring pastebin.com within your SIEM


For those who (still) don’t know, pastebin.com is a website mainly for developers. Its purpose is very simple: you can “paste” text on the website to share it with other developers, friends, etc. You paste it, optionally define an expiration date, choose whether it is public or private, and you are good. But for a while now, this online service has increasingly been used to post “sensitive” information like password or email lists. By “sensitive”, I mean “stolen” or “leaked” data. Indeed, pastebin.com allows anybody to use its services without any authentication, so it is easy to remain completely anonymous (if you submit data via proxy chains, Tor or any other tool which takes care of your privacy).

In big organizations, marketing departments or agencies learned how to use social networks a long time ago. They can follow what is said about their products and marketing campaigns. In my opinion, it is equally important to follow what’s posted about your organization on pastebin.com! Many people are looking for interesting data on pastebin.com from an offensive point of view. Let’s see how this can also benefit the defensive side.

For me, pastebin.com became an important source of information and I keep an eye on it every day. But, due to the huge amount of information posted every minute, it is impossible to process it manually. Of course, you can search for some keywords, but that is totally inefficient. At first, I grabbed and processed some HTML content using the classic UNIX tools. Later, I found a nice Python script developed by Xavier Garcia: it continuously checks for data leaks on pastebin.com using regular expressions. I kept it running for a while on a Linux box and it did quite a good job, but I needed more! Xavier’s script prints the found “pasties” on the console. It is possible to dump the detected pasties by sending a signal to the process. Not always easy. That’s why I decided to go a step further and write my own script! The principle remains the same as the Python script (why reinvent the wheel?) but I added two features that I found interesting:

  • It must run as a daemon (fully detached from the console) and be started at boot time.
  • It must write its findings to a log file.

The next step sounds logical: if you have a log file, why not process it automatically? Let’s monitor pastebin.com within your SIEM! If information about your organization is posted on pastebin.com, it could be very interesting to be notified (a great added value for your DLP processes). My script generates Syslog messages and (optionally) CEF (“Common Event Format”) events which can be processed directly by an ArcSight infrastructure. Syslog messages can be processed by any SIEM or log management solution like OSSEC (see below). It is now possible to completely automate the process of detecting potentially sensitive leaked data and to generate alerts on specific conditions.

First, install the script on a Linux machine. Requirements are light: a Perl interpreter with a few modules (normally all of them are already installed on recent distributions) and web connectivity to pastebin.com. If you are behind a proxy, you can define the following environment variable, which will be used by the script:

  # export HTTP_PROXY=

The script can be started with some useful options:

  Usage: ./ --regex=filepath [--facility=daemon ] [--ignore-case][--debug] [--help]
                       [--cef-destination=fqdn|ip] [--cef-port=<1-65535>] [--cef-severity=<1-10>]
  --cef-destination : Send CEF events to the specified destination (ArcSight)
  --cef-port        : UDP port used by the CEF receiver (default: 514)
  --cef-severity    : Generate CEF events with the specified priority
                      (default: 3)
  --debug           : Enable debug mode (verbose - do not detach)
  --facility        : Syslog facility to send events to (default: daemon)
  --help            : What you're reading now.
  --ignore-case     : Perform case insensitive search
  --regex           : Configuration file with regular expressions (send SIGUSR1 to reload)

Once running, the script scans pastebin.com for newly uploaded pasties and searches them for interesting content using regular expressions. There is no limit on the number of regular expressions (defined in a text file). To avoid disturbing the pastebin.com webmasters, the script waits a random number of seconds (between 1 and 5) between GET requests. There is only one mandatory parameter, ‘--regex’, which gives the text file containing all the regular expressions to use (one per line). If one of the regular expressions matches, the following information will be sent to the local Syslog daemon:

  Jan 16 14:43:24 lab1[29947]: Sending CEF events to (severity 10)
  Jan 16 14:43:24 lab1[29947]: Loaded 17 regular expressions from /data/src/pastemon/pastemon.conf
  Jan 16 14:43:24 lab1[29947]: Running with PID 29948
  <time flies>
  Jan 16 15:57:48 lab1[29948]: Found in : CREATE TABLE (9 times) -- phpMyAdmin SQL Dump (1 times)
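To make the matching step more concrete, here is a minimal Python sketch of how a "Found in" line like the last one above could be produced. It is not the Perl script itself: the function names (`load_patterns`, `count_matches`, `format_hits`, `polite_delay`) and the URL `http://paste.example/abc123` are hypothetical, and the real script may count or format matches differently.

```python
import random
import re


def load_patterns(lines, ignore_case=False):
    """Compile one regular expression per non-empty line (as in the --regex file)."""
    flags = re.IGNORECASE if ignore_case else 0
    return [re.compile(line.strip(), flags) for line in lines if line.strip()]


def count_matches(text, patterns):
    """Return (pattern source, occurrence count) for every pattern that matches."""
    hits = []
    for pattern in patterns:
        count = len(pattern.findall(text))
        if count:
            hits.append((pattern.pattern, count))
    return hits


def format_hits(url, hits):
    """Render hits in the same shape as the 'Found in' syslog line."""
    summary = " ".join(f"{name} ({count} times)" for name, count in hits)
    return f"Found in {url} : {summary}"


def polite_delay():
    """Pick a random pause of 1 to 5 seconds to sleep between two GET requests."""
    return random.randint(1, 5)


# Hypothetical pattern list and pastie content, for illustration only.
patterns = load_patterns(["CREATE TABLE", "-- phpMyAdmin SQL Dump", "@company\\.com"])
pastie = "-- phpMyAdmin SQL Dump\nCREATE TABLE a (id INT);\nCREATE TABLE b (id INT);\n"
hits = count_matches(pastie, patterns)
print(format_hits("http://paste.example/abc123", hits))
```

In a real loop you would call `time.sleep(polite_delay())` between two fetches, which is the part that keeps the request rate low.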

All matching regular expressions are listed with their number of occurrences. This can be easily processed by OSSEC using the following decoder:

  <decoder name="pastemon">
  </decoder>

  <decoder name="pastemon-alert">
    <regex>Found in\.+ : (\.+) \(</regex>
  </decoder>

The first matching regular expression is stored in the OSSEC “data” variable to be used as a condition in rules. Here is an example: rule #100204 will trigger an alert if some email addresses are leaked on pastebin.com (Note: this regular expression must be defined in the script configuration file!)

  <rule id="100203" level="0">
    <description>Data found on pastebin.com</description>
  </rule>

  <rule id="100204" level="7">
    <description>Detected email addresses on pastebin.com!</description>
  </rule>
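If your log management tool is not OSSEC, the same decoding logic is easy to reproduce. The following Python sketch is only an illustration of the idea: it splits a "Found in" line into the URL and the list of matched expressions, then raises a flag when the first expression is the email one. The names `parse_found_line` and `email_leak_alert`, the example URL and the `@company\.com` pattern are assumptions for the example, not part of the script or of OSSEC.

```python
import re

# "Found in <url> : <regex> (<n> times) <regex> (<n> times) ..."
FOUND_LINE = re.compile(r"Found in (?P<url>\S*) ?: (?P<hits>.+)$")
HIT = re.compile(r"(.+?) \((\d+) times\)")


def parse_found_line(line):
    """Return (url, [(regex, count), ...]) or None if the line is not an alert."""
    match = FOUND_LINE.search(line)
    if match is None:
        return None
    hits = [(name.strip(), int(count)) for name, count in HIT.findall(match.group("hits"))]
    return match.group("url"), hits


def email_leak_alert(line, email_regex=r"@company\.com"):
    """Mimic the rule idea: alert when the first matched expression is the email one."""
    parsed = parse_found_line(line)
    if not parsed or not parsed[1]:
        return False
    first_name, _count = parsed[1][0]
    return first_name == email_regex


# Hypothetical log line, modelled on the syslog output shown earlier.
sample = ("Jan 16 15:57:48 lab1[29948]: Found in http://paste.example/abc123 : "
          "@company\\.com (4 times) CREATE TABLE (9 times)")
print(parse_found_line(sample))
print(email_leak_alert(sample))
```

Like the decoder, the check only looks at the first expression in the line, so the order of the expressions in the configuration file matters.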

If you have an ArcSight infrastructure, you can enable CEF event support. The same event as above will be sent to the configured CEF destination and port:

<29>Jan 16 15:57:48 CEF:0|||v1.0|regex-match|One or more regex matched|10|request= msg=Interesting data has been found on
cs0=CREATE TABLE cs0Label=Regex0Name cn0=9 cn0Label=Regex0Count cs1=-- phpMyAdmin SQL Dump cs1Label=Regex1Name cn1=1 cn1Label=Regex1Count

To process the CEF events on ArcSight’s side, configure a new SmartConnector, a new UDP CEF receiver and the events should be correctly parsed:

Parsed events

That looks great! But the next question is: “What should I look for on pastebin.com?” Well, it depends on you… Based on your organization or business, there are things that you can’t miss. Here is a list of useful regular expressions that I often use:

RegEx                                                                  Purpose
---------------------------------------------------------------------  -----------------------------------
company\.com                                                           Your company domain name
@company\.com                                                          Corporate e-mail addresses
CompanyName                                                            Company name
MyFirstName MyLastName                                                 Your full name
@xme                                                                   Twitter account
192\.168\.[1-3]\.\d{1,3}                                               IP addresses ranges
anonbelgium                                                            Hackers groups
#lulz                                                                  Trending Twitter hashtags
-----BEGIN RSA PRIVATE KEY-----                                        Interesting data!
-- MySQL dump                                                          Interesting dumps!
belgium                                                                My country
city                                                                   My city
((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[47]\d{13}       Credit cards
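Before deploying a list like this, it is worth testing each expression against a few sample strings. Two pitfalls are easy to hit: a character class such as `[0-255]` matches a single character (0, 1, 2 or 5), not a number between 0 and 255, and a comma inside a class such as `[4,7]` is a literal comma. The short Python check below illustrates both; the stricter octet pattern is just one possible way to write it, and Python's regex flavour may differ slightly from Perl's for more exotic constructs.

```python
import re

# Naive version: [0-255] is a character class, not a numeric range.
NAIVE = re.compile(r"192\.168\.[1-3]\.[0-255]")

# One possible stricter octet: 250-255, 200-249, or 0-199.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
STRICT = re.compile(rf"\b192\.168\.[1-3]\.{OCTET}\b")

for candidate in ("192.168.1.77", "192.168.2.10", "192.168.3.256", "192.168.4.1"):
    naive_hit = NAIVE.search(candidate)
    strict_hit = STRICT.search(candidate)
    print(candidate,
          "naive:", naive_hit.group(0) if naive_hit else None,
          "strict:", strict_hit.group(0) if strict_hit else None)

# A comma inside a character class is matched literally.
print(bool(re.fullmatch(r"3[4,7]\d{13}", "3,1234567890123")))
print(bool(re.fullmatch(r"3[47]\d{13}", "3,1234567890123")))
```

The naive pattern misses `192.168.1.77` entirely and only captures a truncated `192.168.2.1` from `192.168.2.10`, which is why the stricter form is usually preferable.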

If you have interesting regular expressions or ideas, feel free to share!

Source is available here. As usual, this is provided “as is” without any warranty. Happy monitoring!


  1. Hi,

    has anyone experienced pastemon eating up memory resulting in swap being filled up? Or leaving zombie processes?

  2. Hi Xavier, I have a few questions.
    Does this actually work fine? You need to apply a change in the source code, in the URL parsing:

    "$p = '' . $p;" for "$p = '' . $p;"

    In parallel, I created a paste and searched for it with the site’s own search tool, and it was not found.

    In the comment about disabling proxies, I can’t see what needs to be removed or commented out. Please, can you tell me what to remove so that the script does not use proxies and goes out directly?

    To completely disable proxies, just remove or comment the following line in your pastemon.conf file:

  3. Hello Xavier
    When I run the script in debug mode, it displays the following message:

    DBI::db=HASH(0x201a878)->disconnect invalidates 1 active statement handle (either destroy statement handles or call finish on them before disconnecting) at ./ line 760.

    any idea?


  4. This looks great, but I only want it to send to ArcSight via CEF and can’t get past the SMTP server error.

  5. Hi, the dump function is dumping all data and not only the matched pastie.


  6. Hi Xavier,
    I’m running the script on a VM and trying to forward the CEF output to a Logger running on another VM on the same host. The problem is that I am only getting “cannot fetch ….” and “disabled unreliable proxy” messages in the syslog. Please let me know what I am doing wrong.

  7. Hi, I’m testing your pastemon and everything is OK, but I have a question: in the database, when a keyword is found, the “matched” column is blank. How can I search the database for a pastie that matched a given regular expression? Thanks

  8. I’m testing out pastemon for production use and my end goal is to have it post to a WP site, but while running in debug I keep receiving the error: ‘WordPress configuration disabled: WordPress::XMLRPC not installed’. ‘xmlrpc.php’ is active and available on my site, I do not have ‘http://’ or ‘/’ on my URL within the .conf, I have triple checked my username and password, and verified the category is available within my WP site. Can you offer any help in this area? Is there a dependency I’m missing?

    Also when using the default proxies.conf I’m constantly receiving: “Cannot fetch 500 Can’t connect to (timeout)
    +++ Disabled unreliable proxy (956 active proxies)”

    Any help would be greatly appreciated.

  9. Thank you for your fast response. It seems like I have my wires crossed, but I don’t understand how I can disable proxy support. Any advice? Thanks in advance!

  10. To use proxied connections, you must provide a list of proxies (format is [IP|FQDN]:port, one per line). Proxies will be selected randomly and removed if not available. It’s up to you to build a reliable list of proxies…

  11. Hi Xavier,
    much appreciation for your effort in this script from Austria.
    I just updated to the newest version, but the proxy list doesn’t work with the provided entries. I get a lot of log entries saying “Disabled unreliable proxy…”.
    Any suggestions?

  12. Andriy, Perl supports the SOCKS protocol. I have installed a tor client locally and use http_proxy=socks://


  13. 本田技研工業株式会社

    you can try with these

  14. Hello Benny,
    Output supports non-roman characters:
    open(DUMP, ">:encoding(UTF-8)", "$dumpDir/$pastie.raw")
    The regex file is opened and processed as a regular file. Honestly, not tested! Do you have some examples? I will test.

  15. For the file with regular expressions, can you use non-roman characters? Umlaut? Cyrillic? Arabic? etc??

  16. HI! Very useful tool!

    Does it support Tor?
    I’ve been having trouble with the proxies I use, because they often go offline.

  17. Hi Xavier,

    I noticed that the source code of pastebin.com has been changed… that means that no pastie can be fetched.


  18. Hi guys,

    if you have problems behind a proxy and “export” doesn’t work you can also replace the following line:

    # $ua->env_proxy;
    $ua->proxy(['http'], '');

    It is used twice in the script. 😉


  19. Many Thanks Xavier! I’ve modified the script to monitor our sensitive data posted on pastebin. OSSEC is once again my best friend!!!


  20. This is maybe my fault… Are you sure you are using HTTP_PROXY (upper case)? This is important on case-sensitive systems like UNIX. I tested again here and it works!

  21. The only prob I have is that it doesn’t work behind a proxy. I used the export command but the script doesn’t use the proxy trying to connect to pastebin.

  22. Nice tool. One thing that might help others spare a huge amount of time:
    if you use --debug the tool is _NOT_ writing to any logfile.


  23. Thanks for your effort, Xavier! And indeed since the index is initialized with 1, the Device Custom parameters come in.

  24. Indeed, the CEF dictionary mentions 1-6 custom fields. I fixed this in the script (this will be available after the next commit). Thank you for your tests!

  25. I think I know why I did not see the matched expressions:

    Jan 30 14:18:37 CEF:0|||v1.3|regex-match|One or more regex matched|3|request= msg=Interesting data has been found on cs0=vodafone cs0Label=Regex0Name cn0=1 cn0Label=Regex0Count

    It starts indexing at 0 not at 1. I changed line 384 to “my $i = 1;”. Hope it works now…

  26. I can confirm Josh’s issue, I also receive some “it seems you are requesting a little bit too much from Pastebin”. I now doubled the wait timers (i.e. random(3)*2 and random(5)*2) and I am curious to see if it persists…


  27. Hi Xavier,

    thanks for picking this up.

    What I mean with the CEF comment is that I cannot see the matching regexes in ArcSight. Now I reviewed your script and can see that it puts the matches in the event. I now configured ArcSight to preserve the raw event so that I can see what the script actually submits to see what happens…


  28. Hello Heiko,

    Thank you for the suggestion/report. I just committed release 1.3 of my script:
    - You can now define your own PID file (--pidfile)
    - A sample of the data can be printed (--sample)

    About your comment on the CEF event, the matching regexes and their counts are already reported using deviceCustomStringX and deviceCustomIntegerX. Or did I misunderstand your remark? Feel free to give me more details.

  29. Hi,

    great idea! I implemented the script, forwarding CEF events to ArcSight. I’m curious to see what it catches.

    However, I came across two issues:

    1. The script tries to write the daemon’s pid to /var/run. Because the script runs as a normal user, this does not work (at least on my machine). I changed this in the script to /tmp, but I would prefer if it was either default or configurable.

    2. It would be great to have the matched pattern in the CEF event, e.g. in Message or in requestContext. Then one could tell at a glance from the ArcSight console what was found where.

    3. Another idea would be to put a piece of the pastie, e.g. the line containing the matched pattern, in a CEF field like deviceCustomString1


  30. Hello Nicolas,
    Thanks for the idea! I just published a new version of the script which implements this feature.
    You can now define rules like “regex1 _EXCLUDE_ regex2”. This could help to get rid of false positives. A good example is looking for countries: If you look for “belgium”, there is a good chance that you will catch HTML code with list of countries. Using “belgium _EXCLUDE_ belize” (Belize is the next country in alphabetical order), you won’t be notified.

  31. So I’ve been trying this and it appears that I’ve gotten my IP address blocked on Pastebin. I guess trying it every 30 seconds was a bit overkill.

  32. Hi Sertan,
    That’s why I linked the script with OSSEC! I prefer receiving emails in a unified format from ONE tool instead of being flooded by thousands of scripts output. Thanks for sharing your script too!

    PS: Your idea to fake the User-Agent is good btw!

  33. Xavier thanks for sharing the tool.

    Sending a CEF event to your SIEM is cool, but I would also recommend adding a mail alert functionality similar to

    Also from my experience, though not realtime, 2-3 minutes polling interval is good for fetching all recent pasties.

  34. Adrian,
    Good remark! I just uploaded v1.1 which now has a ‘--dump’ option.
    You can specify a directory where pasties matching a regex will be saved (raw). This will allow you to check pasties which expired. Thank you for your comment!

  35. Excellent tool, one recommendation only, to add the ability to exclude words. For example posts that contain the word ‘summer’ but not the ones with the word ‘house’ in the same content.

  36. Forgive me if I’ve misunderstood, but does this script download and store the matches it finds? If so, can you make it clear in the article where they are stored; if not, can you add such functionality? It seems to me that this script is only any good if the paste doesn’t have a time limit.
