Monday, May 13, 2013

The Kelihos Botnet: Spam Data Mine + i2 Analyst Notebook

On April 17th & 18th, 2013, we blogged about spammers who were using the Boston Marathon Explosion and the Texas Fertilizer Plant Explosion to dramatically increase the size of their botnet. The botnet in question was the Kelihos botnet, and the primary purpose of the malware being delivered in that two day campaign was to cause newly infected computers to also join the botnet as additional spam-sending computers. Malcovery Security, where I serve as Chief Technologist, put out a free copy of their daily malware "Top Threats Today" report because the prevalence of that spam was nearly 80 times the level that we normally consider to be an "outbreak" of malicious activity.

So, what have the criminals behind Kelihos been doing with all of their new spam-sending power? Primarily they are sending Pump and Dump spam.

Pump & Dump

A Pump and Dump spam campaign is an email that claims a particular stock symbol is going to have a large increase in value in the near future and encourages investors to jump in while the price is still low. These are usually sub-penny stocks where the criminals have arranged to own millions of shares of an existing publicly traded "pink sheets" company. They then issue false press releases about new business developments, accompanied by a spam campaign. We've seen stocks rise from 1/5th of a cent to 30 or 40 cents, or on rare occasions $1 per share, before the criminal dumps his millions of shares for a 10,000% profit. These attacks often coincide with brokerage phishing attacks where a stolen Fidelity account (or something like it) is used to buy the initial shares, or to buy many shares to give the appearance of high market activity in the junk stock to encourage wary investors.

Over the weekend, the Kelihos Pump & Dump target is "GT RL" which they claim is a small movie studio that is primed for an acquisition. In the spam emails they tell the story of an investor who owned 39% of Lions Gate and earned $1 Billion USD when the studio was acquired by a larger organization. GTRL is "Get Real USA, Inc." which claims last summer to have had "Academy Award Nominee Dean Wright" join their board of advisors, according to their website, which is denying any involvement in the current spam run.

On March 4, 2013, GTRL opened the market day trading at $0.0052. Friday it closed at $0.01 on a volume of 1.9 million shares traded. So someone is certainly buying shares!

Why do we care? Primarily because it has been one of the top spam-sending botnets ever since the Boston explosion spam. Yesterday, May 11, we saw a RIDICULOUS number of spam subject lines, all touting this penny stock.

Spam Data Mine

Longtime readers will be familiar with the UAB Spam Data Mine. In December, we licensed the Spam Data Mine technology to Malcovery, which uses the Malcovery Spam Data Mine to identify Today's Top Threats for its customers, based on techniques and methodologies developed at UAB over the past six years. The Spam Data Mine receives in the neighborhood of a million messages per day, which we "parse" to extract key features that are stored in a PostgreSQL database. As we look at the top subjects recently, they have been dominated by Pump & Dump spam. For example, here are some of yesterday's Top Subject lines related to Stock:

  1267 | It is Our New Alert! This Low Float Monster is a Must See
  1203 | You won't beleive your eyes!
  1123 | This Stock is Starting to Heat Up
  1109 | Perfect Time To Add!
  1103 | Our Featured Gem
   804 | It`s official, this stock is a 100% perfect buy!
   621 | There should be outrage against bailouts!
   617 | Things to Know Before Your Next Trade
   574 | Closing out the week with Mega Gains!
   534 | This Stock is moving up as it should
   526 | Exciting Trade Idea Details Inside!
   503 | New Pick Coming Tomorrow, This is a Must Read!
   496 | This Stock is well positioned for another monster run!
   494 | Spectacular bouquets, only $19.99!
   478 | Stocks on watch for mega gains this week!
   460 | This Company IS RED HOT!!!
   458 | This Company is on Immediate Alert! This Bull is Positioning for a Major Run

If we just limit our search to spam that contained the word "Stock" or "Company" in the spam, we had more than 175,000 emails yesterday, using 1,976 subject lines! But how would we know the other subject lines in the campaign? "Perfect Time To Add!" doesn't have the word "Stock" or "Company" in the subject. There is also no guarantee that all of the messages containing these words are part of this spam campaign.

To get a better handle on this, we are going to do a series of queries to build a candidate pool, and then use IBM's i2 Analyst's Notebook to perform what we call "Visual Pre-Clustering" to help us determine some ground truth and to help us screen out some possible outliers. If there are several unrelated botnets all sending Pump and Dump spam, the clusters should be easily identifiable using this technique, while if there are other spam messages unrelated to Pump and Dump being sent by Kelihos, those should also be easily identifiable.

First, let's pile up our data:

Spam Queries to Build a Candidate Data Set

To begin, I'm going to collect a list of IP addresses of computers that sent me spam on May 11, 2013 that used the word "stock" or "company" in their spam message. This query creates a temp table called "may11stockip" that contains the list of IP addresses that sent me those messages and a count of how many times each was used.

spam=> select count(*), sender_ip into may11stockip from spam where (subject ilike '%stock%' or subject ilike '%company%') and receiving_date = '2013-05-11' group by sender_ip order by count desc;
This gave me 27,425 unique addresses. Our next step is to ask the Spam Data Mine for other subjects that were sent by that group of IP addresses. While it is true that I could build one massive query to do all of this work, we've found over time that the temporary tables can be useful to have preserved, and using the temporary tables actually speeds up the final result.

spam=> select count(*), subject into may11stocksub from spam a, may11stockip b where a.sender_ip = b.sender_ip and receiving_date = '2013-05-11' group by subject order by count desc;

This generated 6,420 spam subject lines! Far more than the 1,976 that contained the words "stock" or "company"! In fact, given the size of the botnet, it is likely that I received some of these subjects from computers that DID NOT send any message containing the word "stock" or "company", so we'll run one more iteration. After dropping the "may11stockip" table, we rebuild it from any computer that sent a subject found in the new temp table, may11stocksub.

spam=> select count(*), sender_ip into may11stockip from spam a, may11stocksub b where a.subject = b.subject and receiving_date = '2013-05-11' group by sender_ip;

Now we have 93,538 candidate IP addresses to consider as possible Kelihos nodes!
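
The back-and-forth between IP addresses and subjects can be easier to follow on a tiny example. Here is a minimal Python sketch of the same three steps, assuming a toy list of (sender_ip, subject) rows. The rows, the documentation-range IP addresses, and the function name expand_candidates are all invented for illustration; the real work was done with the SQL queries shown above.

```python
# Toy stand-in for one day of the spam table: (sender_ip, subject) rows.
# Every address and row here is invented for illustration.
rows = [
    ("192.0.2.1", "This Stock is Starting to Heat Up"),
    ("192.0.2.1", "Perfect Time To Add!"),
    ("192.0.2.2", "Perfect Time To Add!"),
    ("192.0.2.2", "Our Featured Gem"),
    ("192.0.2.3", "Spectacular bouquets, only $19.99!"),
]

KEYWORDS = ("stock", "company")


def expand_candidates(rows, keywords=KEYWORDS):
    """Seed IPs by keyword, widen to their subjects, then widen to IPs again."""
    # Step 1: IPs that sent a subject containing a keyword (case-insensitive,
    # like the ilike '%stock%' test in the first query).
    seed_ips = {ip for ip, subj in rows
                if any(k in subj.lower() for k in keywords)}
    # Step 2: every subject sent by one of the seed IPs.
    subjects = {subj for ip, subj in rows if ip in seed_ips}
    # Step 3: every IP that sent any of those subjects.
    candidate_ips = {ip for ip, subj in rows if subj in subjects}
    return seed_ips, subjects, candidate_ips


seed_ips, subjects, candidate_ips = expand_candidates(rows)
print(sorted(seed_ips))
print(sorted(subjects))
print(sorted(candidate_ips))
```

In this toy data, 192.0.2.2 never used the word "stock" but is pulled into the candidate pool because it shares "Perfect Time To Add!" with a seed IP, while the bouquet spammer stays out.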

Our last iteration in building our "Pile of Data" to hand to i2 is to create relationships between those 93,538 candidate IP addresses and all of the subjects they used. Our goal is to have a nice table that can be imported into i2 Analyst's Notebook.

spam=> select count(*), a.sender_ip, subject into may11stockpairs from spam a, may11stockip b where receiving_date = '2013-05-11' and a.sender_ip = b.sender_ip group by a.sender_ip, subject order by count desc;
This generates 282,763 pairs of "sender_ip x subject".
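
For readers without a PostgreSQL instance handy, the pairs table is simply a count of how often each (sender_ip, subject) combination occurred. A rough Python sketch follows, assuming an invented list of messages. The post does not say which file format was used for the i2 import, so the CSV layout below (count, sender_ip, subject) is only an assumption.

```python
import csv
import io
from collections import Counter

# Invented sample of one day's messages as (sender_ip, subject) tuples.
messages = [
    ("192.0.2.1", "Success Kit"),
    ("192.0.2.1", "Success Kit"),
    ("192.0.2.1", "Business Startup"),
    ("192.0.2.2", "Success Kit"),
]

# Equivalent of "group by sender_ip, subject" with count(*).
pair_counts = Counter(messages)

# Write the pairs, highest count first, to an in-memory CSV.
# The column layout is an assumption, not a documented i2 format.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["count", "sender_ip", "subject"])
for (ip, subject), n in sorted(pair_counts.items(),
                               key=lambda kv: (-kv[1], kv[0])):
    writer.writerow([n, ip, subject])

csv_text = buf.getvalue()
print(csv_text)
```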

Visual Pre-Clustering with i2 Analyst's Notebook

From these 282,763 pairs, we're going to let i2 do all the hard work. Here's the basic idea. Let's say we have 4 computers, A, B, C, and D, and each of these computers sent emails from the set M1, M2, M3, M4, M5, M6, M7. For the sake of argument, we are going to say that there is NO CHANCE that two computers would have sent the same email unless they were CONTROLLED by the same criminal spammer. If we can demonstrate which computers sent the same messages, we can then determine which computers were controlled by the same criminal.

A - M1
A - M2
A - M3
B - M4
B - M5
C - M1
C - M6
C - M7
D - M1
D - M6 
D - M7
If we were to draw a picture of that, just as you see it on the list, it might look like this:

But if we allow i2 to give a more intuitive layout, it would look like this, which makes it very plain that Computers A, C, and D are sending "the same" emails, while Computer B is sending "different" emails.
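
The grouping in that toy example can also be checked without a chart. The Python sketch below treats the eleven rows as a graph with computers on one side and messages on the other, then finds its connected components. This is only a stand-in: i2's layout algorithm is not described in this post, and connected components is a cruder technique that happens to recover the same answer on this small list. The function name components is my own.

```python
from collections import defaultdict

# The eleven (computer, message) rows from the list above.
pairs = [
    ("A", "M1"), ("A", "M2"), ("A", "M3"),
    ("B", "M4"), ("B", "M5"),
    ("C", "M1"), ("C", "M6"), ("C", "M7"),
    ("D", "M1"), ("D", "M6"), ("D", "M7"),
]


def components(pairs):
    """Return the connected components of the computer/message graph."""
    adj = defaultdict(set)
    for computer, message in pairs:
        # Tag each node with its kind so names cannot collide.
        c, m = ("computer", computer), ("message", message)
        adj[c].add(m)
        adj[m].add(c)

    seen = set()
    groups = []
    for start in list(adj):
        if start in seen:
            continue
        # Depth-first walk from an unvisited node collects one component.
        seen.add(start)
        stack = [start]
        group = set()
        while stack:
            node = stack.pop()
            group.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(group)
    return groups


groups = components(pairs)
computer_groups = sorted(
    sorted(name for kind, name in group if kind == "computer")
    for group in groups
)
print(computer_groups)  # [['A', 'C', 'D'], ['B']]
```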

One Day of Kelihos in i2 Analyst's Notebook

You might say to yourself, "That didn't seem to add much value?" But now imagine that there are 282,763 rows on your list instead of eleven, and that instead of having four computers you have 93,538 and instead of having seven email subjects you have 7,226.

Here's the chart you get when you do that!

or with some labels on it:

Cluster A
The cluster labeled as "A" is our main "Stock Pump & Dump" cluster. All of our "main" Stock and Company subjects are in the heart of that cluster, with many related computers coming from them.

Cluster B
This cluster is primarily formed of spam for "Work at Home" scams. Some sample subjects from this group include:

Ready to be your own boss?
Business Startup
Your second chance in life just arrived
Sick of paying bills?
Wanna pay off your debts?
Stop just barely making ends meet every month
Make Money Online
Wanna Learn how to make money online?
Success Kit
Ill show you the road to early retirement
Successful Business
New Income
Wanna make up to $6500/month?
Job openings in your area!
At Home Income
A living online is easier than you think
Work From Home Jobs Available!

One slight "False join" is linking "A" and "B" and has to be manually eliminated. "Empty Subject" is the only subject in Cluster H, hidden in the midst of the Corpus Callosum that joins A and B. After discovering this, we manually deleted that subject from the chart and re-ordered the chart, after first removing "disjointed" clusters that had no tie to the core, such as Cluster F and the others at the top, and many of the "Fan-subclusters" such as Cluster I that surrounded Cluster A.
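
We did this cleanup by hand in i2, but the idea can be sketched in a few lines of Python: drop the subject that creates the false join, then keep only the pairs still reachable from a known core subject. The toy pairs, the IP addresses, the empty string standing in for "Empty Subject", and the function name core_pairs are all assumptions made for illustration.

```python
from collections import defaultdict

# Invented (sender_ip, subject) pairs. The empty string plays the role
# of the "Empty Subject" node that falsely joins otherwise separate groups.
pairs = [
    ("192.0.2.1", "This Stock is Starting to Heat Up"),
    ("192.0.2.1", "Success Kit"),
    ("192.0.2.2", "Success Kit"),
    ("192.0.2.2", ""),
    ("192.0.2.3", ""),
    ("192.0.2.3", "Gold Watches"),
    ("192.0.2.4", "Unrelated Company Newsletter"),
]


def core_pairs(pairs, drop_subjects, core_subject):
    """Drop the given subjects, then keep pairs connected to core_subject."""
    kept = [(ip, s) for ip, s in pairs if s not in drop_subjects]

    adj = defaultdict(set)
    for ip, s in kept:
        adj[("ip", ip)].add(("subject", s))
        adj[("subject", s)].add(("ip", ip))

    start = ("subject", core_subject)
    if start not in adj:
        return []

    # Walk outward from the core subject to find everything tied to it.
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)

    return [(ip, s) for ip, s in kept if ("ip", ip) in seen]


cleaned = core_pairs(pairs, drop_subjects={""},
                     core_subject="This Stock is Starting to Heat Up")
for ip, subject in cleaned:
    print(ip, "|", subject)
```

With the empty subject removed, 192.0.2.3 no longer has any tie to the core and falls away along with the never-connected 192.0.2.4, while the stock and work-at-home pairs remain joined through the IP that sent both.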

The "Cleaned Up" version of the chart still makes it abundantly clear that THOUSANDS of IP addresses that are part of the "Stock Pump and Dump" cluster on the left are ALSO part of the "Work at Home" (B) and "Pharmacy Express" (C,D,E) clusters on the right. The Cleaned Up chart, shown below, still has 91,833 IP Addresses and 6,242 Email Subjects, with 277,747 unique "pairs" between them.

IP addresses closer to the right have primarily "Work at Home" spam subjects, such as

 count |                 subject                  
     2 | TODAY`S TRADING IDEA IS `Advanced`
     1 | Work for Moms
     1 | It moves up nicely on heavy accumulation
     1 | Job Hiring is at an all time low...
     1 | Sick of paying bills?
     1 | Business Startup
(6 rows)


 count |               subject               
    13 | Successful Business
     1 | Sick of not making ends meet?
     1 | Wanna make up to $6500/month?
     1 | Job Hiring is at an all time low...
     1 | What kind of investor are you?
(5 rows)

IP addresses closer to the left have primarily "Stock Pump and Dump" spam subjects, such as

 count |                                subject                                 
     5 | This Company is Ready to Run
     5 | It is one to watch this week!
     4 | Analysts gives this stock a "STRONG SPECULATIVE BUY" rating
     4 | New Play Coming
     3 | This Company has a history of Huge Rallies, on verge of another Rally?
     3 | New Wild Breakout Pick Coming TONIGHT!
     3 | The NEW TRADE ALERT
     3 | A Potential Mover from Penny Stock
     3 | It Is Wasting Little Time Making Waves
     2 | This Company Ends Last Week Strong
     2 | Get Ready For The Hottest Gold Pick On The Planet!
     2 | Our New Blazin Sub-Penny Alert!
     1 | Be Ready
     1 | Success Kit
     1 | This Company exploded in volume today
     1 | Second chance for traders who have `calmed down`...
     1 | Sick of a dead end job?
     1 | We`ve Got A Bouncer On Our Hands!
     1 | This Stock Signs Agreement With Reputable PR Agency
     1 | Back to work week will get this play really going!
(20 rows)

The "Bumps" that circle cluster B are groups of IP addresses that share "some but not all" of the subjects found in Cluster B. There are many IP addresses that we saw only once or twice -- because of their low volume, they do not appear as "fully meshed" as the IP addresses in the "core" of Cluster B. A couple examples will demonstrate this.

In the core of Cluster B we see thousands of IP addresses that were used for at least 2 or 3 Work at Home messages:

     2 | Successful Business
     1 | Wanna make up to $6500/month?
     1 | Income At Home
     1 | Success Kit
     1 | Stop just barely making ends meet every month

     2 | Success Kit
     1 | Wanna make up to $6500/month?
     1 | Income At Home

     2 | Work for Moms
     1 | Replace your nine to five...
     1 | Business Startup
     1 | Success Kit
     1 | Make Money Online
     1 | Your second chance in life just arrived
Small "micro clusters" of IP addresses that sent both Pharma spam (Cluster "C" or "D") and one or more of the Work at Home subjects fill the ridge between Cluster "B" and Clusters "C, D, E":

     1 | ð°ð°ð°Cialis (30 pills 20mg) USD 91.50 & Viagra (30 pills 100mg)  USD 81.90ð°ð°ð°
     1 | ð°ð°ð°Viagra (30 pills 100mg)  USD 81.90 & Cialis (30 pills 20mg) USD 91.50 ð°ð°ð°
     1 | Your second chance in life just arrived
     1 | ð°ð°ð°Cialis (30 pills 20mg) USD 91.50 & Viagra (30 pills 100mg)  USD 81.90ð°ð°ð°
     1 | Replace your nine to five...
     1 | ð°ð°ð°Viagra (30 pills 100mg)  USD 81.90 & Cialis (30 pills 20mg) USD 91.50 ð°ð°ð°

Here are two example IP addresses from a single "Bump" on the left edge of Cluster B.

     1 | Stop just barely making ends meet every month
     1 | Stop just barely making ends meet every month

Cluster C, D, and E
These are Viagra Spam clusters. C & D are two very popular subjects, both resolving to "Pharmacy Express" websites. The small cluster "E" is formed of IP addresses that sent spam for both Cluster C and Cluster D.

Cluster F & Friends
Cluster F and the neighboring small clusters at the top of the chart have been included primarily through a coincidental usage of the word "Company" in their subject lines. F, for example, is a well-known spammer of the type the industry calls a "Snowshoe spammer." They rotate through hosted data centers, paying their bills for nice hardware to be used for spamming with stolen credit cards. When they get thrown out of one data center for spamming, they move to the next.

Cluster G & J
These clusters are also primarily joined through the coincidental use of the word "Company" in the subcluster subjects.

Cluster I
There are many "Fan-shapes" around the edges of Cluster A. Looking at Cluster I as an example, there are 36 subjects in that "fan cluster" all related to "Replica goods":

A Rolex replica watch
Beautiful quartz, water-resistant Replica watches
Box Sets
Gold Watches
Gucci Bags

Only a single (subject x sender_ip) pair links this fan-cluster to the main Cluster A. The subject "replica watches! rolex, patek philippe, vacheron constantin and others!", which was attached to dozens of IP addresses in the fan-cluster, is also attached to the IP address "". That IP address also sent us two messages with the email subject "This Stock Move Starting!".

154 IP addresses in Cluster A also used the subject "This Stock Move Starting!"
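
A quick way to see how thin such a link is: list the IP addresses that sent at least one subject from the fan-cluster and at least one subject from the core. The Python sketch below does that on invented data. The IP addresses and the helper name bridging_ips are hypothetical, and only a handful of subjects are included.

```python
# Invented (sender_ip, subject) pairs for a small fan-cluster and the core.
pairs = [
    ("192.0.2.10", "Gold Watches"),
    ("192.0.2.11", "Gold Watches"),
    ("192.0.2.11", "Gucci Bags"),
    ("192.0.2.12", "Gucci Bags"),
    ("192.0.2.12", "This Stock Move Starting!"),
    ("192.0.2.20", "This Stock Move Starting!"),
    ("192.0.2.21", "This Stock Move Starting!"),
]

fan_subjects = {"Gold Watches", "Gucci Bags"}
core_subjects = {"This Stock Move Starting!"}


def bridging_ips(pairs, subjects_a, subjects_b):
    """IPs that sent at least one subject from each of the two sets."""
    ips_a = {ip for ip, s in pairs if s in subjects_a}
    ips_b = {ip for ip, s in pairs if s in subjects_b}
    return ips_a & ips_b


bridges = bridging_ips(pairs, fan_subjects, core_subjects)
print(sorted(bridges))  # ['192.0.2.12']
```

When the result is a single address, as here, removing that one link is enough to detach the whole fan from the core.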

To focus on the core activity, disconnected subclusters, such as F, and "fan-clusters" such as I are removed from the chart, and the layout is performed again.