Thursday, September 09, 2010

Random Pseudo-URLs Try to Confuse Anti-Spam Solutions

For the past couple of weeks things have NOT been normal for the Spam & Phishing folks at the UAB Computer Forensics Research Laboratory. The Phishing Operations team has been inundated by URLs being reported to them as "potential phish" that are not only not phish, they are not even URLs!

Here's a handful of the recently received URLs (in the past 5 minutes or so):

Without a "pattern", it's hard to "mass-whack" the URLs, and so they keep ending up in our "Phishing-URLs-to-be-checked" list. The problem is that MOST of those domains actually exist!

is owned by "Future media architects"
offers free domain names (so we see them sometimes on phish normally!)
is a research & development incubator
is a tourism site for the city in Spain.
claims to be the Online Finance Company.
is a redirector to
is the Patricia Seybold Group
is the Scientific Applications & Research Association
is a webcam chatting service
is a parked domain on FirstLook.

Only "" is not "live" somewhere.

The UAB Spam Data Mine has been seeing similar things. We're accustomed to spammers creating a "wildcard" DNS entry for a host, and then they can make up any random hostname they want and use 1,000 different machine names to refer to one Viagra sales website. We actually deal with that quite effectively, because once we have seen five hostnames for the same domain, we create a random hostname for that domain ourselves. If the contents of our random hostname give us the same results as a spammed hostname for that domain, we mark all of them as being related and stop checking the rest. When "normal" spammers are using many hostnames, we only see a few domains using this technique per 15-minute work period, so even though, for instance, on August 8th we saw 2,670,602 hostnames in spam (counting repeated hostnames), it wasn't such a big deal.
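The wildcard check described above can be sketched roughly as follows. This is only an illustration, not the lab's actual code: the `WildcardDetector` class, the caller-supplied `fetch` function, and the use of a SHA-256 hash to compare page contents are all assumptions. Only the threshold of five hostnames comes from the description.

```python
import hashlib
import secrets
from collections import defaultdict

WILDCARD_THRESHOLD = 5  # from the post: probe once five hostnames share a domain


def fingerprint(body: bytes) -> str:
    """Reduce page contents to a comparable digest (an assumed comparison method)."""
    return hashlib.sha256(body).hexdigest()


class WildcardDetector:
    def __init__(self, fetch):
        # fetch(hostname) -> bytes is a hypothetical caller-supplied downloader
        self.fetch = fetch
        self.seen = defaultdict(set)  # domain -> spammed hostnames seen so far
        self.wildcard = set()         # domains already confirmed as wildcards

    def observe(self, hostname: str, domain: str) -> bool:
        """Return True if this hostname still needs to be checked on its own."""
        if domain in self.wildcard:
            return False
        self.seen[domain].add(hostname)
        if len(self.seen[domain]) < WILDCARD_THRESHOLD:
            return True
        # Make up a hostname nobody spammed and compare what it serves.
        probe = f"{secrets.token_hex(8)}.{domain}"
        if fingerprint(self.fetch(probe)) == fingerprint(self.fetch(hostname)):
            self.wildcard.add(domain)
            return False
        return True
```

Once a domain lands in `wildcard`, every later hostname under it is skipped without another fetch, which is why a handful of wildcard domains per work period is cheap to handle. A domain that never matches is re-probed on each new hostname in this sketch; a real system would probably cap that.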

The problem with this new spam is that rather than having one destination per 15-minute work period that has randomization for the domain, we may have thousands in a single 15-minute work period.

On September 5th we saw 450,976 UNIQUE hostnames advertised in spam! To put that in perspective, from August 1 until August 28 the highest single day unique hostname count we had was 38,452. On August 29th, we had 391,594 unique hostnames advertised in spam! A tenfold increase in a single day! And it's stayed there. We've had more than 370,000 every single day in September.

Or did we? My anti-spam friends RedDwarf and SiL were discussing this recently, over on the "InBoxRevenge" forums, and they mentioned that another lab had seen a dramatic jump in unique URLs beginning about August 26th. It's hard for me to see the same jump in unique URLs, because we see millions of URLs per day, and the number hasn't changed so dramatically -- but when we look at unique hostnames instead, we do see an enormous jump!

This corresponds to the second problem we observed in the lab. In our multi-phase spam parsing, phase two is "resolve the domains to IP addresses and store that data in a database." We started experiencing a backlog in that phase that was brought to my attention on September 1st. We hadn't put the two pieces together until last night when someone called attention to the RedDwarf posts on this topic.
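To see why a jump in unique hostnames backs up this phase, here is a minimal sketch of a "resolve and store" step. The table layout, the `resolve_and_store` name, and the use of SQLite are assumptions for illustration; the post does not describe the lab's actual schema or database. The only point is that one DNS lookup is needed per hostname not yet stored, so the work grows with the number of unique hostnames rather than with the number of messages.

```python
import socket
import sqlite3


def resolve_and_store(hostnames, conn, resolver=socket.gethostbyname):
    """Resolve each not-yet-stored hostname once; return how many lookups ran."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resolutions (hostname TEXT PRIMARY KEY, ip TEXT)"
    )
    looked_up = 0
    for host in set(hostnames):  # repeated hostnames cost nothing extra
        already = conn.execute(
            "SELECT 1 FROM resolutions WHERE hostname = ?", (host,)
        ).fetchone()
        if already:
            continue
        try:
            ip = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            ip = None    # store the failure so the host is not retried
        conn.execute("INSERT INTO resolutions VALUES (?, ?)", (host, ip))
        looked_up += 1
    conn.commit()
    return looked_up
```

The `resolver` parameter is there so the function can be exercised without touching the network, for example by passing a stub that returns a fixed address.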

I ran another query to count how many times we have seen each unique DOMAIN name -- not HOST name -- and the tail goes "to infinity and beyond" on this chart!

In the first seven days in September, we saw 149,964 unique DOMAIN names used in spam!!!
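The difference between counting HOST names and DOMAIN names can be illustrated with a short sketch. The `naive_domain` helper below simply keeps the last two labels of a hostname, which is an assumption made for brevity: it gets suffixes such as ".co.uk" wrong, and a real pipeline would need a public-suffix list. The sample hostnames in the usage note are made up.

```python
from collections import Counter


def naive_domain(hostname: str) -> str:
    """Keep the last two labels; an assumption that ignores suffixes like .co.uk."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def count_hosts_and_domains(hostnames):
    """Return (sightings per hostname, sightings per domain) as Counters."""
    hosts = Counter(h.lower().rstrip(".") for h in hostnames)
    domains = Counter()
    for host, n in hosts.items():
        domains[naive_domain(host)] += n
    return hosts, domains
```

For example, `count_hosts_and_domains(["a.shop.example", "b.shop.example", "www.other.example"])` yields three unique hostnames but only two unique domains. With ordinary wildcard spam the domain count stays small while the hostname count balloons; here both are enormous.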

I tried to chart the distribution of domain names, but the chart ends up looking like I've shown you an empty chart because the tail is SO long and the drop-off is so dramatic. I'll try it as a table instead:
Number of domains    Times seen in spam
30                   25,000+
132                  10,000 - 24,999
1,051                1,000 - 9,999
5,818                100 - 999
21,417               10 - 99
13,907               5 - 9
39,580               2 - 4
68,030               1
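A table like this can be produced from per-domain sighting counts with a few lines of code. The sketch below is an assumption about how one might do it, not the query actually used; the bucket boundaries are copied from the table, and `bucket_domains` is a hypothetical name.

```python
from collections import Counter

# (label, lower bound) pairs, highest first, matching the table above
BUCKETS = [
    ("25,000+", 25_000),
    ("10,000 - 24,999", 10_000),
    ("1,000 - 9,999", 1_000),
    ("100 - 999", 100),
    ("10 - 99", 10),
    ("5 - 9", 5),
    ("2 - 4", 2),
    ("1", 1),
]


def bucket_domains(sightings):
    """sightings maps domain -> times seen; returns [(label, number of domains)]."""
    tally = Counter()
    for count in sightings.values():
        for label, lower in BUCKETS:
            if count >= lower:  # first bucket whose lower bound fits
                tally[label] += 1
                break
    return [(label, tally[label]) for label, _ in BUCKETS]
```

Bucketing on a roughly logarithmic scale is what makes the long tail readable: on a linear chart the tens of thousands of one-time domains flatten everything else to nothing.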

An analysis of how often these "pseudo" domain names appear helps us to understand that the selection process for these host names is NOT a random selection from a dictionary, but rather a random selection from a large text sample. We know this by the frequency of commonly occurring words. During that same period, here is the count by domain name for the spam:

179,958 - - #1
100,255 - - #2
74,603 - - #4
66,104 - - #6
47,307 - - #5
42,713 - - #3
28,217 - - #7
20,051 - - #13
18,234 - - #17
18,097 - - #29
17,512 - - #12
16,178 - - #14
14,962 - - #16
13,990 - - #26
13,879 - - #15

The number following the domain name is that word's frequency rank from "The Most Common Words in English." The fact that they don't follow the true frequency order probably points to the fact that while they have a large language sample, it's not a truly enormous language sample, or we would see a truer frequency distribution.
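One way to put a number on "close to, but not exactly, the true frequency order" is a Spearman rank correlation between the order seen in spam and the English frequency rank. The sketch below uses only the fifteen counts and ranks listed above; choosing Spearman's rho is my own assumption, not part of the original analysis, and the statistic by itself says nothing about how large the spammer's text sample is.

```python
# (sightings in spam, rank in "The Most Common Words in English"), from the list above
observed = [
    (179_958, 1), (100_255, 2), (74_603, 4), (66_104, 6), (47_307, 5),
    (42_713, 3), (28_217, 7), (20_051, 13), (18_234, 17), (18_097, 29),
    (17_512, 12), (16_178, 14), (14_962, 16), (13_990, 26), (13_879, 15),
]


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman(pairs):
    """Spearman's rho between spam order and English-frequency order."""
    n = len(pairs)
    spam_ranks = ranks([-count for count, _ in pairs])  # most-seen word gets rank 1
    english_ranks = ranks([rank for _, rank in pairs])
    d2 = sum((a - b) ** 2 for a, b in zip(spam_ranks, english_ranks))
    return 1 - 6 * d2 / (n * (n * n - 1))


rho = spearman(observed)
print(f"Spearman rho = {rho:.3f}")
```

For these fifteen words rho comes out at about 0.85: strongly correlated with real English usage, as you would expect from words pulled out of running text, but visibly imperfect.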

The first possible "double-usage" domain comes here:

13,526 -

Apple is not one of the 500 most common words in English.

Clearly most of these are NOT going to be "Pseudo-URLs", as we know that "apple" is not nearly as common a word as "are" and "from". In fact, most of the emails we have with in them are unlikely to be spam at all. Other domains that we saw with at least this high a count are either "whitelisted" domains or they are clearly "spam" domains. (List below has "whitelisted" domains suppressed).

58,039 -
53,504 -
41,012 -
39,629 - (???)
33,724 -
31,613 -
31,579 -
29,704 -
28,802 - (???)
27,752 -
27,572 -
27,184 -
25,928 -
25,583 -
24,164 -
23,800 -
23,300 -
23,152 -
23,004 -
22,808 -
22,800 -
22,784 -
22,100 -
21,696 -
21,648 -
21,144 -
20,892 -
20,880 -
20,530 -
20,460 -
20,416 -
20,400 -
20,360 -
20,236 -

When we get down to the single use domains, it becomes clear again that the "word list" for these randomly created domains is not a dictionary. We have words like "bariloche", "doughton", "vignarajah", and "okjeo", which does seem to lend credence to the idea that has been floated that these are words selected from Wikipedia.

Example One: Pharmacy Express

But what does the spam actually LOOK like? And what does it do?


Here is an example image from the spam:

In the spam message that used this image, the image was loaded from the URL:

and clicking on the spammed image would take the visitor to:

which contained an auto-forwarder that would have sent the visitor instead to:

Which is a Pharmacy Express pill sales site hosted on

Please note that the URL on "" is a compromised domain, as we discussed in our August blog article Viagra Spammers as Hackers, where compromised domains were used as spam targets and redirected the visitors to a Pharmacy Express domain.

The NOISE in that spam message however, includes links to non-existent images including:

Then there is a block of text, hidden from the email recipient by a "span style" tag that reads:

The variation among the German dialects is considerable, with only the neighboring dialects being mutually intelligible. NYS School of Industrial and Labor Relations. The then-reigning government (cabinet Persson) stated that they would only take into consideration the results of the referendum in Stockholm Municipality. They too have been deaf to the voice of justice and of consanguinity. The country accounts for two-fifths of global military spending and is a leading economic, political, and cultural force in the world. A National Public Radio affiliate, and Public Broadcasting Service television station WPBA 30. Stadiums with a capacity of more than 40,000. New York City at the Open Directory Project. The other professional rugby union team in the city is second division club London Welsh, that plays home matches in the city. A sense of Indonesian nationhood exists alongside strong regional identities.

Mixed in among that text are additional non-existent image tags:

Example Two: Canadian Pharmacy


The group above actually hasn't been so troublesome, because we don't bother to resolve every .jpg URL that comes through our spam. The explosion actually has come from the group described in THIS example. In this email there is a large mix of ".php" URLs in the hidden data.

The image displayed in the spam has a randomly created name itself, in this case anchored to the "real" domain "". In this example the graphic file was retrieved as:

but, just as a test, I told my browser to load instead "", a name that I just made up. I get the same image either way. In fact, any machine name with any file name that ends in ".gif" will show you the same graphic if the domain name is "".

The same is true for the URL that you are directed to if you click on the graphic in your spam. Going to "" with any machine name and NO filename will cause the autoforwarder to send you to:

Which looked like this when we visited:

There are currently thirty "real" spammer domains, each of which functions in the same way as the "" domain:

The Pseudo URLs in this email included:

And the same "style span" trick was used to suppress text intended to confuse spam filtering systems, which in this example read:

Expansion of transportation options encouraged economic expansion. Kahn then founded LightSurf in 1998. Temperate grasslands, savannas, and shrublands. Subtracted from 10, that leaves a result from 1 to 10. After the Cold War, the 86th was realigned to become an Airlift Wing, which it remains today. Just under three quarters of Australia lies within a desert or semi-arid zone. Since this ion is three steps removed from atmospheric CO 2, the level of inorganic carbon storage in the ocean does not have a proportion of unity to the atmospheric partial pressure of CO 2. Japanese Journal of Religious Studies 33. The law went into effect in March 16, 2006, garnering much local and national media attention. Wars involving the Illinois Country, Illinois Territory, and State of Illinois. NGS FAQ - What is a geodetic datum. Archived from the original on 6 July 2010. Australia is also powerful in track cycling, rowing, and swimming, having consistently been in the top-five medal-winners at Olympic or World Championship level since 2000. The SI units for both systems are summarized in the following tables. Time has seen significant improvements in the usability and effectiveness of computer science technology. The ISBN separates its parts (group, publisher, title and check digit) with either a hyphen or a space.
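A filter that wants to ignore this filler first has to separate what the recipient sees from what is hidden. The sketch below shows one way to do that with Python's standard `html.parser`. It assumes the hiding is done with an inline `display:none` style on a `<span>`; the post only says a "span style" tag was used, so the exact style, the `HiddenSpanSplitter` class, and the `split_hidden` helper are all illustrative assumptions.

```python
from html.parser import HTMLParser


class HiddenSpanSplitter(HTMLParser):
    """Collect text inside hidden <span> elements separately from visible text."""

    def __init__(self):
        super().__init__()
        self.visible, self.hidden = [], []
        self._stack = []  # one bool per open <span>: is that span hidden?

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = (dict(attrs).get("style") or "").replace(" ", "").lower()
            # Assumption: hiding is done with display:none
            self._stack.append("display:none" in style)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if not data.strip():
            return
        target = self.hidden if any(self._stack) else self.visible
        target.append(data.strip())


def split_hidden(html):
    """Return (visible text, hidden text) for an HTML message body."""
    parser = HiddenSpanSplitter()
    parser.feed(html)
    parser.close()
    return " ".join(parser.visible), " ".join(parser.hidden)
```

Any links or image tags that turn up only in the hidden portion are candidates for the "noise" pile rather than the "resolve this" pile, though a spammer could just as easily hide text with other styles that this sketch does not look for.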
