Robots and Crawlers
web stats ruined for March 2006

This isn't the first time I've had to complain about 'bots wrecking my stats. Here they come again, and this time it's Yahoo's Slurp crawler

@ http://help.yahoo.com/help/us/ysearch/slurp/index.html

My stats gradually went from about 2500 "sessions" (formerly visitors) per day to around 4500 per day over a period of about 2-3 weeks. Now, because it's March, I expect to see a significant rise in the number of people who visit my site. It's a general occurrence in both March and October when all the kids are back in school (high school and college) and are not preparing for any significant holiday. They're just doing school and surfing the net in larger than normal quantities.

But when it looked like I'd have over 100,000 visitors this month, I had to step back and say ... "Hmmmm, this ain't right. What the hell is goin' on now?". Hence, an investigation of my stats ... which revealed, after some 15 minutes of digging, that the culprit was in fact the Slurp robot crawler which Yahoo has unleashed upon the web.

In general, I don't mind a 'bot crawling my site. That's how you get listed in search engines. I do object when they suck up my entire site in one day ... then come back the next day and do it again ... and again ... and again ... like some kind of stupid machine. Uhh ... I guess that's what they are, eh?

Slurp's new tactic

What's different here is that Slurp "spoofs" visitors by using a new IP number whenever it revisits your site ... like 2000 times per day in my case. That is, the designers have made Slurp's activities appear as normal traffic thus destroying the validity of any stats package the owner of the site might employ ... thus increasing the apparent traffic going to any particular site that they are crawling. Why they are spoofing visitors is not clearly understood by me ... but ... there is undoubtedly something sinister involved here. Yahoo is battling against Google and has come up with some sort of dirty deeds tactic.

If the crawler would use one IP and then suck down 1/4 of my site, I wouldn't mind. I can detect that in my stats. Also, the Slurp crawler obeys the robots.txt file in the site's root folder, which all 'bots are supposed to honor. There you put a few lines of directives that the bot reads, and you can exclude that particular bot from any particular folder. That's what I promptly did, and the next day it stayed out of my old wwwboard/messages folder, which contains a couple thousand old posts from my first (now obsolete) Matt's wwwboard. Nobody uses it anymore because there are other evil bots that look for Matt's old programs and post ads and porn messages to every board they find. All the posts to that old board were saved as .html files. Hence, when the crawler finds a .html file, it automatically records its content and puts it into its database, i.e. it downloads it just like a human and is counted as a visitor by my stats program.
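For anyone who wants to try the same exclusion, here is a rough sketch of what such a robots.txt entry might look like, checked with Python's standard robots.txt parser. The exact folder path and the "Slurp" user-agent token are my assumptions based on the description above, not a copy of the actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. The folder path is assumed from the
# description above; adjust it to match the real site layout.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /wwwboard/messages/
"""


def allowed(user_agent: str, path: str) -> bool:
    """Return True if a well-behaved bot with this name may fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("Slurp", "/wwwboard/messages/123.html"),
        ("Slurp", "/index.html"),
        ("Googlebot", "/wwwboard/messages/123.html"),
    ]:
        print(agent, path, allowed(agent, path))
```

Note that this only keeps out bots that choose to honor the file. It names one bot and one folder, so every other crawler (and every other folder) is left alone.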

When I excluded the Slurp 'bot from that folder, my stats immediately took a nosedive to normal traffic stats. Hell, for a while I thought my site was going great guns again! Alas, it is just chugging along at the usual rate with no increase in visitors. I think I've reached saturation and will expect no further increase in my traffic henceforth. It will stay around 2000 - 2500 visitors per day ... for the foreseeable future.

But this got me to thinking evil thoughts

What then is my actual traffic? How could I know? In my stats, one of the most downloaded pages is the robots.txt page. It gets downloaded maybe 200 times per day. That's a lot of 'bot traffic to my site. What are they doing? Are all of them doing what Slurp was doing? How would I know?
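One way to start answering that would be to go to the raw server log and see who is asking for robots.txt. The sketch below assumes Apache-style "combined" log lines; my stats package may well store things differently, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumes the Apache "combined" log format; the real log format may differ.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_fetchers(lines):
    """Count requests for /robots.txt, grouped by user-agent string."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines using documentation-only IP addresses.
    sample = [
        '192.0.2.1 - - [25/Mar/2006:10:00:00 -0500] "GET /robots.txt HTTP/1.0" 200 120 "-" "BotA/1.0"',
        '192.0.2.2 - - [25/Mar/2006:10:05:00 -0500] "GET /robots.txt HTTP/1.0" 200 120 "-" "BotA/1.0"',
        '192.0.2.3 - - [25/Mar/2006:10:06:00 -0500] "GET /index.html HTTP/1.0" 200 5120 "-" "SomeBrowser/4.0"',
    ]
    for agent, hits in robots_txt_fetchers(sample).most_common():
        print(hits, agent)
```

This only catches the bots that announce themselves by reading robots.txt. Anything that skips the file, or lies in its user-agent string, would slip right past it.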

It's now conceivable to me that as much as half the traffic on the internet is not human at all. It may be that 'bots have taken over the whole shebang. Think about it. They all spoof human surfers and make people think that they have lots of visitors which they don't. Some of the illicit bots may even be able to fill out forms and click on links. Think of the consequences to Google's AdSense program. A bot is "hired" to spoof different IP addresses and click on ads which generates illicit revenue for someone while the rest of us honest people get less and less revenue by being squeezed out. Maybe robots get and respond to more spam than people. Robots scamming robots. What a hoot!

What sort of resources are being wasted by these bots? The server time and bandwidth. Maybe they are getting ready to crash the internet altogether.

More examination

My stats have leveled off. This Sunday 3/25/06 I got 2198 "sessions". If they were all human that would be about 2000 different people if I consider that most people don't go away and come back later in the day (which I think is usually the case). Looking over the sessions info in my stats package, I have to assume that over half of my supposed visitors are just 'bots downloading pages. And, they are all using the above spoofing tactics ... so ... internet traffic is probably only in the 40% area for real people actually surfing.
60% of all internet traffic today is undoubtedly done by robots. We are overrun already. The machines have taken over.

So, my conservative guess as to how many living people actually visit my site every day is about 500 and my liberal guess is 1000 while the stats package says around 2000 because it doesn't identify robot traffic directly for me. That is, it can identify such traffic but won't because it's something "THEY" don't want you to know about. After all, the Web is a strictly commercial venture. That's why no pages are cached anymore (not even on your own machine). If pages were cached you wouldn't have to re-fetch them and valuable commercial marketing info would be lost.

Note:
My Google AdSense program is giving me my best stats info actually. It says I get about 1000 pageviews per day which have ads on them. If each visitor looks at an average of two pages (what my stats package says), I would be getting 500 visitors per day. Hmmm ... I am deflated ... but I prefer the truth ... always.
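The arithmetic behind that estimate can be written out as a quick back-of-the-envelope sketch. It leans on two assumptions of mine: that the ad-bearing pageviews come only from real people, and that the stats package's two-pages-per-visitor average holds for humans. The function names are just for illustration.

```python
def visitors_from_pageviews(pageviews_per_day: float, pages_per_visitor: float) -> float:
    """Estimate daily human visitors from ad-bearing pageviews."""
    return pageviews_per_day / pages_per_visitor


def bot_share(reported_sessions: float, estimated_humans: float) -> float:
    """Fraction of reported sessions that would be non-human."""
    return 1 - estimated_humans / reported_sessions


if __name__ == "__main__":
    # Figures quoted in the text: about 1000 ad pageviews per day,
    # 2 pages per visitor, and roughly 2000 reported sessions per day.
    conservative = visitors_from_pageviews(1000, 2)
    print(conservative)                   # 500.0
    print(bot_share(2000, conservative))  # 0.75
    print(bot_share(2000, 1000))          # 0.5, using the liberal guess of 1000
```

So the conservative guess puts the 'bots at about three quarters of my reported sessions and the liberal guess at half, which brackets the 60% figure above.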

The Web as it first was (back in the mid nineties) ... is dead. Replaced by ... commercialism, subterfuge and bean-counters


