This past Sunday I was greeted with not one, but two emails from my host both warning that I was about to reach the 100 GB bandwidth limit on one of my accounts.
I thought surely there must be some sort of mistake because this particular account is home to my very low traffic sites. These sites could never collectively come anywhere close to maxing out even the most modest resource limits. However, there it was, staring me in the face: the first email saying 90.2 GB of traffic had been used and the second, a mere 24 hours later, saying 95.3 GB of traffic had been used.
Wow! 5.1 GB in 24 hours?! In one day my bandwidth consumption was well beyond what this account typically uses in an entire month. In an "oh shit" moment, I imagined the impending doom: the extra fees that would become a reality with just another 24 hours of 5+ GB of consumption.
I hurried off to research the cause and discovered the culprit was a bot. (Duh, right?) But to my surprise this wasn't the kind of bot you'd expect. This was a "friendly bot"; one I've always considered to be "one of the good guys." The bot I am referring to is bingbot - the search engine crawler for bing.com by Microsoft.
Apparently bingbot's bandwidth binge started back at the end of November but didn't come to my attention until it got so greedy it consumed enough bandwidth to trigger the alerts from my host. Thousands upon thousands of requests - all made by my dear friend bingbot. (You know what they say: With friends like that, who needs enemies?)
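For anyone wanting to do the same kind of digging, here is a rough Python sketch of totalling bandwidth per user-agent from an access log. It assumes the common Apache/Nginx "combined" log format, which may not match your host's logs, and the sample lines (IPs, dates, paths, byte counts) are made up purely for illustration.

```python
import re
from collections import Counter

# Assumed "combined" format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

def bytes_by_agent(lines):
    """Sum response bytes per user-agent string."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        size = match.group("size")
        # A size of "-" means no body was sent, so count it as zero bytes.
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return totals

# Made-up sample data, for illustration only.
BOT = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
HUMAN = "Mozilla/5.0 (X11; Linux x86_64)"
SAMPLE = [
    f'203.0.113.5 - - [01/Jan/2000:00:00:01 +0000] "GET /list?page=2&count=50 HTTP/1.1" 200 51200 "-" "{BOT}"',
    f'203.0.113.5 - - [01/Jan/2000:00:00:02 +0000] "GET /list?page=3&count=50 HTTP/1.1" 200 48800 "-" "{BOT}"',
    f'198.51.100.7 - - [01/Jan/2000:00:00:03 +0000] "GET /about HTTP/1.1" 200 1024 "-" "{HUMAN}"',
    f'198.51.100.7 - - [01/Jan/2000:00:00:04 +0000] "GET /about HTTP/1.1" 304 - "-" "{HUMAN}"',
]

# Print the heaviest consumers first.
for agent, total in bytes_by_agent(SAMPLE).most_common():
    print(f"{total:>8} bytes  {agent}")
```

Sorting by total bytes rather than by request count is deliberate here: a bot making a modest number of requests for heavy pages can still be the one eating the bandwidth.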
This Couldn't Be Right
So certain was I that bingbot's user-agent had been spoofed that I double and triple checked the many IP addresses that were showing up in my logs and sure enough, they all belonged to Microsoft. Damn it.
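One common way to check whether an IP really belongs to bingbot is a reverse DNS lookup followed by a forward lookup of the returned hostname. The Python sketch below shows that idea; it is not necessarily how the check above was done, and the ".search.msn.com" suffix is an assumption based on Bing's published guidance that should be confirmed against their current documentation. The resolver functions are parameters only so the logic can be tried without touching the network.

```python
import socket

# Assumption: genuine bingbot hosts reverse-resolve to names under this suffix.
BINGBOT_SUFFIX = ".search.msn.com"

def is_verified_bingbot(ip,
                        reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname):
    """Return True if ip reverse-resolves to a Bing hostname that maps back to ip."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no reverse DNS record, or the lookup failed
    if not hostname.lower().endswith(BINGBOT_SUFFIX):
        return False  # hostname is outside the assumed Bing domain
    try:
        # The forward lookup guards against a forged reverse DNS record.
        return forward_lookup(hostname) == ip
    except OSError:
        return False

# Usage against live DNS (needs network access):
#     is_verified_bingbot("an.ip.from.your.logs")
```

The user-agent string alone proves nothing, since anyone can send it. The round trip through DNS is what ties the request back to the crawler's operator.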
The Hunt For a Fix
In typical danbriapps fashion, I hit the search engines trying to figure out WTF and, more importantly, figure out what to do about it. I also sent a shout out for advice on twitter but, when your only followers are dip shits hoping to sell you something... well, the results are less than ideal. (What can I say? It was worth a shot.)
Anyway, most people want to GET crawled by the search engine spiders, not turn them away, so my search for a fix was a bit diluted with information about luring the bingbot to visit which, clearly, was not something I was having trouble with. By tweaking my search terms slightly I managed to find information that ultimately led me to my chosen solution. Be forewarned that with the switch from MSN to Bing there is now a lot (and I do mean a lot) of outdated documentation.
My Bad, Sort Of.
Turns out I am partially to blame for this debacle. Four of the sites on this account were designed to be both user and search engine friendly. Apparently too much so. They are made up of dynamically generated pages strategically linked to more dynamically generated pages. Perfect for keeping a human visitor on-site for ages but a virtual trap for a bot that doesn't know when to call it quits.
When building the sites, I purposely added rel=nofollow to links that would cause search engine bots to hang out too long. I guess I just wrongly assumed that all the major players would obey my wishes. Oddly enough they did - but only for a while.
The sites affected by this have been up for ages and have even been getting regular, albeit low, amounts of traffic from Bing as well as from Google and Yahoo. Why only now did Bing decide to kick things up a notch? It didn't make much sense and all I knew was that it was a pain in my ass that needed to go away and was about to start costing me money.
My Options and Solution
I could have banned bingbot completely by disallowing the user-agent in my robots.txt file but, as I said, Bing sends me traffic, and that traffic converts, so I didn't want to put an end to it. Also, with these sites I have taken a "set it and forget it" approach. Banning the bingbot now would mean periodic reassessment and testing of that decision, which I had no desire to futz with.
I could have set a crawl delay in the robots.txt file, reasoning that if this over-consumption was Bing's idea of "normal" then delaying it would slow it down to my idea of normal. I didn't like this idea because it seemed too subjective and too sensitive to changes in Bing's logic. The bingbot crawled these sites for months without issue, and if I had slowed it down to a trickle, what would happen when they went back to the way things were? My fear was that I would end up getting these sites dropped out of Bing's SERPs. Not a wise option.
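To make those two rejected options concrete, here is a small Python sketch that feeds each one to the standard library's urllib.robotparser. The robots.txt snippets are illustrative, and the 10-second delay is an arbitrary value picked for the example, not a recommendation. How a real crawler interprets Crawl-delay is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

def parse_robots(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# Option 1: ban bingbot outright.
ban = parse_robots("User-agent: bingbot\nDisallow: /\n")

# Option 2: ask bingbot to slow down (10 is a hypothetical value).
slow = parse_robots("User-agent: bingbot\nCrawl-delay: 10\n")

print(ban.can_fetch("bingbot", "/any/page"))    # False: bingbot is shut out
print(ban.can_fetch("Googlebot", "/any/page"))  # True: other bots are unaffected
print(slow.can_fetch("bingbot", "/any/page"))   # True: still allowed everywhere
print(slow.crawl_delay("bingbot"))              # 10
```

Neither option touches the real problem, which is the endless supply of parameterized URLs. The ban throws away all Bing traffic, and the delay still lets the bot wander through every generated page, just more slowly.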
Finally, I could (and did) edit my robots.txt file to disallow the crawling of any page having certain parameters in the URL. (The ones born from that endless dynamic linking I spoke of earlier.)
Here's what the robots.txt file looks like.
User-agent: *
Disallow: /*?*count=
Disallow: /*?*last=
Disallow: /*?*page=
Disallow: /*?*sortorder=
This is what SHOULD have been done from the start - it just never occurred to me that my friends could become my enemies if I didn't take this simple measure.
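To see which URLs those four rules would catch, here is a minimal Python sketch that treats each asterisk as "any run of characters" and matches from the start of the path, which is how the major search engines describe their wildcard handling. It is an approximation for sanity-checking patterns, not a copy of any crawler's actual matcher, and the test URLs are made up.

```python
import re

# The four Disallow patterns from the robots.txt above.
DISALLOW_PATTERNS = ["/*?*count=", "/*?*last=", "/*?*page=", "/*?*sortorder="]

def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a regex; only '*' is special here."""
    literal_parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(literal_parts))

def is_disallowed(path_and_query, patterns=DISALLOW_PATTERNS):
    """True if any pattern matches from the beginning of the path."""
    return any(pattern_to_regex(p).match(path_and_query) for p in patterns)

print(is_disallowed("/list?page=2&count=50"))  # True
print(is_disallowed("/about"))                 # False
print(is_disallowed("/list?color=red"))        # False
print(is_disallowed("/list?discount=5"))       # True, because "discount=" ends in "count="
```

That last line is worth a second look. Under this reading of the wildcards, a pattern like /*?*count= also blocks any parameter whose name merely ends in "count", so check your own parameter names before copying these rules.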
Outcome
My challenge was to get the bingbot to back off without banning it completely. So far, it looks like I have managed to accomplish just that with the new robots.txt edits. I have been monitoring my logs closely and am happy to report that bingbot is officially on a diet - at least as far as my sites are concerned.
If you have sites that make use of URL parameters that are great for users but pointless for bots, such as page=, count=, etc., do consider disallowing them in your robots.txt file right away! I realize this will not deter rogue bots that don't even check the robots.txt file, and it won't make sense to bots that do not support wildcards (*), but it will certainly level the playing field among those that do.
Hindsight
I have heard that when you know better, you do better. Well, now that I know better, you can be sure I will be paying much closer attention to what I need to do to get the search engines to play nice on my sites and not eat up more bandwidth than absolutely necessary.
FYI, in the midst of all this, I did end up getting a bill for the bandwidth overage. Not too bad - just a little over $25 for going over by 12.8 GB. I've paid a lot larger sums for much smaller lessons so I am thankful for getting such a huge wake-up call at such a tiny price.
Your Thoughts?
What have you done to keep the bots in line? What oversights have you had that ended up costing you money? Feel free to let me know in the comments.
Comments
I’ve been slammed before but never that badly. Too bad you can’t send a bill to Mr. Gates for your trouble.
LOL! Don’t think it didn’t cross my mind!
I had this happen. It almost feels like bleeding profusely from a major artery with no tourniquet in sight. Good to know you were able to handle it fast.
Quite the visual there with that analogy of yours Paul – and not too far off from how I felt! Thanks for stopping by.
Hey Brandi,
Don’t worry about a lack of followers on twitter; you already admitted you’re new to the whole online social scene. You seem pretty smart so I’m sure your network will get going soon. Be patient. I’d follow you but I don’t tweet. Good luck.
Not worried at all Lisa. It would be a sad day indeed if I tied my self-esteem to my number of twitter followers. Whatever happens, happens. Thanks for your concern.
No problem
What are the asterisks and question marks for in the robots.txt code?
The asterisks are wildcards to tell search engines they should disallow any url matching my string no matter what character(s) may be in place of the asterisks.
The question marks reference the question mark found in a url that contains parameters (e.g., /?page=2&count=50).
I have had the same issue on my site with Bing/MSN bots: over 30GB of bandwidth gobbled up in 17 days!! I have used the delay setting in robots.txt and it seems to have worked, and I don’t have the dynamic page links that you have either!!
Quite frustrating, isn’t it? Try scouring your logs for anything you can afford to disallow in the robots.txt (such as /images, large script files or less important pages like “about” or “privacy”). The hard part seems to be finding balance. After all, we don’t want to ban them completely because they send us business but we don’t want to pay (via bandwidth consumption) to be in the SERPs either. Good luck.
Just found your post after searching for “bingbot is a pain in the ass”. Had a similar occurrence which I just posted about. http://www.aestheticdesign.com/blog/bingbot/
Fortunately it didn’t shut me down, I just got a little ding in the wallet. I too wish I could send the bill to Bill. I know he could afford it better than I.
I love this: “Just found your post after searching for ‘bingbot is a pain in the ass.’”
LOL! And here I thought I was the only one that used search engines in my passive-aggressive fits. Thanks for visiting – and for making me feel more normal. :)
How long did it take bingbot to start paying attention to your directives? I can see that the bot is fetching the new robots.txt file, but it’s disregarding the disallows. I even tried entering the parameters in Bing’s web tools to no avail. The other bots have stopped indexing the dynamic URLs but Bing continues to hammer the site. Over 500 hits this morning indexing the same 15 pages over and over.
It didn’t take long (< 24 hrs or so).
There are a couple of errors in your robots.txt file, but since I do not know the structure of your site or exactly what you want to exclude, I cannot say what specifically would be upsetting the bingbot.
Try putting all directives together grouped by the bot name, leaving the wildcard (*) group for last. Also, the current order of your cgi-bin rules will prevent the second from being followed. Try the following changes and see if it helps.
User-agent: BecomeBot
Disallow: /cgi-bin/
User-agent: *
Allow: /cgi-bin/store/
Disallow: /cgi-bin/
Disallow: /images
Disallow: /GeneratedItems
Disallow: /store_files/
Think of it as a top-down list of rules. As soon as a bot sees a rule that applies to it for the page it is crawling, it should (if it plays nice) stop there.
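The ordering point can be tried out with Python's standard urllib.robotparser, which does apply the first rule that matches. The sketch below uses the two cgi-bin rules from the suggestion above in both orders; the /cgi-bin/store/cart and /cgi-bin/admin paths are hypothetical. Keep in mind this shows one parser's behavior only: crawlers that follow the newer robots.txt specification (RFC 9309) pick the most specific matching rule instead, so for them the order would not matter.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, path, agent="bingbot"):
    """Ask Python's first-match parser whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

ALLOW_FIRST = """\
User-agent: *
Allow: /cgi-bin/store/
Disallow: /cgi-bin/
"""

DISALLOW_FIRST = """\
User-agent: *
Disallow: /cgi-bin/
Allow: /cgi-bin/store/
"""

print(allowed(ALLOW_FIRST, "/cgi-bin/store/cart"))     # True: the Allow is seen first
print(allowed(DISALLOW_FIRST, "/cgi-bin/store/cart"))  # False: the Disallow wins
print(allowed(ALLOW_FIRST, "/cgi-bin/admin"))          # False: only the Disallow applies
```

Putting the narrower Allow line ahead of the broader Disallow is the safer arrangement, since it gives the intended result under both first-match and most-specific-match rules.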
For a bit more detail see this wikipedia article on the matter... http://en.wikipedia.org/wiki/Robots_exclusion_standard
Good luck. Hope this gets worked out soon for you.