Pushing Bad Data- Google’s Latest Black Eye
Google stopped counting, or at least publicly displaying, the number of pages it indexed in September 2005, after a school-yard “measuring contest” with rival Yahoo. That count topped out at around 8 billion pages before it was removed from the homepage. News broke recently through various SEO forums that Google had suddenly, over the past few weeks, added another few billion pages to the index. This might sound like a reason for celebration, but this “accomplishment” would not reflect well on the search engine that achieved it.
What had the SEO community buzzing was the nature of those fresh, new few billion pages. They were blatant spam, containing Pay-Per-Click (PPC) ads and scraped content, and they were, in many cases, showing up well in the search results. They pushed out far older, more established sites in doing so. A Google representative responded to the issue via forums by calling it a “bad data push,” something that met with various groans throughout the SEO community.
How did someone manage to dupe Google into indexing so many pages of spam in such a short period of time? I’ll provide a high-level overview of the process, but don’t get too excited. Just as a diagram of a nuclear explosive isn’t going to teach you how to make the real thing, you’re not going to be able to run off and do it yourself after reading this article. Yet it makes for an interesting tale, one that illustrates the ugly problems cropping up with ever-increasing frequency in the world’s most popular search engine.
A Dark and Stormy Night
Our story begins deep in the heart of Moldova, sandwiched scenically between Romania and Ukraine. In between fending off local vampire attacks, an enterprising local had a brilliant idea and ran with it, presumably away from the vampires… His idea was to exploit how Google handled subdomains, and not just a little bit, but in a big way.
The heart of the issue is that currently, Google treats subdomains much the same way as it treats full domains: as unique entities. This means it will add the homepage of a subdomain to the index and return at some point later to do a “deep crawl.” A deep crawl is simply the spider following links from the domain’s homepage deeper into the site until it finds everything, or gives up and comes back later for more.
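The deep crawl described above can be pictured as a breadth-first walk over a site's links. The sketch below is only an illustration under stated assumptions: `deep_crawl`, the `links` map, and the `max_pages` cutoff are hypothetical names invented here, the "site" is an in-memory dictionary rather than fetched pages, and nothing in it reflects how GoogleBot is actually implemented.

```python
from collections import deque


def deep_crawl(links, homepage, max_pages=1000):
    """Breadth-first walk over a link graph, starting from a homepage.

    `links` is a hypothetical in-memory map of page -> list of linked pages;
    a real spider would fetch and parse each page instead.
    `max_pages` models the spider giving up and coming back later.
    """
    seen = {homepage}          # pages already discovered
    queue = deque([homepage])  # pages waiting to be visited
    visited = []               # pages in the order they were crawled
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Toy site used only for illustration.
site = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/a", "/products/b"],
    "/products/a": ["/"],
}

print(deep_crawl(site, "/"))
# ['/', '/about', '/products', '/products/a', '/products/b']
```

With a small `max_pages`, the walk stops early, which mirrors the spider giving up and returning later for the rest.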
Briefly, a subdomain is a “third-level domain.” You’ve probably seen them before; they look something like this: subdomain.domain.com. Wikipedia, for instance, uses them for languages; the English version is “en.wikipedia.org”, the Dutch version is “nl.wikipedia.org.” Subdomains are one way to organize large sites, as opposed to multiple directories or even separate domain names altogether.
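To make the subdomain idea concrete, here is a minimal Python sketch that splits a hostname into its subdomain and parent domain and then groups URLs by parent domain. The helper names `split_host` and `group_by_domain` are hypothetical, and the code naively assumes the parent domain is always the last two labels, which is wrong for suffixes such as “co.uk”; a robust version would consult the Public Suffix List.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def split_host(url):
    """Return (subdomain, domain) for a URL with a scheme.

    Naive assumption: the registered domain is the last two labels.
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) <= 2:
        return "", host
    return ".".join(labels[:-2]), ".".join(labels[-2:])


def group_by_domain(urls):
    """Map each parent domain to the set of subdomains seen under it."""
    groups = defaultdict(set)
    for url in urls:
        subdomain, domain = split_host(url)
        groups[domain].add(subdomain)
    return dict(groups)


print(split_host("http://en.wikipedia.org/wiki/Main_Page"))
# ('en', 'wikipedia.org')

print(group_by_domain([
    "http://en.wikipedia.org/",
    "http://nl.wikipedia.org/",
]))
```

Both Wikipedia hosts land under the same parent domain, which is the distinction that matters here: one domain, many separately addressable subdomains.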
So, we have a kind of page Google will index virtually “no questions asked.” It’s a wonder no one exploited this situation sooner. Some commentators believe the reason may be that this “quirk” was introduced after the recent “Big Daddy” update. Our Eastern European friend got together some servers, content scrapers, spambots, PPC accounts, and some all-important, very inspired scripts, and mixed them all together thusly…
Five Billion Served- And Counting…
First, our hero here crafted scripts for his servers that would, when GoogleBot dropped by, start generating an essentially endless number of subdomains, each with a single page containing keyword-rich scraped content, keyworded links, and PPC ads for those keywords. Spambots were then sent out to put GoogleBot on the scent via referral and comment spam on tens of thousands of blogs around the world. The spambots provide the broad setup, and it doesn’t take much to get the dominoes to fall.
GoogleBot finds the spammed links and, as is its purpose in life, follows them into the network. Once GoogleBot is sent into the web, the scripts running the servers simply keep generating pages, page after page, all with a unique subdomain, all with keywords, scraped content, and PPC ads. These pages get indexed, and suddenly you’ve got yourself a Google index 3-5 billion pages heavier in under 3 weeks.
Reports indicate that, at first, the PPC ads on these pages were from Adsense, Google’s own PPC service. The ultimate irony, then, is that Google benefits financially from all the impressions being charged to Adsense users as they appear across these billions of spam pages. The Adsense revenues from this endeavor were the point, after all: cram in so many pages that, by sheer force of numbers, people would find and click on the ads in those pages, making the spammer a nice profit in a very short amount of time.
Billions or Millions? What is Broken?
Word of this achievement spread like wildfire from the DigitalPoint forums, at least within the SEO community. The “general public” is, as of yet, out of the loop, and will probably remain so. A response by a Google engineer appeared on a Threadwatch thread about the topic, calling it a “bad data push”. Basically, the company line was that they have not, in fact, added 5 billion pages. Later claims include assurances that the issue will be fixed algorithmically. Those following the situation (by tracking the known domains the spammer was using) see only that Google is removing them from the index manually.
The tracking is accomplished using the “site:” command, which theoretically displays the total number of indexed pages from the site you specify after the colon. Google has already admitted there are problems with this command, and “5 billion pages”, they seem to be claiming, is merely another symptom of it. These problems extend beyond the site: command to the displayed number of results for many queries, which some feel is highly inaccurate and in some cases fluctuates wildly. Google admits it has indexed some of these spammy subdomains, but so far hasn’t provided any alternate numbers to dispute the 3-5 billion shown initially via the site: command.
Over the past week the number of spammy domains and subdomains indexed has steadily dwindled as Google personnel remove the listings manually. There’s been no official statement that the “loophole” is closed. This poses the obvious problem that, since the way has been shown, there will be a number of copycats rushing to cash in before the algorithm is changed to deal with it.
Conclusions
There are, at minimum, two things broken here: the site: command, and the obscure, tiny bit of the algorithm that allowed billions (or at least millions) of spam subdomains into the index. Google’s current priority should probably be to close the loophole before they’re buried in copycat spammers. The issues surrounding the use or misuse of Adsense are just as troubling for those who might be seeing little return on their advertising budget this month.
Do we “keep the faith” in Google in the face of these events? Most likely, yes. It is not so much a question of whether they deserve that faith, but that most people will never know this happened. Days after the story broke there’s still very little mention in the “mainstream” press. Some tech sites have mentioned it, but this isn’t the kind of story that will end up on the evening news, mostly because the background knowledge required to understand it goes beyond what the average citizen is able to muster. The story will probably end up as an interesting footnote in that most esoteric and neoteric of worlds, “SEO History.”


