Where Are the SEOs' Yachts? / QBot 2.0 Going Live Tonight!
About a year ago, we unveiled QualityBot ("QBot" for short). At the time, QBot was our most successful algorithm for filtering out webspam.
I am pleased to announce that tonight, QBot 2.0 goes into production. With it, Link Insight users will have access to a more comprehensive version of Spam Analyzer, which looks something like this:
The technology behind QBot is complicated, but the motivation behind it is simple: we wanted to increase the number of high quality links our subscribers can find while simultaneously eliminating more spam.
To achieve this, we needed benchmarks. Ours are:
- Precision – how accurate is the model? E.g., of the links it lets through, how many are genuinely non-spam?
- Recall – how much of the target set was passed through the filter? If we started with 1,000 non-spam links and 980 of them made it through, then the recall is 98%.
For both of these metrics, higher numbers are better. A perfect score (which is not attainable) is 100%.
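As a rough illustration of how these two benchmarks are computed, here is a minimal Python sketch. It treats a non-spam link as the "positive" case, matching the recall example above. The function names and the count of 20 spam links slipping through are hypothetical, chosen only to make the arithmetic concrete.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of links the filter let through that really are non-spam."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Share of all non-spam links that made it through the filter."""
    return true_pos / (true_pos + false_neg)


# Recall example from the text: 1,000 non-spam links, 980 pass the filter.
passed_good = 980            # true positives
blocked_good = 1000 - 980    # false negatives (good links wrongly flagged)
passed_spam = 20             # false positives (hypothetical count)

print(f"recall    = {recall(passed_good, blocked_good):.1%}")
print(f"precision = {precision(passed_good, passed_spam):.1%}")
```

The two numbers pull in different directions: a stricter filter usually raises precision while costing recall, which is why both are tracked.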
QBot 1.0 scored 83.1% on Precision and 64.5% on Recall. Not bad. To put it in perspective, QBot was performing somewhat better than a trained expert. Given that it could process millions (billions if you count preprocessing) of pages a day, this advantage was significant indeed. After all, who has the time or resources to pursue a list of 300,000 backlinks (and that’s a “small” list according to some backlink vendors)?
Those numbers also indicate that the original QBot runs on the aggressive side. At its default setting, we found that it would reduce a pre-filtered link set by 43%. So for every nine spam links identified, five legitimate pages would be incorrectly flagged. Friendly fire. We don’t like that.
Meet the General: GoorooAI
Our latest algorithm is based upon the GoorooAI machine learning platform. GoorooAI is a set of algorithms we use internally for tasks as varied as advertiser vertical assignment, language detection, and so on.
GoorooAI is a metamodeler. It runs millions of model permutations in conjunction with each other to determine effective techniques for identifying spam. The original QBot factors into some of these models. If QBot is the soldier, then GoorooAI is the general.
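To make the idea of a metamodeler a little more tangible, here is a toy Python sketch. It is not GoorooAI's actual method, which is not described in this post. It simply tries every combination of a few candidate spam rules, lets each combination decide by majority vote, and keeps whichever combination scores best on a small labeled sample. The pages, the rules, and the hidden_text feature are all made up for illustration.

```python
from itertools import combinations

# Toy labeled pages: (features, is_spam). Entirely made-up illustration data.
PAGES = [
    ({"tld": "biz", "outdegree": 45, "hidden_text": False}, True),
    ({"tld": "info", "outdegree": 8, "hidden_text": True}, True),
    ({"tld": "com", "outdegree": 60, "hidden_text": True}, True),
    ({"tld": "org", "outdegree": 5, "hidden_text": False}, False),
    ({"tld": "edu", "outdegree": 12, "hidden_text": False}, False),
    ({"tld": "com", "outdegree": 3, "hidden_text": False}, False),
]

# Hypothetical candidate models: each returns True when it suspects spam.
MODELS = {
    "risky_tld": lambda p: p["tld"] in {"biz", "info"},
    "high_outdegree": lambda p: p["outdegree"] > 20,
    "hidden_text": lambda p: p["hidden_text"],
}


def accuracy(names):
    """Score a combination: a page is flagged if most of its models flag it."""
    correct = 0
    for features, is_spam in PAGES:
        votes = sum(MODELS[name](features) for name in names)
        predicted = votes * 2 > len(names)  # strict majority
        correct += predicted == is_spam
    return correct / len(PAGES)


def best_combination():
    """Try every non-empty combination of models and keep the best scorer."""
    candidates = [
        combo
        for size in range(1, len(MODELS) + 1)
        for combo in combinations(MODELS, size)
    ]
    return max(candidates, key=accuracy)


winner = best_combination()
print(winner, accuracy(winner))
```

On this toy sample each rule alone gets one page wrong, while the three rules voting together get every page right. That is the general's advantage in miniature: no single soldier sees the whole field.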
How does it compare?
Very favorably. QBot 2.0 scores 97.6% Precision and 96.7% Recall. For every 28 spam links it removes, there's only one legitimate page misclassified. Overall, it allows 32% more pages through and makes very few errors.
Thinking Like a Search Engine
Webmasters are frustrated by the engines because they lack transparency. Matt Cutts has said several times that they need to be sensitive to the fact that webspam detection algorithms can be gamed. Fair enough.
However, I would add to this another intriguing possibility: modern webspam detection techniques are so sophisticated that they defy most attempts to easily verbalize the rules of engagement.
Peter Norvig once stated that "PageRank is overhyped." We know that PR is a component of both the webspam and ranking algorithms at Google, but at the end of the day it is a relatively small one. There are many flavors of PageRank, and a few years ago Google was using at least 8 of them (I personally believe that number has increased since).
We all crave simple, direct answers to our questions. But what GoorooAI has taught us is that there are usually no simple ones.
Allow me to illustrate this with a few examples.
Does TLD Matter?
Let’s start with a simple one: does the TLD (top-level domain) give us any indication of the likelihood of spam? You betcha!
You could devise an entire SEO strategy around this one simple chart:
- Procure links on .gov, .us, and .edu TLDs
- Avoid links on .biz, .info, and .jp TLDs
If you know only one rule about link building, this one should be it. However, this strategy has limited application (it’s not that easy to get links on government sites) and it also suffers from the fact that you’re going to miss out on a lot of potential backlinks. Not all .jp sites are spam, and not all .edu pages are trustworthy.
For this reason, the TLD of a potential backlink is of only secondary importance to most link builders. But not for GoorooAI, because it can dynamically adjust to this incredibly valuable datapoint. For instance, it often decides to be tougher on .ru links and to be more lenient on .org links. And it does so with a mathematical precision that even the most skilled link builders cannot approach.
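One simple way to picture "tougher on .ru, more lenient on .org" is a rejection cutoff that depends on the TLD. The Python sketch below is only an illustration of that idea. The cutoff values, the function names, and the notion of a single spam score between 0 and 1 are assumptions, not a description of how GoorooAI works internally.

```python
from urllib.parse import urlsplit

# Hypothetical per-TLD cutoffs: a link is rejected when its spam score
# reaches the cutoff, so a LOWER cutoff means a TOUGHER filter.
DEFAULT_CUTOFF = 0.50
TLD_CUTOFFS = {
    "ru": 0.35,   # tougher, as described in the text
    "org": 0.65,  # more lenient, as described in the text
}


def tld_of(url: str) -> str:
    """Return the last label of the URL's hostname, lowercased."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def is_rejected(url: str, spam_score: float) -> bool:
    """Apply the cutoff that matches the link's TLD."""
    cutoff = TLD_CUTOFFS.get(tld_of(url), DEFAULT_CUTOFF)
    return spam_score >= cutoff


# The same spam score leads to different decisions depending on the TLD.
for url in ("http://example.ru/page", "http://example.org/page",
            "http://example.com/page"):
    print(url, is_rejected(url, spam_score=0.45))
```

With a score of 0.45, only the .ru link is rejected here. In a learned system the cutoffs would come from the data rather than being typed in by hand.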
Most of us (especially as we get older… speaking from experience here) tend to learn once and then get set in our ways. From then on, many of us resist learning from new information.
Machine learning doesn’t suffer from this problem. Here’s a neat example of this. Let’s look at the correlation between spam and title length as it existed in 2005:
In this chart, the columns show the distribution of pages while the pink line shows the probability of a particular page being spam (a “true positive”). Back then, once the title exceeded 30 words, the likelihood of spam increased greatly.
Fast forward to today:
This chart illustrates that the correlation between spam and title length has fallen apart. Today, there’s little you can conclude about a page based solely on the number of words in the Title tag. GoorooAI knows this and puts the proper weight on this variable (essentially none).
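For readers who want to run this kind of check on their own labeled pages, here is a minimal Python sketch of the numbers behind a chart like the ones described above: the share of pages in each title-length bucket (the columns) and the spam rate within that bucket (the line). The function name, the bucket size of 10 words, and the four sample pages are assumptions for illustration only.

```python
from collections import defaultdict


def spam_rate_by_title_length(pages, bucket_size=10):
    """pages: iterable of (title, is_spam) pairs.

    Returns {bucket_start: (share_of_pages, spam_rate)}, where bucket_start
    is the lowest word count in the bucket (0, 10, 20, ...).
    """
    totals = defaultdict(int)
    spam = defaultdict(int)
    for title, is_spam in pages:
        bucket = (len(title.split()) // bucket_size) * bucket_size
        totals[bucket] += 1
        spam[bucket] += int(is_spam)
    n = sum(totals.values())
    return {b: (totals[b] / n, spam[b] / totals[b]) for b in sorted(totals)}


# Made-up sample: two short titles, one 12-word title, one 35-word title.
sample = [
    ("Cheap widgets online", False),
    ("Home", False),
    ("a b c d e f g h i j k l", False),
    ("buy " * 35, True),
]
for bucket, (share, rate) in spam_rate_by_title_length(sample).items():
    print(f"{bucket}+ words: {share:.0%} of pages, {rate:.0%} spam")
```

Re-running a tally like this on fresh data is how a correlation that has fallen apart would show up: the spam rate stops climbing with title length.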
Another example of a well-studied factor which influences page reputation is Outdegree – the number of unique hosts linked to by a particular page:
Once a page links to more than 20 domains, the likelihood of its being spam increases greatly. GoorooAI responds to this by subjecting the page to an increasingly strict webspam detection model.
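Here is a small Python sketch of both halves of that idea: counting a page's outdegree from its links, and tightening a rejection cutoff once the count passes 20. Ignoring links to the page's own host is an assumption on my part, and the spam_cutoff rule (its base, step, and floor values) is purely hypothetical, not GoorooAI's actual model.

```python
from urllib.parse import urljoin, urlsplit

OUTDEGREE_LIMIT = 20  # the "more than 20 domains" mark mentioned in the text


def outdegree(page_url: str, hrefs: list[str]) -> int:
    """Count unique hosts a page links to, ignoring links to its own host."""
    own_host = urlsplit(page_url).hostname
    hosts = set()
    for href in hrefs:
        # Resolve relative links against the page URL before reading the host.
        host = urlsplit(urljoin(page_url, href)).hostname
        if host and host != own_host:
            hosts.add(host)
    return len(hosts)


def spam_cutoff(degree: int, base: float = 0.50, step: float = 0.01,
                floor: float = 0.20) -> float:
    """Hypothetical rule: lower the rejection cutoff (stricter) past the limit."""
    excess = max(0, degree - OUTDEGREE_LIMIT)
    return max(floor, base - step * excess)


links = [
    "/about",                      # same host, not counted
    "http://example.org/",
    "http://example.org/page2",    # same host as the previous link
    "https://example.net/x",
    "mailto:someone@example.com",  # no host, not counted
]
print(outdegree("http://example.com/links", links))
print(spam_cutoff(20), spam_cutoff(30), spam_cutoff(100))
```

Note that third-party tracking scripts and widgets add hosts too, so a real count would have to look beyond plain anchor links.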
Amazingly, some marketers still ignore the impact of outdegree. Most directories, link exchange pages, and even legitimate corporate sites which use too many third-party tracking scripts tend to be casualties of this variable.
Where Are the SEOs' Yachts?
I know SEOs who still live by the rules they learned a few years ago and haven’t adapted to the ever-changing Internet landscape. I see sites which try to figure out how the algorithms work by having SEOs vote on what ranking factors they believe are most important. And of course, there is no shortage of sites which generate metric after endless metric backed up by years of SEO folklore.
Computers are running the show now, folks. You're not going to keep up with "X-Rank", counting keyword occurrences, or whatever the metric-du-jour is. Don’t bring a dinosaur bone to a gun fight.
Stay flexible, keep your link profile clean, and focus on where the customers are and you won’t go wrong.