Raymond Castleberry Blog: RankBrain Unleashed

Posted by gfiorelli1

Disclaimer: Much of what you're about to read is based on personal opinion. It is a thorough reflection on RankBrain, to be sure, but still a personal one: it doesn't claim to be correct, and certainly not "definitive." Its aim is to make you ponder the evolution of Google.

Introduction

Whenever Google announces something as important as a new algorithm, I always try to hold off on writing about it immediately, to let the dust settle, digest the news and the posts that talk about it, investigate, and then, finally, draw conclusions.

I did so in the case of Hummingbird. I do it now for RankBrain.

In the case of RankBrain, this caution is even more warranted, because (let's be honest) we know next to nothing about how RankBrain works. The only things Google has said publicly are in the video Bloomberg published and in the few comments unnamed Googlers gave Danny Sullivan for his article, FAQ: All About The New Google RankBrain Algorithm.

Dissecting the sources

As I said before, the only direct source we have is the video interview published on Bloomberg.

So, let's dissect that video and what Greg Corrado — senior research scientist at Google and one of the founding members and co-technical lead of Google's large-scale deep neural networks project — said.

RankBrain is already worldwide.

I wanted to say this first: If you're wondering whether or not RankBrain is already affecting the SERPs in your country, now you know — it is.

RankBrain is Artificial Intelligence.

Does this mean that RankBrain is our first evidence of Google as the Star Trek computer? No, it does not.

It's true that many Googlers (like Peter Norvig, Corinna Cortes, Mehryar Mohri, Yoram Singer, Thomas Dean, Jeff Dean and many others) have been investigating and working on machine/deep learning and AI for a number of years (since 2001, as you can see when scrolling down this page). It's equally true that much of Google's work on language, speech, translation, and visual processing relies on machine learning and AI. However, we should consider the topic of ANI (Artificial Narrow Intelligence), which Tim Urban of Wait But Why describes as: "Machine intelligence that equals or exceeds human intelligence or efficiency at a specific thing."

Considering how Google is still buggy, we could have some fun and call it HANI (Hopefully Artificial Narrow Intelligence).

All jokes aside, Google clearly intends for its search engine to be an ANI in the (near) future.

RankBrain is a learning system.

With the term "learning system," Greg Corrado surely means "machine learning system."

Machine learning is not new to Google. We SEOs discovered how Google uses machine learning when Panda rolled out in 2011.

Panda, in fact, is a machine learning-based algorithm able to learn through iterations what a "quality website" is — or isn't.

In order to train itself, it needs a dataset and yes/no factors, that is, examples already judged as "quality" or "not quality." The result is an algorithm that is eventually able to achieve its objective.

Iterations, then, are meant to provide the machine with a constant learning process, in order to refine and optimize the algorithm.
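To make the idea of learning from yes/no examples through iterations more concrete, here is a minimal Python sketch. It is not Panda and not anything Google has published: the two features (`ad_ratio`, `originality`), the toy dataset, and the choice of a simple perceptron are all assumptions made purely for illustration.

```python
# Toy perceptron that learns a yes/no "quality" label from labeled examples.
# Features, data, and learning rule are hypothetical; this is NOT Panda.

def predict(weights, bias, features):
    """Return 1 ("quality") if the weighted score is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train(dataset, iterations=20, learning_rate=0.1):
    """Refine the weights a little every time a prediction is wrong."""
    weights = [0.0] * len(dataset[0][0])
    bias = 0.0
    for _ in range(iterations):
        for features, label in dataset:
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias

# (ad_ratio, originality) -> 1 = quality site, 0 = not a quality site
DATASET = [
    ((0.9, 0.1), 0),
    ((0.8, 0.2), 0),
    ((0.2, 0.9), 1),
    ((0.1, 0.8), 1),
]

weights, bias = train(DATASET)
```

After training, `predict(weights, bias, (0.15, 0.95))` classifies a site the model has never seen. Each pass over the dataset corrects the weights only where the current guess is wrong, which is the "constant learning process" in miniature.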

Hundreds of people are working on it, and on building computers that can think by themselves.

Uhhhh... (Sorry, I couldn't resist.)

RankBrain is a machine learning system, but — from what Greg Corrado said in the video — we can infer that in the future, it will probably be a deep learning one.

We do not know when this transition will happen (if ever), but assuming it does, then RankBrain won't need any input — it will only need a dataset, over which it will apply its learning process in order to generate and then refine its algorithm.

Rand Fishkin visualized in a very simple but correct way what a deep learning process is:

Remember — and I repeat this so there's no misunderstanding — RankBrain is not (yet) a deep learning system, because it still needs inputs in order to work. So... how does it work?

It interprets languages and interprets queries.

Paraphrasing the Bloomberg interview, Greg Corrado gave this information about how RankBrain works:

It kicks in when people make ambiguous searches or use colloquial terms, and it tries to solve a classic breakdown: computers fail on queries they don't understand or have never seen before.

We can consider RankBrain to be the first 100% post-Hummingbird algorithm developed by Google.

Even if we had some new algorithms rolling out after the Hummingbird release (e.g. Quality Update), those were based on pre-Hummingbird algos and/or were serving a very different phase of search (the Filter/Clustering and Ranking ones, specifically).

Credit: Enrico Altavilla

RankBrain seems to be a needed "patch" to the general Hummingbird update. In fact, we should remember that Hummingbird itself was meant to help Google understand “verbose queries.”

However, as Danny Sullivan wrote in the above-mentioned FAQ article at Search Engine Land, RankBrain is not a sort of Hummingbird v.2, but rather a new algorithm that "optimizes" the Hummingbird work.

If you look at the image above while reading Greg Corrado's words, you can say with a high degree of confidence that RankBrain acts in between the "Understanding" and the "Retrieving" phases of the overall search process.

Evidently, the too-ambiguous queries and the ones based on colloquialisms were too hard for Hummingbird to understand — so much so, in fact, that Google needed to create RankBrain.

RankBrain, like Hummingbird, generalizes and rewrites those kinds of queries, trying to match the intent behind them.

In order to understand a never-before-seen or unclear query, RankBrain uses vectors, which are — to quote the Bloomberg article — "vast amounts of written language embedded into mathematical entities," and it tries to see if those vectors may have a meaning in relation to the query it's trying to answer.
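As a rough illustration of how vectors can relate an unseen query to a familiar one, here is a minimal Python sketch. Nothing in it comes from Google: the three-dimensional word vectors are invented by hand, a query is represented as the plain average of its word vectors, and cosine similarity is simply a common way to compare vectors that I am assuming here.

```python
import math

# Hypothetical hand-made word vectors (real systems learn hundreds of
# dimensions from vast amounts of text; these values are invented).
WORD_VECTORS = {
    "cheap":   (0.9, 0.1, 0.0),
    "budget":  (0.8, 0.2, 0.1),
    "flights": (0.1, 0.9, 0.1),
    "airfare": (0.2, 0.8, 0.0),
    "recipe":  (0.0, 0.1, 0.9),
    "pasta":   (0.1, 0.0, 0.8),
}

def embed(query):
    """Average the vectors of the known words in a query (None if no word is known)."""
    vectors = [WORD_VECTORS[w] for w in query.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def closest_known_query(new_query, known_queries):
    """Return the already-understood query whose vector is nearest to the new one."""
    target = embed(new_query)
    if target is None:
        return None
    return max(known_queries, key=lambda q: cosine(target, embed(q)))

KNOWN_QUERIES = ["cheap flights", "pasta recipe"]
best = closest_known_query("budget airfare", KNOWN_QUERIES)
```

Here "budget airfare" shares no words with "cheap flights," yet the two end up close together in the toy vector space, so `best` is "cheap flights." That is the intuition behind rewriting a never-before-seen query into one the engine already understands.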

Vectors, though, don't seem to be a completely new feature in the general Hummingbird algorithm. We have evidence of a very similar thing in 2013 via Matt Cutts himself, as you can see from the Twitter conversation below:

At that time, Google was still a long way from being perfect.

Upon discovering web documents that may answer the query, RankBrain retrieves them and lets them proceed, following the steps of the search phase until those documents are presented in a visible SERP.

It is within this context that we must accept the definition of RankBrain as a "ranking factor," because with regard to the specific set of queries treated by RankBrain, this is substantially true.

In other words, the more RankBrain considers a web document to be a potentially correct answer to an unknown or not understandable query, the higher that document will rank in the corresponding SERP — while still taking into account the other applicable ranking factors.

Of course, it will be the choice of the searcher that ultimately informs Google as to what the answer to that unclear or unknown query is.

As a final note, necessary in order to head off the claims I saw when Hummingbird rolled out: No, your site did not lose visibility because of a mysterious RankBrain penalty.

Dismantling the RankBrain gears

Kristine Schachinger, a wonderful SEO geek whom I hold in deep esteem, relates RankBrain to Knowledge Graph and Entity Search in this article on Search Engine Land. However — while I'm in agreement that RankBrain is a patch of Hummingbird and that Hummingbird is not yet the "semantic search" Google announced — our opinions do differ on a few points.

I do not consider Hummingbird and Knowledge Graph to be the same thing. They surely share the same mission (moving from strings to things), and Hummingbird uses some of the technology behind Knowledge Graph, but still — they are two separate things.

This is, IMHO, a common misunderstanding among SEOs. So much so, in fact, that I tend not to consider even the Featured Snippets (aka the answer boxes) part of Knowledge Graph itself, as is commonly believed.

Therefore, if Hummingbird is not the same as Knowledge Graph, then we should think of entities not only as named entities (people, concepts like "love," planets, landmarks, brands), but also as search entities, which are quite different altogether.

Search entities, as described by Bill Slawski, are as follows:

A query a searcher submits
Documents responsive to the query
The search session during which the searcher submits the query
The time at which the query is submitted
Advertisements presented in response to the query
Anchor text in a link in a document
The domain associated with a document

The relationships between these search entities can create a "probability score," which may determine if a web document is shown in a determined SERP or not.
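One very simple way to picture such a probability score is as a relative frequency computed over observed relationships between search entities. The Python sketch below is only a toy built on assumptions: the log of (query, document) pairs, the example URLs, and the `probability_scores` function are hypothetical and are not taken from the patents Bill Slawski describes.

```python
from collections import Counter

# Hypothetical log relating two search entities: a query and the
# document a searcher selected for it. All values are invented.
SELECTION_LOG = [
    ("jaguar speed", "wildcats.example/jaguar"),
    ("jaguar speed", "wildcats.example/jaguar"),
    ("jaguar speed", "wildcats.example/jaguar"),
    ("jaguar speed", "cars.example/jaguar-xf"),
    ("pasta recipe", "food.example/carbonara"),
]

def probability_scores(log, query):
    """Estimate, per document, the share of selections it received for a query."""
    counts = Counter(doc for q, doc in log if q == query)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {doc: n / total for doc, n in counts.items()}

scores = probability_scores(SELECTION_LOG, "jaguar speed")
```

In this toy log the wildcat page gets a score of 0.75 and the car page 0.25 for "jaguar speed." A real system would presumably combine many more entity relationships (session, time, anchor text, domain) than this single query-document pair.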

We cannot rule out the possibility that RankBrain utilizes search entities in order to find the most probable and correct answers to a never-before-seen query, then uses the probability score as a qualitative metric in order to offer reasonable, substantive SERPs to the querying user.

The biggest advancement with RankBrain, though, is in how it deals with the quantity of content it analyzes in order to create the vectors. That quantity seems far larger than the classic "link anchor text and surrounding text" that we always considered when discussing, for instance, how the Link Graph works.

There is a patent filed by Google that lists Thomas Strohmann, one of the AI experts named by Greg Corrado, as an author.

That patent, very well explained (again) by Bill Slawski in this post on Gofishdigital.com, describes a process through which Google can discover potential meanings for non-understandable queries.

The patent attributes huge importance to context and "concepts," which fits with the fact that RankBrain uses vectors (again, "vast amounts of written language embedded into mathematical entities"). Those vectors are likely needed to raise the probability of understanding context and detecting already-known concepts, and therefore the probability of correctly matching the unknown concepts in the query RankBrain is trying to understand.

Speculating about RankBrain

As the section title says, I now enter the most speculative part of this post.

What I wrote before, though it may also be considered speculation, has the distinct possibility of being true. What I am going to write now may or may not be true, so please, take it with a grain of salt.

DeepMind and Google Search

In 2014, Google acquired DeepMind, a company specializing in learning systems. I cannot help but think that some of its technology, and the evolutions of that technology, are used by Google to improve its search algorithm, including the machine learning process of RankBrain.

This article, published last June on technologyreview.com, explains in detail how the lack of a correctly formatted database is the biggest obstacle to a correct machine and deep learning process. Without it, the neural computing (which is behind machine and deep learning) cannot work.

In the case of language, then, having "vast amounts of written language" is not enough if there's no context, especially when n-grams are not used within the search to help the machine understand it.

However, Karl Moritz Hermann and some of his DeepMind colleagues described in this paper how they were able to discover the kind of annotations they were looking for in classic "news highlights," which are independent from the main news body.

Allow me to quote the Technology Review article in explaining their experiment:

Hermann and co anonymize the dataset by replacing the actors in sentences with a generic description. An example of some original text from the Daily Mail is this: The BBC producer allegedly struck by Jeremy Clarkson will not press charges against the "Top Gear" host, his lawyer said Friday. Clarkson, who hosted one of the most-watched television shows in the world, was dropped by the BBC Wednesday after an internal investigation by the British broadcaster found he had subjected producer Oisin Tymon "to an unprovoked physical and verbal attack."

An anonymized version of this text would be the following:

The ent381 producer allegedly struck by ent212 will not press charges against the “ent153” host, his lawyer said friday. ent212, who hosted one of the most - watched television shows in the world, was dropped by the ent381 wednesday after an internal investigation by the ent180 broadcaster found he had subjected producer ent193 "to an unprovoked physical and verbal attack."

In this way it is possible to convert the following Cloze-type query to identify X from “Producer X will not press charges against Jeremy Clarkson, his lawyer says” to “Producer X will not press charges against ent212, his lawyer says.”

And the required answer changes from “Oisin Tymon” to “ent193.”

In that way, the anonymized actor is only possible to identify with some kind of understanding of the grammatical links and causal relationships between the entities in the story.
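The anonymization step quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entity list is supplied by hand (the researchers detected entities automatically), the function names `anonymize` and `make_cloze` are hypothetical, and the entity markers are copied from the anonymized passage, where the producer appears as ent193 and Jeremy Clarkson as ent212.

```python
# Entity markers as they appear in the anonymized passage quoted above.
# The hand-written mapping stands in for automatic entity detection.
ENTITY_IDS = {
    "Jeremy Clarkson": "ent212",
    "Oisin Tymon": "ent193",
    "Top Gear": "ent153",
    "BBC": "ent381",
}

def anonymize(text, entity_ids):
    """Replace every known entity name with its generic marker."""
    # Longer names go first so a short name cannot clobber part of a longer one.
    for name in sorted(entity_ids, key=len, reverse=True):
        text = text.replace(name, entity_ids[name])
    return text

def make_cloze(sentence, answer, entity_ids):
    """Blank out the answer with X, then anonymize the rest of the sentence.

    Returns the anonymized Cloze query and the anonymized answer.
    """
    query = sentence.replace(answer, "X")
    return anonymize(query, entity_ids), entity_ids[answer]

query, answer = make_cloze(
    "Producer Oisin Tymon will not press charges against Jeremy Clarkson, "
    "his lawyer says",
    "Oisin Tymon",
    ENTITY_IDS,
)
```

With this input, `query` becomes "Producer X will not press charges against ent212, his lawyer says" and `answer` becomes "ent193." Because the names are gone, a model cannot answer from prior knowledge about Jeremy Clarkson; it has to rely on the relationships stated in the text.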

Using the Daily Mail, Hermann was able to provide a large, useful dataset to the DeepMind deep learning machine, and thus train it. After the training, the computer was able to correctly answer up to 60% of the questions asked.

Not a great percentage, we might be thinking. Besides, not all documents on the web are presented with the kind of highlights the Daily Mail or CNN sites have.

However, let me speculate: What are the search index and the Knowledge Graph if not a giant, annotated database? Would it be possible for Google to train its neural machine learning computing systems using the same technology DeepMind used with the Daily Mail-based database?

And what if Google were experimenting and using the Quantum Computer it shares with NASA and USRA for these kinds of machine learning tasks?

Or... What if Google were using all the computers in all of its data centers as one unique neural computing system?

I know, science fiction, but...

Ray Kurzweil's vision

Ray Kurzweil is usually known for the "futurist" facets of his credentials. It's easy for us to forget that he's been working at Google since 2012, personally hired by Larry Page "to bring natural language understanding to Google." Natural language understanding is essential both for RankBrain and for Hummingbird to work properly.

In an interview with The Guardian last year, Ray Kurzweil said:

When you write an article you're not creating an interesting collection of words. You have something to say and Google is devoted to intelligently organising and processing the world's information. The message in your article is information, and the computers are not picking up on that. So we would like to actually have the computers read. We want them to read everything on the web and every page of every book, then be able to engage an intelligent dialogue with the user to be able to answer their questions.

The DeepMind technology I cited above seems to be going in that direction, even though it's still an immature technology.

The biggest problem, though, is not really being able to read billions of documents, because Google is already doing it (go read the EULA of Gmail, for instance). The biggest problem is understanding the implicit meaning within the words, so that Google may properly answer users' questions, or even anticipate the answers before the questions are asked.

We know that Google is hard at work to achieve this, because Kurzweil himself said so in the same interview:

"We are going to actually encode that, really try to teach it to understand the meaning of what these documents are saying."

The vectors used by RankBrain may be our first glimpse of the technology Google will end up using for understanding all context, which is fundamental for giving a meaning to language.

How can we optimize for RankBrain?

I'm sure you're asking this question.

My answer? This is a useless question, because RankBrain targets non-understandable queries and those using colloquialisms. Therefore, just as it's not very useful to create specific pages for every single long-tail keyword, it's even less useful to try targeting the queries RankBrain targets.

What we should do is insist on optimizing our content using semantic SEO practices, in order to help Google understand the context of our content and the meaning behind the concepts and entities we are writing about.

What we should also do is treat the factors of personalized search as priorities, because search entities are strictly related to personalization. From this perspective, branding is surely a strategy that may have a positive correlation with how RankBrain and Hummingbird interpret and classify web documents and their content.

RankBrain, then, may not mean that much for our daily SEO activities, but it is offering us a glimpse of the future to come.



Tuesday, November 24, 2015
