Monday, June 21, 2010

Stone Temple Consulting (STC) Articles and Interviews on SEO Topics



Shashi Seth on the Future of Yahoo! Search

Posted: 20 Jun 2010 02:36 AM PDT

Published: June 20, 2010

Shashi Seth is the Senior VP of Yahoo! Search Products. In this role he oversees the future strategy of search at Yahoo!. Prior to Yahoo!, he worked as the Sr. VP of Global Ad Products at AOL Time Warner. Prior to AOL he was the Chief Revenue Officer at Cooliris, and before that he headed up the efforts to monetize YouTube.

Interview Transcript

Eric Enge: Can you tell us a bit about where things are with the Yahoo! - Bing transaction?

Shashi Seth: Since the Microsoft transition approval was announced this February, we have been busy working on the integration. And, that project is well underway with lots of internal testing going on as we speak, and we have been testing their index and ranking in our environment for a couple of weeks now. Things are looking good, and our goal is to be able to complete the migration sometime before the holidays.

Eric Enge: From what I've seen it looks like a phased rollout.

Shashi Seth: It is being done in two phases. Our first priority is to make sure that the algorithmic side transitions over with quality. We have a set of metrics in place with Microsoft to assess what it takes to migrate over a certain country and what the quality measures are.

The testing that has begun looks at those metrics carefully to tweak the different elements and parameters. Once we are comfortable on both sides, we will pull the trigger and do the transition. Soon after that we would do the same for the sponsored search side. These efforts are happening in parallel, with a separate dedicated team for each. They are working hard on the task at hand, but the goal was never to push both transitions out on exactly the same day, because that would be a very big undertaking. Our aim is to accomplish all this before the holiday season.

Eric Enge: After the transition, if someone does an arbitrary query in Bing and the same query in Yahoo, would the search results, in principle, be identical, or are you doing some fine tuning to make them different?

Shashi Seth: The index and ranking coverage are going to be identical. What we left open in the deal terms was what Yahoo can and cannot do with both the algorithmic and the sponsored side. Yahoo has the flexibility not only to utilize data sources from places like Twitter or our own properties, but we can also build what we call shortcuts, and other elements, and trigger them for appropriate search terms.

We have a lot of room to develop unique user experiences on Yahoo. For some sets of queries, especially the long tail of queries, Bing and Yahoo will have identical results, though with completely different treatments. For queries with vertical intent, like local or shopping, we are going to not only bring our proprietary Yahoo content onto the search results page, but also pursue a different and separate strategy that aims to help people get an answer quickly without wading through the blue links. That requires a lot of data in our "look aside" indexes, which we call the "web of things". That means we source data, we extract and enrich it, and we use it to give deep insights to users for whatever they are looking for.

50% of all queries are going to receive experiences like that, but the other 50% will remain long tail and will not even trigger an ad or a shortcut. The benefit of this relationship with Microsoft is that Yahoo gets to focus on the front-end of search, look at other data sources and explore different paths, while Microsoft does the heavy lifting on the backend side.

Eric Enge: It gives Yahoo time to develop a position on what will be the higher value ad aspects of a search experience in the future.

Shashi Seth: Exactly. Essentially the backend is rapidly becoming a commodity, and we don't need to be in that business. The analogy is if we were in the car business, we've outsourced the engine and are going to focus on the entire user experience for the car. Part of that strategy is determining how to win in this space, and part of the strategy is taking advantage of the fact that we have 600 million users worldwide.

In many countries and regions, such as the US, we have 80% to 85% penetration of the internet population. We have 170 million users in the US and a subset of roughly 80 million of those users use Yahoo to search. That means there are 90 million users just in the US that we can go after who are spending a considerable amount of time on Yahoo properties.

Yahoo Audience Growth


The goal is to get in front of them with compelling experiences that help their browsing and content discovery, turn those moments into searches, bring them over to the search results page, give them a great experience, and over time turn them into active and engaged Yahoo searchers.

Yahoo getting new users


Eric Enge: You could say that this deal had nothing to do with Yahoo exiting the search business, but actually repositioning itself in a stronger leadership role.

Shashi Seth: That's exactly what we are saying. We believe the user base is changing significantly. Where people are spending their time, how they are discovering content and the content they are engaging with is changing so rapidly that search needs to evolve with it. Instead of waiting for users to come to a search results page to enter a query, we need to be proactive in fulfilling their content discovery needs.

The biggest problem is all searchers have limited time. The amount of content on the web is exploding to the point where nobody even knows what exists out there. Search is going to evolve into a discovery engine to get in front of users where they are spending time and solve their core needs day-in and day-out.

Eric Enge: So essentially, by leveraging what you learn about users across all your properties, you can do a better job of anticipating their next need.

Shashi Seth: Exactly. We have started doing a lot of that. Yahoo News has contextual slide shows that not only target the user and their interests, but also use search as the backend technology to generate those slide shows.

We also have contextual shortcuts, which underline terms that are interesting to the user. They are shown appropriately on various content pages so that when people hover over one for three seconds, they get a shortened search results page on top of the content page. If the user engages with it, that's good; if not, it goes away.

We do something similar on many of our homepages with what we call Trending Now modules. A module basically looks at the topic of a page and puts all the trending searches on top so that people can find all the trending topics in that space. These are a few of the ways that we are starting to engage our users to bring them over to search and give them a great experience. If we can do that really well a couple of times, users will start thinking of us as the search destination to go to.

In the last two months, there was a slight uptick in our comScore numbers for the first time in 18 months. Then last month (April) the numbers shot up by a full percentage point, because the amount of activity we can generate from 600 million users is pretty large. That shows the power of our audience base and why tapping into that resource makes a lot of sense.

Eric Enge: 1% is a pretty significant move in this game.

Shashi Seth: An interesting aspect of market share reporting is that some people carve out context-driven search efforts as a different number than traditional searches. As long as the user engages with the experience intentionally and gets search results, we believe that how it is generated shouldn't matter.

Eric Enge: Going back to the notion of discovery and fulfilling that need. It's not as simple as going to a specific URL, typing in a query in a specific format and getting results. It seems like it doesn't have to be that standardized, that it could be much more distributed across various formats for interaction.

Shashi Seth: Exactly, and that's how we are going to be successful. Now, it's up to the industry to get together and decide what they count as searches across the board. Today what happens is comScore calls out these content-driven searches separately from people going to the search homepage and entering a query and hitting enter. While that is interesting, we don't think that is where the industry is headed. The industry is changing significantly and needs a different way to measure overall search that is equitable for all competitors.

Eric Enge: Can you speak to the presentation that you did for your Investor Day?

Shashi Seth: That presentation was largely focused on how we see the search industry changing, how we see our users changing, and how we think search results are going to change. The work is already underway. The underlying driver is that there is so much data out there that either doesn't exist on web pages altogether or it exists on web pages but is so deeply embedded that it is hard for anybody to extract and to assimilate it.

Search Must Evolve


For example, if someone is looking for which actors were in a certain movie, that is a fairly easy task for search engines. It becomes harder to look for slightly more complex information, such as the names of all the other actors that have worked with a certain actor. It is slightly more difficult still to find the name of the director that has directed an actor most often, and even more arduous to find which movies brought the most money and fame to a given person.

Eric Enge: Now, you are talking structured databases.

Shashi Seth: Exactly. This information is probably embedded in snippets over hundreds of thousands, or millions of documents. If somebody went to a search engine for that information, they could eventually find the answer, but people should not have to do that amount of work, given our short attention spans and limited time.

The onus is on search engines to extract this information and put it in their repository in a way that can be useful for users. This is going to be critically important, and is what we call the web of things. Another example is finding dishes or menu items in restaurants rather than looking for restaurants. Instead of looking for a Greek restaurant in Palo Alto, California, someone could look for any restaurant in Palo Alto, California that has mango cheesecake on their menu.

That starts to change how people search, and the landscape of search, completely. That's one big effort we have underway; we launched the Menu Item Finder feature as one of the first steps in that direction, but there is so much more that can be done.
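To make the "web of things" idea concrete, here is a minimal sketch of answering a query directly from a structured "look aside" index rather than from matched documents. The data, field names, and restaurant names are invented for illustration; this is not Yahoo's system.

```python
# Toy "look aside" structured index: answer the question directly from
# extracted facts instead of returning links to documents.
restaurants = [
    {"name": "Taverna Alcyone", "city": "Palo Alto", "menu": ["moussaka", "baklava"]},
    {"name": "Cafe Meyer", "city": "Palo Alto", "menu": ["mango cheesecake", "espresso"]},
    {"name": "Bay Bistro", "city": "San Jose", "menu": ["mango cheesecake", "salmon"]},
]

def restaurants_with_dish(city, dish):
    """Return restaurants in a city whose extracted menu contains the dish."""
    return [r["name"] for r in restaurants
            if r["city"] == city and dish in r["menu"]]

print(restaurants_with_dish("Palo Alto", "mango cheesecake"))
# ['Cafe Meyer'] -- a direct answer, rather than ten blue links to menu pages
```

The same pattern covers the movie examples above: once the facts are extracted into structured records, "which director has directed this actor most often" is a simple aggregation instead of a research project.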

We are also focusing on monetizing search differently in respect to rich ad inserts, search assist tabs, and so forth. These are really important features that advertisers are looking for, because at the end of the day when an advertiser is looking to do a buy, all they care about is an audience.

New Ways to Monetize


Those audiences can likely be found in many different places, and that is OK with advertisers as long as they can do the same demographic or psychographic targeting. We have been investing in this space for about a year and we have seen a lot of movement from advertisers.

For consumers, another new area of investment is centered on getting in front of users in a contextual manner, getting them at the right place at the right time and presenting them with experiences that fulfill their needs. The last six or seven slides walk through one use case that we are working on and planning to deploy, which is much more integrated than search experiences today.

Eric Enge: The Napa Valley Restaurants example?

Napa Valley Result


Shashi Seth: Yes. People spend a great deal of time on email, their personal home page, and content properties like Finance and Sports. This behavior gives us a good sense of what interests them. By looking at their search history, their content engagement and similar information, it is possible to personalize and tailor experiences to them, which we do – while staying within our trusted privacy policy, of course.

For example, the Today modules on the Yahoo! homepage are targeted to our users; every user gets something different. Someone else's computer likely has different stories than mine would. We call that technology content optimization; it personalizes the type of content displayed to a given user.
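As a rough illustration of the content optimization idea, here is a deliberately simple sketch that scores stories by overlap with a user's inferred interest profile and shows the top ones. The interest weights, tags, and stories are made up; this is only a toy, not Yahoo's system.

```python
from collections import Counter

# Inferred interest weights for one user (hypothetical values).
user_interests = Counter({"finance": 5, "autos": 3, "golf": 1})

stories = {
    "Markets rally on earnings":    {"finance"},
    "Review: the new hybrid sedan": {"autos", "finance"},
    "Local bakery wins award":      {"food"},
}

def score(tags):
    """Sum of the user's interest weights over a story's topic tags."""
    return sum(user_interests.get(t, 0) for t in tags)

ranked = sorted(stories, key=lambda title: score(stories[title]), reverse=True)
print(ranked)  # two users with different profiles would see different orderings
```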

The technology exists, so now we need to do a really good job of getting in front of those users and offering them something that interests them. If they engage with it and have a compelling experience like the Napa Valley example, they will want to come back and do it again.

Tabbed Results from Yahoo


Eric Enge: You could argue that you are providing the results for four different searches from the old-fashioned world: reviews, menu, directions, and a way to share it, all put into a single answer to the original question.

Shashi Seth: That brings up a good point about how to measure that. Should it be counted as one search or four searches? The world is changing, and as we get better at providing answers to users' needs, the number of queries alone is never going to be a good enough measure of how good a job someone is doing. One could actually argue the opposite. If a user's needs can be answered in one query, or without them even asking a question, how is success measured?

Eric Enge: Maybe another way is to measure the average number of queries per session and if that goes down you are making progress.

Shashi Seth: Yes. User engagement or time spent on a search results page is another. A host of different measurements tell the story significantly better than just the number of queries being performed on a search engine.
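A small sketch of the kind of measurement discussed here: average queries per session and average time spent on the results page, computed from a toy query log. The log format and field names are hypothetical.

```python
from collections import defaultdict

log = [  # (session_id, query, seconds_on_results_page) -- invented data
    ("s1", "napa valley restaurants", 95),
    ("s1", "napa valley restaurant reviews", 40),
    ("s2", "digital cameras", 120),
]

sessions = defaultdict(list)
for session_id, query, dwell in log:
    sessions[session_id].append(dwell)

total_queries = sum(len(dwells) for dwells in sessions.values())
queries_per_session = total_queries / len(sessions)
avg_dwell = sum(d for dwells in sessions.values() for d in dwells) / total_queries

print(f"queries/session: {queries_per_session:.2f}, avg time on results: {avg_dwell:.0f}s")
```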

Eric Enge: Is Yahoo continuing to invest in mobile search too?

Shashi Seth: Absolutely. We believe that the next frontier of search is definitely in the mobile space. We already do really well in that space with nearly a hundred partnerships with various carriers and OEMs around the world. In recent months we've picked up two of the three carriers in Canada as partners. There are countries like Indonesia where we have 80% penetration in the mobile search space. We've done an amazing job with it, and today our volume of mobile search is pretty high. In the next five years, mobile search has the opportunity to become larger than web search.

Yahoo Mobile Search


For us, mobile search is not about a search box where someone enters a query and gets an answer. We gave people a peek into the future by creating an app called Sketch-a-Search, which we launched on the iPhone.

Essentially the user never has to type a query, and we don't even offer a keyboard in that scenario. It starts with a map, and the user simply points out an area that interests them. In San Francisco, they might be looking for restaurants near them. They simply draw a circle along the road, and we find all the restaurants and provide ways to filter the results so they can find exactly what they are looking for. When they find it, we have the contact information, the menu, and, where possible, images and reviews, and they never have to type a query.
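Here is a rough sketch of the geographic core of a Sketch-a-Search-style query. The user's drawn region is approximated as a circle (center plus radius), and we keep the places whose coordinates fall inside it. The place names, coordinates, and radius are invented; this is not Yahoo's implementation.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

places = [  # (name, latitude, longitude) -- example data
    ("Ferry Plaza Cafe", 37.7955, -122.3937),
    ("Mission Taqueria", 37.7599, -122.4148),
]

center, radius_km = (37.7946, -122.3999), 1.0   # the region the user circled
hits = [name for name, lat, lon in places
        if haversine_km(center[0], center[1], lat, lon) <= radius_km]
print(hits)   # only the places inside the drawn area, no typed query needed
```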

In the world of Smartphones, people are going to use apps as a proxy for search. They will look at the task at hand, and find an app that fits that need. It will become more vertical as people use it for shopping, local services, restaurants, points of interest, travel, music and so forth, which is quite different from web search. It is much more contextual and location driven.

Eric Enge: Despite all the hype about how great mobile devices are, the reality is that the keyboard experience just isn't the same. People need to have options for discovering information. What do you leverage in the discovery process to personalize the user experience on web or mobile search?

Shashi Seth: We already look at their interests and their demographics for advertising, which is not much different from what is needed to do content targeting or personalization.

Over a period of time, we can determine what someone's interests are. If they have been looking for a car on Yahoo Search for a week and going to Yahoo Autos and interacting with different modules and content across the network related to automobiles, we have a fair sense that they are interested in researching or purchasing a car. While protecting these details and any personal information, we can target them both from an advertising perspective as well as a content perspective.

A lot of our content optimization and content targeting personalization technologies straddle those worlds really well, so we are able to get to the users and create experiences that make sense for them. It has to be done subtly so as not to be alarming to them. We have learned that art over the years, and have already deployed a good amount of this technology. Of course we are always improving and making it better.

When we run a test of personalized, targeted content discovery versus non-targeted content discovery, the numbers speak for themselves. That is a hard problem that companies like Amazon and others also have to grapple with. That science is becoming better and better, and we are seeing the results improve significantly month over month. We have a lot of hope and a lot of excitement in that space.

Eric Enge: Thanks Shashi!

Shashi Seth: Thank you Eric!

Have comments or want to discuss? You can comment on the Shashi Seth interview here.




About the Author

Eric Enge is the President of Stone Temple Consulting. Eric is also a founder of Moving Traffic Incorporated, the publisher of Custom Search Guide, a directory of Google Custom Search Engines, and City Town Info, a site that provides information on 20,000 US Cities and Towns.

Stone Temple Consulting (STC) offers search engine optimization and search engine marketing services, and its web site can be found at: http://www.stonetemple.com.

For more information on Web Marketing Services, contact us at:

Stone Temple Consulting
(508) 485-7751 (phone)
(603) 676-0378 (fax)
info@stonetemple.com

Wednesday, June 16, 2010

Stone Temple Consulting (STC) Articles and Interviews on SEO Topics



Dixon Jones Interviewed by Eric Enge

Posted: 15 Jun 2010 09:36 AM PDT

Published: June 7, 2010

Dixon Jones is the Marketing Director of Majestic12 LTD, owners of a web-based technology used by the world's leading SEOs to analyze how web pages on the Internet connect between domains. It is the largest database of its kind that can be analyzed publicly in this way, with well over a trillion backlinks indexed.

Dixon Jones is also a founding director of an Internet Marketing company and has a decade of experience marketing online, primarily above the line. He built Receptional up from a start-up in his front room, working with David from near the start, into a team of 15 Internet Marketing Consultants at last count, with little sign of a slowdown. The office has now become so tight that the landlord has agreed to build a considerable extension that would more than double the floor space. Dixon doesn't think the landlord wants them to leave.

Dixon Jones' other accolades (or chains, depending on your point of view) in the world of Internet Marketing include being a moderator on WebmasterWorld, which most webmasters have heard of; if you haven't, you are probably not a webmaster. To be fair, Dixon is not really a webmaster these days either; he is an Internet marketer, though one who doubts that an Internet marketer can really understand the nuances of the Internet Marketing world without at least some understanding of web servers and CMS systems.

Interview Transcript

Eric Enge: One of the landmark deals of the industry in 2009 and 2010 was the search agreement between Microsoft/Bing and Yahoo. I am sure one of the things that they consider to be a minor side effect was the announcement that this would result in Yahoo Site Explorer becoming obsolete. That leaves us with a situation where the SEO industry has lost its ability to analyze link structures and get access to link data, as links continue to play a huge role in rankings.

That means that other tools are required. I would say that Linkscape and Majestic-SEO are the two major contenders to benefit from everything that has happened. Can you tell us anything about how you go about collecting your data?

Dixon Jones: Majestic-SEO was born out of an attempt to build a distributed search engine. By distributed I mean that instead of getting a massive data center the size of Google's to try and crawl everything on the web, what we did about four years ago was get people to contribute their unused CPU cycles to our crawling efforts. We have more than 1,000 people now who have downloaded a crawler onto their PCs. When they have spare bandwidth, the crawler crawls the web from their PCs and servers. We have been able to crawl incredibly quickly, but it took a couple of years to get the crawl right and optimized.

As we built it up, we started to crawl the web from hundreds of websites and machines every single day. We didn't try to collect all of the data about the Internet, because we realized early on just how much memory that would require. What we started doing was looking at the links: not the internal links, but simply those links between domains and sub-domains. We are looking at the link data that we think people would find difficult to analyze any other way. For internal links, you can use something like the Xenu Link Sleuth tool to analyze the link structure of a particular website.

A few years ago, the only data available that was giving us any kind of backlink information was Yahoo Site Explorer. We think that we overtook Yahoo Site Explorer a couple of years ago in terms of volume of links indexed, and we are now at 1.8 trillion URLs as of May 2010.

So there is an awful lot of data that we have collected over that time, and with the way that we are giving it back to people, it's really easy to go and analyze everything from the ground up. Even the links that have been deleted remain in our database, so deleted links begin to show up over time as well. Then of course we are recording the anchor text, whether it's Nofollow, an image link or a redirect. We pretty much have all the data that we need, and we provide a web-based interface into the data that SEOs can use.

Eric Enge: What is a typical capacity of one of these computers?

Dixon Jones: I can actually show you part of this, an example taken from our crawlers on this specific day. Someone in Malaysia is our top crawler today; he has crawled 35 million URLs. Magnus has done 18 million, and you can see all the different people that are coming in. We have 113 people crawling at the moment, and 214 million URLs crawled today.

Majestic Crawl Data


The green line shows how we have been crawling over time, and you can see that in 2007 we figured out how to crawl better. We found a fairly steep increase in the amount of URLs that we were crawling in any given day. There is also a little competition going on between the people who are crawling. We have someone who has done close to 30 million URLs and 809 megabytes of data overall.

Different PCs are obviously going to do different levels of crawling. If you find me somewhere down in that list, I am right near the bottom with my little local broadband connection at home, but you know there are all sorts of people doing the crawling. There are some pretty hefty services out there doing some of the crawling, and of course we are doing some crawling directly as well, so it's not all distributed. Nevertheless, the back of this beast was broken through distributed crawling, and it continues to be a major part of Majestic's technology.

I think it's probably best to go through how I use Majestic with my SEO hat on, and how we do it that way. There are different ways of using it and there are different people doing different things. On the Majestic-SEO homepage, there is a Flash video that gives a two-minute introduction on how to use the system, which I suggest that people watch.

It will require that you register, but it's free. Some of the services do cost money to access, but it's worth registering just to get access to some of these free tools. One of the first things that I'll look at is the backlink history tool, which is just underneath the search box.

Majestic Backlink history


When an SEO receives a call from a client requesting links for their website, the SEO needs to very quickly ascertain whether the client's website has any chance of getting to the top of a search engine. Once they are registered, one of these free tools allows SEOs to start comparing websites.

You can see the number of backlinks that we found each month, and then the number of referring domains that we found each month as well. The difference is quite extreme; as you can see, petersons.com has lots and lots of backlinks coming in, but when we actually look at the referring domains, there is a lot more parity between the three sites. Petersons.com is still collecting links from more domains than anybody else on this bottom graph, but it is on a similar sort of scale. When Petersons picks up a site-wide link, that dramatically increases the backlink discovery rate: a site-wide link contributes a huge number of backlinks, but still from only one domain.

Eric Enge: Right, and of course one of the things that is interesting about that is what the added value really is of getting a whole pile of site-wide links.

Dixon Jones: I usually use this domain discovery rate as a barometer of how good a link-building campaign is, because the raw backlink graph on top can be extremely misleading. That's not to say that those links don't all produce some kind of benefit, but you would assume that, with this kind of difference, Google is working hard to distinguish the value of site-wide menu links from the value of an individual link. It's not to say that either one is valueless, but I imagine Google interprets the two slightly differently.
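The backlinks-versus-referring-domains distinction is easy to reproduce from raw link data. The sketch below counts both per month from a small invented list of discovered links; a site-wide link inflates the first number while adding only one to the second. The data and format are made up, not Majestic's.

```python
from collections import defaultdict
from urllib.parse import urlparse

links = [  # (source_url, month_first_seen) -- invented examples
    ("http://blog-a.com/post-1", "2010-03"),
    ("http://blog-a.com/post-2", "2010-03"),       # same domain, second backlink
    ("http://news-site.com/widgets", "2010-04"),
]

backlinks_per_month = defaultdict(int)
domains_per_month = defaultdict(set)
for url, month in links:
    backlinks_per_month[month] += 1
    domains_per_month[month].add(urlparse(url).netloc)

for month in sorted(backlinks_per_month):
    print(month,
          "backlinks:", backlinks_per_month[month],
          "referring domains seen:", len(domains_per_month[month]))
```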

Majestic Backlinks over time


On the left-hand side, there is a cumulative graph that I like to look at. In the second graph we can see how the links are being built up over time. We can see that Collegeview.com and Rileyguide.com are very, very closely matched on the number of domains that are linking to them. This screenshot here is a good, quick way of having a look at how far you have to go with a customer to try and get realistic rankings.

You are going to have to take some kind of view on whether you are interested in trying to get backlinks from web pages or from websites, because the two can mean very different things.

We also have another tool that does the same sort of thing, but it can handle about a hundred URLs at a time, and it's still free. The bulk backlink checker link on the homepage will allow you to enter all the sites as a list. You can also get a CSV file, which is always useful. With a list of at least 100 of them, you very quickly get an idea of how many external backlinks and referring domains we see for each of these domains. Let's use Hitwise, which lists websites within a given industry, as an example. You could enter all its sites and see the referring domains by industry segment, or you could use a DMOZ directory category, a Yahoo Directory category, or business.com, and that would give you some idea of all the sites in a particular vertical. You would very quickly be able to get an idea of a market using this bulk backlink checker.

I recommend that people register just so they can see all their own data for free. This way they won't have to pay to get a full report on their own website. All they have to do is put up a verification file, use Google Analytics, or anything else that verifies to us that they own the site, and from there they can get a report for all the sites that they control. When they want to start seeing data for sites that they don't control, that's when a subscription becomes required, which costs about 10 dollars a month. Even a fairly small B-to-B business can justify that, and it is enough to start doing some reasonably serious analysis, so the barrier is not that big.

On the homepage, one of the other things that you can do is set up folders and other structures to help organize your reports.

I have started by putting in one advanced report and one standard report. For the standard report, I analyzed this deep link:

Majestic Standard Report


One of the things that you can do with Majestic is analyze deep links rather than just domains. This is what we have with this particular deep link of education-portal.com/pages/Computer_Science_Academic_Scholarship.html. It comes out with an overview, which allows you to see the number of links coming into the domain itself, or to the sub-domain, which is http://www.education-portal.com. What's interesting to point out here is that the root domain has a lot more links coming into it than the www sub-domain.

Eric Enge: That's pretty rare.

Dixon Jones: Yes. It has lots of other sub domains, and there seemed to be an issue with some of the domains. With the advanced report, we can check backlinks to the domain itself.

Majestic Advanced Report Details


These are the main backlinks coming into the domain. What I'm going to do is take it right down to the URL. That obviously gives us fewer links, but I think it stands out because you can't go to Yahoo very easily and find the backlink text to an internal URL. You can find some domains and things, but here we can start seeing the best links as ordered by AC Rank. AC Rank goes from 0 to 15, and it's a very, very loose quality score, but it's extremely transparent. It's purely based on the number of links going into the page that is linking to your site.

It's not trying to find the PageRank or the quality of the link with any highly developed metric. We found that our clients really want to make that judgment for themselves. At this stage, we do intend to improve AC Rank, and we will also likely come up with another quality metric. We do not intend to try and emulate PageRank in any way; we haven't even sat down and looked at the way in which it was copyrighted. We want to come up with our own methodology for the quality of a page, but at this stage we have a very vague but transparent quality score with AC Rank.
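Majestic doesn't publish the AC Rank formula; the description above only says it runs from 0 to 15 and is based purely on the number of links pointing at the linking page. A simple log-scaled bucketing like the one below captures that spirit. It is a guess for illustration, not Majestic's actual calculation.

```python
from math import log2

def ac_rank_like(inlinks_to_linking_page: int) -> int:
    """Map a raw in-link count onto a coarse 0-15 scale (illustrative only)."""
    if inlinks_to_linking_page <= 0:
        return 0
    return min(15, int(log2(inlinks_to_linking_page)) + 1)

for count in (0, 1, 10, 500, 1_000_000):
    print(count, "->", ac_rank_like(count))
```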

On the backlinks view for this URL, we can start seeing the source URLs, the best links. If you click on anchor text, then things are sorted completely differently: you start seeing the anchor text coming into this deep page. You would expect a wide variety of anchor text, and of course you would expect to see the source URLs there as well. Again, you can export this as a CSV file, which is very useful.

Another tool, which I have written myself, goes and pings every one of these links. I can put them all in against one particular URL and go validate and check all of those links. Majestic crawls an awful lot of links, but it's optimized to collect data and discover new links; we are less focused on verifying that those links are still there over time. We do flag where the links are, and when they disappear, all of the links that we have marked as deleted will get filtered out.

Nevertheless, there may be plenty of links in there that have since been deleted but that we haven't yet gone back to check. So I set up another little routine, which you can also run manually, to go and visually confirm that these links still exist. At that stage, I can pull in some kind of quality score, whether it's compete.com, PageRank, or any other metric that I want to use to judge the quality of the page and of the incoming link. It's not something that everyone can get their hands on, but it's not a difficult script to write. We basically take all of these links from the CSV, then we plug them into another web-based system that we've built that just goes and physically checks whether the link still exists. Then we can also check a quality metric on the website.
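A minimal version of the kind of verification script described here might look like the sketch below: read an exported CSV of backlinks and re-fetch each source page to confirm that a link to the target domain is still present. The CSV filename and the "SourceURL" column name are assumptions, not Majestic's exact export format, and this is not Dixon's actual tool.

```python
import csv
import requests
from bs4 import BeautifulSoup

TARGET = "education-portal.com"   # the domain whose links we are verifying

def link_still_exists(source_url: str, target: str) -> bool:
    """Fetch the source page and check for any anchor pointing at the target."""
    try:
        html = requests.get(source_url, timeout=10).text
    except requests.RequestException:
        return False
    soup = BeautifulSoup(html, "html.parser")
    return any(target in a["href"] for a in soup.find_all("a", href=True))

with open("backlinks_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        source = row["SourceURL"]        # assumed column name
        print(source, "OK" if link_still_exists(source, TARGET) else "MISSING")
```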

Going back to what we have here, you can very quickly start filtering out the things that you don't want. If an SEO decides that he is not really interested in NoFollow links, he can very quickly exclude them, and that changes the report. The number of links might go down, but now we are downloading a much more accurate representation of the links that will actually carry juice, whether you then decide to sort by anchor text or by source URL. We can look at those and start seeing them in alphabetical order of URL and so on, so you can sort them any way you want. We also have some easy filters in this standard report. For example, say I am only interested in links that have "computer" in the anchor text. If we refresh the data, we can see all the links whose anchor text contains the word computer, and you can export that as CSV. You will very quickly start seeing a lot about a deep URL, and if you do the same for the whole domain, you get a lot more links as well. So please, if you compare us with Linkscape, make sure that you are comparing domain data against domain data rather than against URL data, because we have so much data here.

If you were to export that into CSV, the file that comes down is very easy to start manipulating and will allow people to see all the anchor text, external backlinks, and root domains, plus flags for whether the links are redirects or NoFollows. Of course, we have filtered all of those out in this section here, but if we did include them you'd have a much longer list. When you want to compare, you've got to make sure that you compare like with like.

It depends on the level of a customer's subscription, but for something like the standard report you would get at least 5,000 URLs. I think that Yahoo stops at 2,000, and we go up to 5,000. On higher levels, you can get 7,500. We have a lot more usually, but on the standard report we limit it to the top URLs because it is a much, much easier thing for us to take as a subset of the main database. This way, it's quicker for us to analyze, so it's easier on our servers. For advanced reports, the amount that you can take depends on your usage level.

I can get 17 reports on this subscription this month, but if I tried to start getting Amazon's backlinks, those would get extremely expensive, because we also have limits on the number of URLs within a report. This is why we like to use standard reports for sites like Amazon and eBay; you can get some good information without breaking the bank. If you really, really want to get all of the backlinks for eBay, that's going to require quite a lot of effort.

Here is a look at the advanced report for utexas.edu:

Majestic Advanced Report Details


This is the default report that you get back for an advanced report, and it is for the times when you really want to analyze in a lot more depth than you could with the standard report. It is laid out a bit differently, and the first thing I should mention is that we have decided that, for SEO purposes, people probably don't want a number of the links that we hold. If I click on options, I can redefine the search parameters within this report, and it doesn't cost more to do that; you just have to decide what you want. In this case, the default setting is that we don't want any links that have NoFollow on them.

Perhaps, we also don't want any that are deleted because they aren't useful for current SEO purposes. We keep the deleted links in our database because the information is extremely useful and valuable to see where somebody has been buying links or had an alliance with another website that's now gone to dust. We also take out the mentions as well.

Eric Enge: To clarify, a mention essentially is when the URL is there, but it's not actually a link?

Dixon Jones: Right. Somebody may reference Stone Temple Consulting but only put stonetemple.com without making it an anchor statement (a href= ...). They have the domain, but they didn't include an actual link, so it's a mention. Mentions are useful because if somebody has mentioned a site but didn't link to it, you might want to phone them up and ask them to add the link to provide proper credit.

To refine the report, you could target specific URLs. This becomes a useful tool for dicing the data, because there is an awful lot of it. If I go back to the control panel, we can see the extent to which the filtering rules have already trimmed the data. We started with 33,000,000 backlinks to utexas.edu and we've filtered out about 3,000,000 of them. That leaves 30,000,000 coming from 314,000 domains.

The backlinks total can be misleading, which is why I look at the referring domains as much as anything. By using the analysis option, which is the same as the options button, there are essentially three different reports that are created, to analyze the three things that people want to look at.

The first is the top anchor text that people are using to link to you. The second is the referring domains that those links come from, and the third is the top pages on your website, based on the number of external links coming into them. The homepage is almost always the most important page.
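Those three summaries are straightforward aggregations over a flat backlink export. The sketch below computes them with counters, assuming hypothetical column names (SourceURL, TargetURL, AnchorText); Majestic's real export headers may differ.

```python
import csv
from collections import Counter
from urllib.parse import urlparse

anchors, ref_domains, target_pages = Counter(), Counter(), Counter()

with open("backlinks_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        anchors[row["AnchorText"]] += 1                          # top anchor text
        ref_domains[urlparse(row["SourceURL"]).netloc] += 1      # top referring domains
        target_pages[row["TargetURL"]] += 1                      # most-linked pages on the site

for label, counter in (("anchor text", anchors),
                       ("referring domains", ref_domains),
                       ("top pages", target_pages)):
    print(label, counter.most_common(5))
```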

If you take www.utexas.edu, you can see some of the sub-domains that people are linking to. They've got www.lib.utexas.edu and various others; it seems that utexas.edu has sub-domains for all of its department areas.

These sub-domains can build links independently, whether on purpose or naturally. We can see that all these pages are bringing in links. Would you like to have a look at the anchor text, the referring domains, or the strongest pages on utexas.edu?

Eric Enge: Let's look at anchor text.

Dixon Jones: By clicking on the anchor text, more information is available and using a CSV file to extract this data is the most sensible tactic. You can either click explore by CSV or download all; explore by CSV takes the top 500 and download all takes everything.

Majestic Anchor Text Details


With the anchor text, let's take a look at Petroleum Eng Reading Room. There are 2,000 external backlinks here. A lot of these may be coming from other sub-domains within the utexas.edu site. We only record links from other domains, but a lot of these other domains can be sub-domains; I don't know. Looking for the phrase Petroleum Engineering Reading Room, we can see where the links are coming from, such as archive.wn, nag.com, meratime.com, electricity.com, energyproduction.com, and globalwarm.com.

Then we can look at these links for Petroleum Engineering Reading Room and see which related sites link through to utexas.edu. On this anchor text we can start seeing the spread of any one anchor text specifically. We can see that Petroleum Engineering Reading Room has 1,300 domains but only 12 IP numbers. This is significant because, although the links are coming from a lot of domains, the majority are related because they come from the same IP numbers.
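The many-domains-but-few-IPs observation can be reproduced with a simple resolution check: resolve each referring domain and count distinct IP addresses; many domains collapsing onto a handful of IPs usually indicates one related network of sites. The domain list below just reuses the examples quoted above and may not resolve in practice; this is an illustration, not Majestic's tooling.

```python
import socket
from collections import defaultdict

referring_domains = ["archive.wn", "nag.com", "meratime.com", "electricity.com"]

by_ip = defaultdict(list)
for domain in referring_domains:
    try:
        by_ip[socket.gethostbyname(domain)].append(domain)
    except socket.gaierror:
        by_ip["unresolved"].append(domain)   # keep track of names that don't resolve

print(len(referring_domains), "domains resolve to", len(by_ip), "distinct IPs")
for ip, domains in by_ip.items():
    print(ip, domains)
```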

Eric Enge: Right. Multiple sub-domains are counted separately.

Dixon Jones: Although these look like similar URLs, they are completely different domains. By exporting and opening the CSV, we see that these aren't sub-domains of utexas.edu; they actually are different web domains. That's even better information, really.

The major takeaway from this advanced report is that you can customize what you want to see and how much of it. If you really wanted to bring 36,000,000 backlinks down into a CSV file, and your software could handle it, you could analyze all of those in spreadsheets. That's Majestic-SEO for you.

Eric Enge: Thank you Dixon!

Dixon Jones: Thanks Eric!

Have comments or want to discuss? You can comment on the Dixon Jones interview here.





Thursday, June 10, 2010

Stone Temple Consulting (STC) Articles and Interviews on SEO Topics



SEOmoz Linkscape Team Interviewed by Eric Enge

Posted: 10 Jun 2010 01:40 PM PDT

Published: June 9, 2010

Rand Fishkin is the CEO & Co-Founder of SEOmoz, a leader in the field of search engine optimization tools, resources & community. In 2009, he co-authored The Art of SEO from O'Reilly Media and was named among the 30 Best Young Tech Entrepreneurs Under 30 by BusinessWeek. Rand has been written about in The Seattle Times, Newsweek and the NY Times, among others, and has keynoted conferences on search around the world. He's particularly passionate about the SEOmoz blog, read by tens of thousands of search professionals each day. In his minuscule spare time, Rand enjoys the company of his amazing wife, Geraldine.

Ben Hendrickson graduated from the Computer Science Department at the University of Washington. He then rather enjoyed being a developer at Microsoft, although not quite as much as his current position at SEOmoz. Nick Gerner led SEOmoz API development and worked on solutions for historical Linkscape data tracking prior to leaving SEOmoz about a month ago.

Interview Transcript

Eric Enge: Can you provide an overview of what Linkscape is, for the readers who aren't familiar with you all and what you have been developing?

Rand Fishkin: Linkscape is an index of the World Wide Web, built by crawling tens of billions of pages and building metrics from that data. The information Linkscape provides is something webmasters have cared about and wanted to see but search engines have been reluctant to expose.

Linkscape is a way to understand how links impact a website and how they impact the rankings given by search engines. Our aim is to expose the data in two formats. One for advanced users to perform some of the complicated analyses they have longed to do but couldn't, and a second to provide simple recommendations and advice to webmasters who don't necessarily need to learn the ins and outs of how metrics are calculated.

Ultimately, Linkscape will provide actionable recommendations that go beyond raw data to explain a site's ranking and the rankings of competitors. It reveals the sources of links and shows a user who is linking to a competitor but could be linking to them. Our tools also expose which links are more useful and which ones are less so.

Eric Enge: One interesting aspect of your tools is that in addition to collecting a large dataset of pages and links across the web, you do your own calculations to approximate trust value and rank value as in mozRank or mozTrust. Can you talk a little bit about that?

Rand Fishkin: For those who are interested in the technical details and methodologies, the patent applications are now available. For the less technical webmaster, mozRank is a way to think about raw popularity in terms of how many links point to a page and how important those links are and consequently how important that page is. It leverages the classic PageRank concept.

MozTrust asks a similar question, but with a trust-based bias. Instead of analyzing how important the page is among all other pages on the web, it looks at how important the page is to other trustworthy websites and pages. The link graph is biased to discount what mikes-house-of-viagra.info thinks and focuses instead on how sites such as WhiteHouse.gov, NASA, smithsonian.org, and Loc.gov (the Library of Congress) regard the page. They are powerful metrics for analyzing how trustworthy or how spammy a website or page is.
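The mozRank and mozTrust formulas are SEOmoz's own (the patent applications hold the details), but the underlying idea of a trust-biased link metric can be sketched generically: run an ordinary PageRank for raw popularity, then a second run whose teleport ("personalization") vector is concentrated on a trusted seed set. The toy graph and seed weights below are invented, and networkx is used only as a convenient stand-in.

```python
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("whitehouse.gov", "nasa.gov"),
    ("nasa.gov", "example-blog.com"),
    ("example-blog.com", "mikes-house-of-viagra.info"),
    ("mikes-house-of-viagra.info", "example-blog.com"),
])

# mozRank-like: plain popularity over the whole graph.
raw_popularity = nx.pagerank(G, alpha=0.85)

# mozTrust-like: teleport only to a trusted seed set, so scores reflect
# proximity (in link terms) to trustworthy sources.
seeds = {"whitehouse.gov", "nasa.gov"}
personalization = {node: (1.0 if node in seeds else 0.0) for node in G}
trust_biased = nx.pagerank(G, alpha=0.85, personalization=personalization)

for node in G:
    print(f"{node:32s} popularity={raw_popularity[node]:.3f} trust={trust_biased[node]:.3f}")
```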

Eric Enge: The original definition of TrustRank was founded on the concept of the distance, in number of clicks, between you and a seed set of trusted websites.

How TrustRank Works


Rand Fishkin: Ours is a little more complex than that, in the sense that if you are three clicks from one trusted site versus four clicks from multiple trusted sites, perhaps the guy who is four clicks from multiple trusted sites is in fact more trusted.

It's not only the source route, but a more complex interaction. It's similar to the mozRank or PageRank style of iterative algorithm where the metrics are combined.

Eric Enge: Can you dive into some examples of what Linkscape and its companion product Open Site Explorer can do?

Rand Fishkin: One of my favorite applications is a tool called Link Intersect. Given a site and a few competitors, it shows who links to the competitors, but does not link to the page of interest.

Nick Gerner: Someone can quickly find exactly which links they could easily target to go out and pick up. It also communicates the types of communities in which competitors are engaging. It's great for a site that's new in an established community to learn the fundamentals of the community. In addition to high-level market data, it provides information on actual sites in the community and the blogs attracting other players in the field. These are the people that they need to build relationships with, or participate with in forums and blogs.

It's similar to any competitive back linking effort, but with the extra twist of quickly isolating the major communities. It can give a list of targets but it can also be used for link building strategy.

By using the metrics and prioritizing the data, we are indicating not only communities where everyone is engaging but the communities that are most influential and important targets.
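At its core, the Link Intersect idea reduces to set arithmetic over referring domains: domains that link to one or more competitors but not to you are candidate targets, and the number of competitors each one links to is a rough signal of how central it is to the community. The domain names below are placeholders, and this is only a sketch of the concept, not SEOmoz's implementation.

```python
my_links = {"partner.com", "industryblog.com"}
competitor_links = {
    "competitor-a.com": {"industryblog.com", "reviewsite.com", "forum.example.org"},
    "competitor-b.com": {"reviewsite.com", "news-portal.com"},
}

# Domains linking to any competitor, minus the ones already linking to us.
linking_to_competitors = set().union(*competitor_links.values())
opportunities = linking_to_competitors - my_links

def competitor_count(domain):
    """How many competitors this domain links to -- a rough community signal."""
    return sum(domain in links for links in competitor_links.values())

for domain in sorted(opportunities, key=competitor_count, reverse=True):
    print(domain, "links to", competitor_count(domain), "competitor(s)")
```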

Rand Fishkin: The corollary to this is another powerful tool called Top Pages, which can be seen in Open Site Explorer. Given a website, it provides a list of pages that have attracted the highest number of unique linking root domains. It doesn't only give pages with the highest number of links; it also shows pages with the greatest diversity of people linking to them.

This enables a business to understand how a competitor attained their ranking and links, and what was on their page that attracted the interest and link activity. A site owner can not only see what the competition is doing right, but perhaps what they themselves are doing wrong. They can see which pages on their site are attracting the majority of links and which are attracting few. It's a powerful system for competitive and self analysis.

Eric Enge: Basically it's a filter that creates a link dataset for a given site, but it's easily executable. I would call both examples of Link Analytics. Yahoo has tools that will extract data but the resulting list of links isn't well-prioritized and there aren't any additional tools for filtering.

Nick Gerner: Having an index to the web is only part of the story. The larger part is building scenarios and tools that go beyond data pukers. Beyond raw data, Link Analytics provides a site owner with an understanding of what the data means and what they can do to improve their placement. The data we provide is prioritized and has metrics to understand what's going on that's placing them where they are.

Rand Fishkin: We have been developing ways to make metrics actionable. PageRank is maybe 5% better than random guessing for ordering the search rankings and mozRank is not much better. They are both interesting metrics and are useful for indexation, but they don't tell the whole story.

By combining a ton of metrics, knowing the number of unique linking domains a page has as well as the anchor text distribution, mozRank, mozTrust, domain mozRank, and domain mozTrust, we are able to build models of how important a page is and how able it is to rank against something of a given competitiveness. That goes far beyond a link graph or calculating metrics, and speaks to the heart of the SEO problem: how to make pages accessible, do keyword research, and find or build pages that can compete for those keywords. To tackle the problem, the site owner needs to know who their competition is and how to implement a solution.

That's where domain and page authority are so amazing. They've been under the radar but are the best (and most correlated) metrics we've built for ordering the results.

Ben Hendrickson: When Linkscape first launched, we had a lot of metrics. We made mozRank and mozTrust and we had counts of how many links were of various sorts. There were also some we couldn't exactly classify and didn't know how to use them.

We can look at these metrics now and analyze popular pages. It's usually more interesting to look at external mozRank than normal mozRank. Typically, if the number of unique linking domains is low but other metrics are high, that's incredibly suspicious. Being able to make even these simple comparisons gives a very rich view of how one could use all of our metrics to look at any given problem.

Our numbers can be used in a holistic fashion to compare two pages. By considering Page Authority (PA) and Domain Authority (DA), strong inferences can be made.

Nick Gerner: mozRank and mozTrust are technical, very low level algorithmic measures that do a very specific task which is great. On the other hand they are not terrifically well packaged for human consumption.

Page Authority and Domain Authority are more closely matched with intuition. Big thinkers in the field look at our product and say it makes sense. To them, mozRank is not that different from PageRank, and is not packaged in a way that the user understands. Page Authority and Domain Authority, however, package the data in a consumable and extremely useful way.

Ben Hendrickson: The older tools that provide the number of unique linking domains are rather obsolete because it is easy to determine where that number came from and how to make it higher, or lower. The new numbers are derived from very complex formulas that even if everyone knew them, would be very hard to utilize. The simpler information has a whole lot of value because it's understandable and usable.

Eric Enge: Conceptually, how would you approach an arbitrary search query such as digital cameras to create a simple visual of the top ten results and the metrics driving the Page Authority, Domain Authority, and consequently the ranking?

Rand Fishkin: Linkscape and the tools built on top of it, Open Site Explorer and the Keyword Difficulty Tool, are designed to answer that question. If someone runs a query, they can quickly pull up the list of digital cameras and a list of pages.

SEOmoz Keyword Difficulty Tool


Looking at the rankings, there are questions that would be interesting to answer but couldn't be answered before: do some results rank lower because they are less important pages on powerful domains, is a higher ranking a function of an exact match keyword domain with lots of anchor text, or did they earn their ranking because they are very powerful pages on moderate domains? It would also be useful to know whether they have collected links from a large quantity of sources, or from a few of exactly the right sources.

Linkscape starts to answer those questions, which speaks to the true question every site owner wants to know which is what they need to do to move up the list.

Eric Enge: You have created a formula for evaluating this?

Rand Fishkin: You are definitely on the phone with two guys who love formulas for solving problems like these. As far as ranking the competitiveness of a keyword, this is exactly what the Keyword Difficulty Tool does. Typically, the process is to look at how many people are bidding in AdWords and how many people are searching for the keyword. Also, knowing how many root domains are ranking for it, as opposed to internal pages, is important. Those are all second-order effects that correlate with competitive rankings, but they don't answer the question of how a site compares with the competition.

Eric Enge: Your tool gives a site a sense of where they will get the best results for their efforts. If they have to climb a huge mountain to win on one keyword, but could win a different one with significantly less effort even though it may have less total volume, the decision could be quite easy. Difficulty tools have often used simpler metrics, and I never used them because they really didn't tell me what I wanted to know.

Rand Fishkin: It's frustrating for tool builders to build a Keyword Difficulty Tool that still doesn't get at the real answer. We could look at toolbar page rank or number of links that come through Yahoo, but we still don't know how powerful or important those are. To get to that takes building ranking models.

Ben Hendrickson: PA knits together the metrics that our Keyword Difficulty Tool is based on, in order to figure out how to combine our metrics to be predictive of Google's ordering of results. The major missing piece in Page Authority is the keyword features, in terms of defining the content on the page.

If, in comparing two pages, one ranks higher, and you are trying to determine whether the guy outranking you has more or better links, Page Authority helps to answer that. If it's not link strength, it could be an anchor text issue, in terms of the anchor text distribution and how well that matches the keyword being analyzed, or an issue with on-page features.

Eric Enge: That holds a lot of value because it's a meaningful metric that the client can get advice on. If they are in position five and want to go to position three, it tells them how big of a move that is which couldn't be done with the earlier tools. That's really cool.

Ben Hendrickson: A useful feature of using the Keyword Difficulty Tool is that even though there is a degree of error in our models because we don't understand Google a hundred percent, it should average out that error.

Rand Fishkin: I pulled up the search on digital cameras. It has 91% difficulty which is extremely competitive. I am looking down the accordion chart and I see that the Domain Authority is consistently in the mid to high 90's for almost all of these sites with one exception, the guy ranking #4. Looking over his Page Authority, I see he has an 88% for Page Authority, so it's a pretty important page, but still some questionability there. Going down I see that #4 is digitalcamera-hq.com, and it makes sense because all his anchor text to that homepage likely says digital cameras, so it's not surprising that he is doing so well, and is beating out Amazon.com, and usa.canon.com, and Overstock.com, and Ritz Camera. Clearly he is winning the battle.

Another case is RitzCamera.com which is at #9. They have a little ways to go in Domain Authority. Their Page Authority is strong, but perhaps anchor text is their weak area. They could focus on getting links to their homepage with the exact match to the anchor text. To boost the Domain Authority, they could add link bait or run a blog or have a UGC portal on the site, or get people creating profiles, and contributing, and linking to lots and lots of different pages on their site.

Using this example of digital cameras shows how our tool starts to answer the question of how competitive a keyword is, but more importantly it sheds light on the missing pieces behind a ranking and what a site can do to fill the gap.
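A toy difficulty score in the spirit of this walkthrough could simply average the Page Authority and Domain Authority of the current top results: strong, authoritative incumbents mean a harder keyword. The values below are rough placeholders for the sites discussed above, and this is only an illustration of the idea, not SEOmoz's actual Keyword Difficulty formula.

```python
top_results = [  # (url, page_authority, domain_authority) -- illustrative values
    ("amazon.com/digital-cameras", 93, 97),
    ("digitalcamera-hq.com", 88, 80),
    ("ritzcamera.com", 85, 90),
]

def difficulty(results):
    """Average of per-result (PA + DA) / 2, on a 0-100 scale."""
    scores = [(pa + da) / 2 for _, pa, da in results]
    return sum(scores) / len(scores)

print(f"difficulty: {difficulty(top_results):.0f}%")  # higher means harder to break in
```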

Eric Enge: Given the model that you are creating, have you been applying machine learning techniques to better approximate Google?

Ben Hendrickson: That's where our authority numbers now come from. What we control for depends on what you are trying to model at a given time; Domain Authority obviously has fewer inputs than Page Authority (DA only knows the domain you're on, PA knows the exact page), and we are going to try to model everything. We're working even now on topic modeling, which we think can get us a significant step closer to accuracy and prediction.
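A stripped-down version of this kind of rank modeling: fit a model from link metrics to observed Google positions, then check how well the model's ordering correlates with the real ordering (Spearman correlation), which is how claims like "PageRank is maybe 5% better than random guessing" can be quantified. The features and data below are synthetic, and this is not SEOmoz's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.stats import spearmanr

# Features per result: [unique linking root domains, mozRank-like score,
# anchor-text match]. All values are made up for the example.
X = np.array([[5200, 7.1, 0.9],
              [3100, 6.8, 1.0],
              [2900, 6.5, 0.4],
              [ 800, 5.9, 0.2],
              [ 300, 5.1, 0.1]])
observed_positions = np.array([1, 2, 3, 4, 5])   # the actual ranking order

model = LinearRegression().fit(X, observed_positions)
predicted = model.predict(X)   # evaluated on training data only because this is a toy

rho, _ = spearmanr(predicted, observed_positions)
print("Spearman correlation between predicted and actual order:", round(rho, 3))
```

In practice the model would be trained and evaluated on many queries, and the interesting output is which feature combinations correlate best with the observed ordering.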

Nick Gerner: In the search engine ranking factors survey, which we do every three years, each feature area is covered by a number of independent low-level features. At a high level we cover the quantity of links, the authority of the links, the trust of the links, the domain the page is on, and whether there is a keyword in the URL. There are ultimately dozens of features.

Ben Hendrickson: We can tell people whether the more important factor is the overall domain mozRank or the individual page mozRank.

Eric Enge: Have you done anything in terms of using other trusted datasets to train your algorithms?

Rand Fishkin: We did early modeling on datasets from places like Wikipedia, but as we've expanded out to the broad web, we've been training our rank modeling on Google.com's search results. We think that's what most SEOs care about.

Ben Hendrickson: There are issues building tools with external data because these numbers are harder for us to obtain. We haven't found anything useful enough yet to go out of our way to figure out how to approximate it internally.

Rand Fishkin: Getting access to Twitter data either through their fire hose directly or a third party is going to be useful and interesting for us because there is little doubt in my mind that Google is using it in some interesting ways.

Ben Hendrickson: The most interesting one that we have seen was Delicious Bookmarks, but in terms of filling in the gap in our data it wasn't big enough for us to actually look into how to guess their methodology.

Rand Fishkin: We are all working on showing webmasters and SEOs these metrics because a lot of people still want to know the importance of PageRank, Compete Rank, Alexa Rank, and the number of links via Yahoo. It's against our core beliefs to not expose that data since we've got it, so we are going to make an effort to show people the ability of each of these metrics to predict rankings, and how useful combinations of those metrics are, and Domain Authority, and Page Authority, and ranking modeling.

Eric Enge: Can you outline another interesting scenario?

Rand Fishkin: It's exciting to see people doing analysis of links that matter. A lot of people are concerned about which of their campaigns are providing them a return on their investment, whether that's a public relations campaign, a link acquisition campaign, or even something like a link buying campaign.

People look intently at which links provide value, using metrics like Domain Authority, Page Authority, and even mozTrust. With our API or through a CSV, some users are using Excel to see all their links, their competitors' links, and the domains the links come from, and to determine whether those links are worth getting. They are also able to see if a given link acquisition campaign they conducted had a positive impact.

In the past, we have had to use the second-order effects of traffic and rankings to analyze link building, but the problem with that approach, especially for competitive rankings, is that if a site moves up only one or two places, there's no way to know whether it is stagnating because the link building campaign isn't working or because competitors are gaining links faster.
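The CSV workflow described above is easy to reproduce in code as well as in Excel. Below is a hedged sketch in Python that filters a link export by Domain Authority; the column names are assumptions about what such an export contains, not an exact Linkscape schema.

    # Sketch: filter a link export by the authority of the linking domain.
    # Column names ("domain_authority", "source_url", "anchor_text") are
    # assumptions about the export, not an exact schema.
    import csv

    def strong_links(path, min_domain_authority=50.0):
        """Yield (source URL, anchor text) for links from authoritative domains."""
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if float(row["domain_authority"]) >= min_domain_authority:
                    yield row["source_url"], row["anchor_text"]

    # Example usage, assuming a CSV exported from a link tool or the API:
    # for url, anchor in strong_links("competitor_links.csv"):
    #     print(url, "->", anchor)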

Nick Gerner: We now have hundreds of people sucking down our API and integrating it into their own toolsets, which is really exciting because the data is getting out there and people are looking at it, being critical about it, and integrating it into their processes.

Rand Fishkin: We have a number of metrics we focus on, but as human beings we are excited about our projects, which is why we are giving away so much data and doing API deals. We want people to use our data and be critical of it because that'll make it better, that'll make SEO less of a dark art and bring some science to this marketing practice.

Eric Enge: Yes. There is a real shortage of science in the whole process, that's for sure.

Rand Fishkin: There are practitioners in the search marketing field who aren't going to be digging into formulas and looking at patent applications, but for the population that does care and wants to dig in deeply, we owe it to them, and to ourselves to be transparent. It's always frustrated me that Google encourages SEO but gives no information on how or why their rankings are calculated.

The answers Google provides to questions are often surface-level and are filtered through a PR lens. I believe, along with a lot of other people, that they could open source their algorithm, because exposing it is not dangerous if you have smart engineers and tons of people who care. Wikipedia is open for anyone to edit, and yet it still has phenomenal quality because most of the community are good players.

Google is the same thing. Most of the players are good players. The spammers could do a lot of interesting things if they knew everything, but hiding data via a "security through obscurity" policy is not the way that we should act. Therefore, I want to share the math and the daily processes underneath and be open about how we calculate data. Even if we can't share the model and its thousands of derivatives, we can explain all the content behind it.

Eric Enge: Let's talk briefly about the API. You've mentioned it a couple of times, but could you go into more detail on the kinds of technology involved, what data you can get, and ways to use it?

Nick Gerner: The API is a huge part of our engineering and infrastructure. We run a large cluster of machines out of Amazon Web Services, and all of our tools are powered by this API. We opened up a lot of new information in our API, which is the same API that powers Open Site Explorer. If someone approaches us who thinks they can do a better job with our data than we are doing with Open Site Explorer, they can actually build it themselves; we encourage this and are always looking for partners, because getting our data out there and sharing that mentality with as many people as possible helps the industry and helps us as well.

Rand Fishkin: The API has two implementations: the free one, up to a million calls a month, and the paid one, beyond a million calls a month. There are definitely times when someone is doing interesting things and needs to test above and beyond the million calls per month, and we can make an exception and allow them to do it for free.

Nick Gerner: We are really flexible about that. We have a forum where people ask questions, and we are super responsive to it. It's on the API wiki, which is apiwiki.seomoz.org. We have documentation there as well as sample applications. Dozens and dozens of agencies and software developers are there.
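For developers who want to try the free tier, the sketch below shows the general shape of a signed request in the style documented on the API wiki (an access ID plus an expiring HMAC-SHA1 signature). The endpoint path, parameter names, and response fields here are assumptions to verify against apiwiki.seomoz.org rather than copy verbatim.

    # Sketch of a signed call to the free URL-metrics API. Treat the endpoint
    # path, parameter names, and response fields as assumptions to verify
    # against the documentation at apiwiki.seomoz.org; only the general
    # pattern (an access ID plus an expiring HMAC-SHA1 signature) is shown.
    import base64, hashlib, hmac, json, time, urllib.parse, urllib.request

    ACCESS_ID = "your-access-id"      # issued when you sign up for the free tier
    SECRET_KEY = b"your-secret-key"

    def url_metrics(target_url):
        expires = int(time.time()) + 300
        to_sign = f"{ACCESS_ID}\n{expires}".encode()
        signature = base64.b64encode(
            hmac.new(SECRET_KEY, to_sign, hashlib.sha1).digest()
        ).decode()
        query = urllib.parse.urlencode({
            "AccessID": ACCESS_ID,
            "Expires": expires,
            "Signature": signature,
        })
        endpoint = "http://lsapi.seomoz.com/linkscape/url-metrics/"
        request_url = endpoint + urllib.parse.quote(target_url, safe="") + "?" + query
        with urllib.request.urlopen(request_url) as resp:
            return json.loads(resp.read())

    # print(url_metrics("www.seomoz.org"))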

Rand Fishkin: Nick, could you give a few examples of the public applications?

Nick Gerner: There is the search status bar that integrates some of our metrics. I know a lot of people have this one because of the traffic to the API on it. More recently the SEO bar for Chrome has been a hit. It started as a YOUmoz post on SEOmoz and it was huge. It was easily one of the most popular blog posts that month when it launched.

There are also directories using mozRank or Page Authority, which are available in the free API.

Virante has donated open source code to the community showing how to use the API. They are using our data to find issues on sites that go beyond simple linking issues, into more architectural and technical problems. They are essentially using our data to create technical solutions to those problems. Someone can plug in their site, see the issues it identifies, click a button, and get an .htaccess file to solve them.

Rand Fishkin: And basically, it 301 redirects a bunch of 404s and error pages, which is nice for non-technical webmasters who don't want to go through the process themselves. Virante plugs into the API and comes out with a tool that does it. They had been using it for their clients, who loved it, so they put it out for free.
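As a rough illustration of that kind of output (this is not Virante's actual tool), a few lines of Python can turn a mapping of dead URLs onto live ones into the .htaccess rules a non-technical webmaster would need:

    # Sketch: turn a mapping of dead paths to live pages into .htaccess rules.
    # The mapping here is invented; a real tool derives it from crawl or API data.

    redirects = {
        "/old-review.html": "/reviews/",
        "/2008/summer-sale": "/sale/",
    }

    rules = [f"Redirect 301 {old} {new}" for old, new in redirects.items()]
    with open(".htaccess", "w") as f:
        f.write("\n".join(rules) + "\n")
    print("\n".join(rules))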

Nick Gerner: HubSpot also integrates some of our data in their product.

Eric Enge: Can we talk a bit about comparable services and tools?

Rand Fishkin: There are several sources of link data on the web. Yahoo Site Explorer, which relies on Yahoo's web index, has been the most popular for the longest time. Some are concerned that the data will go away when Bing powers their search results.

Site Explorer is an awesome source, with pros and cons in comparison to Linkscape data. Yahoo Site Explorer's data is fresher than Linkscape data. Linkscape crawls the web every two to three weeks and then has a couple of weeks of processing time to calculate metrics and sort orders for the tools and API, so Linkscape produces a new index every three to five weeks. Yahoo, however, is producing multiple indices, multiple times per day. When you query Site Explorer, chances are the data will be much fresher. If a site was launched last week, Yahoo is the better tool to see who is linking to it in that first week. We are working on fresh data sources, but for now Yahoo is great at that.

Yahoo is great at size as well, but even though they are bigger than Linkscape, they only expose up to 1,000 links for any URL or domain, so much of the time Linkscape actually has a greater quantity of retrievable links. One weakness of Site Explorer is that it doesn't show which links are followed versus not-followed. It is also not possible to see which links contain what anchor text, or to see the distribution of those anchor texts. It also doesn't show which pages are important or unimportant, or which ones are on powerful domains. Those metrics are, in our opinion, critical to the SEO process of sorting and discovering the right kinds of links.

Another player in this space is a company called Majestic-SEO. They have a distributed crawl system similar to the SETI at Home Project. Lots of people are helping them crawl the web. In terms of raw numbers of retrievable links, their data set is tremendously large, in fact substantially larger than Yahoo which makes some webmasters raise an eyebrow. They've been crawling the web for many years, and storing all of that data.

Something to consider here is that studies of the web have shown that, even though the good pages keep getting refreshed, roughly 50% of the web disappears every year and 80% disappears over a couple of years. Majestic has a great deal of historical information about links that may or may not still exist. Some people like having that ability to see into their past. And though that segment of the information is dated, because they don't calculate or process a lot of metrics, some of their data is very fresh alongside the older material. We certainly consider them a competitor and work to have better features, but we respect what they are doing, and a lot of webmasters like their information as well.

Those would be the three big ones, Linkscape, Yahoo, and Majestic. It's also possible to do a link search or Google blog search and find some good links there as well. Alexa also has some linking information but it's not terrific.
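The specific gaps Rand notes in Site Explorer, the missing followed versus not-followed split and the missing anchor text distribution, are straightforward to compute from any link export that carries those fields. A small Python sketch with assumed column names:

    # Sketch: summarize a link export by follow status and anchor text.
    # Column names ("followed", "anchor_text") are assumed, not an exact schema.
    import csv
    from collections import Counter

    def summarize(path):
        follow_status, anchors = Counter(), Counter()
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                follow_status["followed" if row["followed"] == "1" else "nofollowed"] += 1
                anchors[row["anchor_text"].strip().lower()] += 1
        return follow_status, anchors

    # follow_status, anchors = summarize("links_export.csv")
    # print(follow_status.most_common())
    # print(anchors.most_common(10))  # the anchor text distribution Site Explorer lacks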

Eric Enge: For heavy duty SEO, if Yahoo Site Explorer does disappear, the real choices are Linkscape and Majestic?

Rand Fishkin: That's absolutely right. The great thing about these two companies is that they will push each other to be better. Majestic is working hard, as are we, both pouring a lot of money and resources into smart people to try to be the best and provide webmasters with the absolute best data. I think that's great. I lament the day that there is one search engine. If Google does ever take 90%+ market share, I don't think innovation will happen.

Eric Enge: Could we cover a few interesting metrics, such as how many pages you've crawled and how many links you are aware of, or anything along those lines?

Nick Gerner: That is a hard question because our tools focus on a timeframe of approximately a month, but in that timeframe we crawl roughly 50 billion pages and see on the order of 800 billion links. Those 50 billion pages are spread across roughly 270 million sub-domains and around 80 million root domains. Those are good ballpark figures.

The data we refresh stacks on itself as you go back in history. We do have the historical data, but we aren't doing anything externally with it right now. Historical data is useful, but in terms of important links, what matters is where competitors are engaged today, what communities they are engaged with, and what a site looks like now. Including historical data could increase our numbers tremendously, but the numbers we publish reflect the data that is actually serving our users: 50 billion pages, 270 million sub-domains, 80 million root domains, and in the neighborhood of 800 billion links.
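As a quick sanity check, those figures imply a few simple ratios (back-of-the-envelope arithmetic on the numbers above, not additional published statistics):

    # Rough ratios implied by the index numbers above.
    pages, links = 50e9, 800e9
    subdomains, root_domains = 270e6, 80e6

    print(round(links / pages))          # about 16 links per crawled page
    print(round(pages / subdomains))     # about 185 pages per sub-domain
    print(round(pages / root_domains))   # about 625 pages per root domain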

Rand Fishkin: Some users don't like that our numbers are smaller than Yahoo, or that Yahoo's numbers are smaller than Majestic. People think bigger numbers have a bigger impact. That desire for bigger numbers needs to be balanced against the usefulness and value of the information for webmasters.

We also want to provide the most transparent story. Sure, we have crawled two trillion pages, but that doesn't matter if we only serve data through Open Site Explorer on the 50 billion URLs we saw in the last 30 days. Maintaining this outlook has been tough because, branding-wise, big numbers are a big selling point.

Eric Enge: The other tradeoff is how often you can refresh the data.

Nick Gerner: Since we want to make our cycles shorter, we might actually err on the other side and end up with smaller numbers. We might have a two-and-a-half or three week cycle with slightly smaller index sizes. We want to have multiple indexes and match things up rather than tie everything to a single cycle. If we really want our number to be bigger than Yahoo's, then we can make an index covering the last 90 days or 4 months.

Rand Fishkin: As we start to seriously address the historical data of the last six months, or two to three years, our numbers are going to reflect that. Right now we are showing just the latest snapshot, instead of, for instance, a snapshot for every month over the last five years.

Eric Enge: Any metrics on how no-follow usage has dropped?

Nick Gerner: We don't have data to suggest that it's dramatically dropped, but we do have data that suggests that rel canonical is taking off amazingly. The first time we looked, there were a million pages out of our 43 billion that were using it. Now, it's being used at least as much as no-follow is being used.

Ben Hendrickson: So, about 3% of pages now use rel=canonical.
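For perspective, simple arithmetic on the two figures just mentioned:

    # rel=canonical adoption, then versus now, from the figures above.
    first_share = 1e6 / 43e9      # one million pages out of 43 billion
    current_share = 0.03          # roughly 3% of pages

    print(f"{first_share:.5%}")                # about 0.002% of pages at first
    print(round(current_share / first_share))  # a jump on the order of 1,000x in share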

Nick Gerner: Some smaller proportion of large sites use it on all their pages, but for those big sites that have jumped on that bandwagon, it's great for webmasters and SEOs because it's another tool to use.

Rand Fishkin: Rel=canonical is absolutely phenomenal, and I almost always recommend it by default because it protects against pages getting weird stuff on them or people adding weird tags. What's your philosophy Eric?

Eric Enge: I am a big believer in rel=canonical. In the recent interview I did with Matt Cutts, he clearly stated that it's perfectly acceptable to have rel=canonical on every page of a site. Doing that offers protection from people linking with weird parameters on the end, by essentially de-duplicating those pages, which otherwise might not be de-duplicated.
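As a concrete illustration of that de-duplication, the small Python sketch below strips query parameters and emits a canonical tag. Dropping every parameter, as it does, is a simplification; a real site should keep parameters that genuinely change the content.

    # Sketch: point parameterized URLs back at one canonical version.
    # Dropping every query parameter, as here, is a simplification; keep any
    # parameters that genuinely change the page's content.
    from urllib.parse import urlsplit, urlunsplit

    def canonical_tag(url):
        parts = urlsplit(url)
        canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        return f'<link rel="canonical" href="{canonical}" />'

    print(canonical_tag("http://www.example.com/cameras?utm_source=foo&sessionid=123"))
    # <link rel="canonical" href="http://www.example.com/cameras" />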

Nick Gerner: It's exciting to see the web pick it up, and to have datasets with a snapshot every month. It is somewhat surprising that no-follow usage hasn't dropped. I don't know if people haven't seen a negative impact from putting no-follow on links on their pages, internally or externally, or if they don't feel that removing it would have a strong positive impact, generally speaking.

Eric Enge: I wonder if it says that at this point use of no-follow in blogs and forums is dominating the sculpting.

Rand Fishkin: In the near future, we could potentially expose where no-follow is being used, whether it is a small number of domains with a large number of pages or the reverse, and how widespread it is. If I can offer a hypothesis about the no-follow question, a lot of the recommendations have been: if it's working, don't change it. There hasn't been a huge mass migration, but maybe new sites that are being built are more conscious of it. On the other hand, with rel=canonical, it's basically big, huge sites that make up large portions of all the pages on the web, so it's not a surprise that it's taking off in a dramatic way. With no-follow, people are still unsure about whether they should change, how they should change, or whether it's even worth investing time in thinking about it at all.

Eric Enge: What do you have in the pipeline for future developments?

Rand Fishkin: By this summer, users will have the ability to look back in time at previous Linkscape indices and compare link growth rates. We think webmasters are interested in which links they have gained or lost over the last few months, particularly important links, and that is the information we will try to provide.

We are also going to do more with visualization. Open Site Explorer is a good place to be more visual about the distribution of page authorities and anchor text. Having clear charts and graphs can help users see dips. Visually, it can highlight what may be an opportunity, something they missed, or an outlier that prompts them to dig deeper. That's an area where we expect significant growth.

Index quality and size is going to get a tremendous amount of attention as well. It's gotten a great deal of attention over the last 6 or 9 months, but we still have a long way to go in terms of size, freshness, quality, what we crawl, how we crawl it, and how much of it makes it into the index.
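The gained-and-lost comparison mentioned at the start of this answer boils down to a set difference between two snapshots of a page's inbound links; a minimal sketch with invented data:

    # Sketch: compare two snapshots of a page's inbound links to see gains and losses.
    previous = {"a.example/post", "b.example/review", "c.example/list"}
    current = {"a.example/post", "c.example/list", "d.example/roundup"}

    print("gained:", sorted(current - previous))  # ['d.example/roundup']
    print("lost:", sorted(previous - current))    # ['b.example/review']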

Nick Gerner: The Page Authority and Domain Authority have been incredible for us so far and we are going to do more along those lines too.

Rand Fishkin: The data points get better every month, and Page Authority and Domain Authority get more accurate at predicting Google rankings. There is some internationalization we need to consider for the long term, such as scenarios where someone is searching on google.co.uk, google.com.au, or google.de; Page Authority and Domain Authority might not stack up perfectly there. We still need to take those into account.

Eric Enge: Do you have a lot of usage internationally?

Rand Fishkin: We do. Right now something in excess of 40% of all pro members at SEOmoz are from outside the US.

Eric Enge: Thanks Rand, Nick & Ben!

Have comments or want to discuss? You can comment on the SEOmoz team interview here.

About the Author

Eric Enge is the President of Stone Temple Consulting. Eric is also a founder in Moving Traffic Incorporated, the publisher of Custom Search Guide, a directory of Google Custom Search Engines, and City Town Info, a site that provides information on 20,000 US Cities and Towns.

Stone Temple Consulting (STC) offers search engine optimization and search engine marketing services, and its web site can be found at: http://www.stonetemple.com.

For more information on Web Marketing Services, contact us at:

Stone Temple Consulting
(508) 485-7751 (phone)
(603) 676-0378 (fax)
info@stonetemple.com
