One of the biggest activities that our agency undertakes is Manual Penalty Removal. A manual penalty is shown via a notification within your Webmaster Tools console.


This blog post looks at the main data sources used in the SEO industry and compares the data found within the exports. The aim is to help you decide which data source is most relevant for the tasks you are completing and, most importantly, which data sources you need when undertaking Google manual penalty recovery work.

We will be using as the domain and link profile to analyse. We chose this as it has never had any “SEO” done on it. We do actively market the site to the digital marketing community but we can honestly say we have never “bought” an SEO link to the site.

As a result of the good links we have earned, we have picked up a lot of scraper style listings giving us a nice mixture of high and low quality inbound links.

We did not use this site, as it is too new: we only rebranded and launched the domain in April 2014. We also accept that we have a small data set, as we are comparing just one site. We may expand the post and analyse a batch of sites, but for the question we were looking to answer, one site will suffice.

Data Gathering

One of the most important elements of recovering a site from a Google link based penalty is to gather as much link data as possible.

As Google do not always give examples of what they are classifying as a bad link when responding to a reconsideration request, gathering as much link data as possible at the beginning of the process is a major factor in how long recovery takes.

The link data sources

Whilst some tools in the industry claim to have 22 link sources (?), we firmly believe that using the following 4 data sources does give you as clear a picture as possible.

The source name links through to the website and the download link will send you to a guide on how to download your links as a raw export – which is how the links for this exercise were downloaded.

MajesticSEO – Backlink checker and site explorer – Download instructions

Ahrefs – Site explorer and backlink checker – Download instructions

Open Site Explorer (OSE) – Moz’s Search Engine for Links – Download instructions

Google Webmaster Tools (WMT) – Latest Link Report – Download instructions

The Majestic fresh index contains links that the Majestic crawlers have found in the last 90 days. The Majestic historic index contains links that Majestic has crawled over the last 5 years. We would always advise that if you are under a manual penalty you HAVE to gather both the fresh and the historic data.

The historic index will give you a lot of instances where the link is not found due to lost links and sites closing over the longer period of time so do bear this in mind.

For the purposes of this work, we have focussed on the MajesticSEO fresh index.

Moz has a limit of 10,000 links so any analysis beyond 10,000 potential links could leave some out. We are confident that Moz did give us an export of all the links they know about due to the overall size of the link profile.

Comparing data sources

Here is a link to the dataset we used

All data was downloaded within 5 minutes of each other and the crawling was done between 17.38 and 17.42 GMT on May 13th 2014 – so we hopefully have a fair snapshot as it was all taken at more or less the same time.

We crawled the domains using LinkRisk. The export from this will form the basis of our comparisons.

Feel free to use the dataset above and check the % of 408s. The 408 Request Timeout error is an HTTP status code meaning the request you sent to the website server (a request to load a web page) took longer than the server was prepared to wait. In other words, your connection timed out.

You may see more timeouts on “bad” links as the hosting may not be up to speed and the chances of a server timing out are increased.
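If you want to check the share of 408s in a crawl export yourself, a minimal sketch could look like the one below. It assumes you have already pulled the HTTP status column out of the export into a plain list; the sample values and the `status_share` helper are hypothetical and are not taken from our dataset or from LinkRisk.

```python
from collections import Counter


def status_share(status_codes, code):
    """Return the percentage of crawled URLs that answered with `code`."""
    if not status_codes:
        return 0.0
    counts = Counter(status_codes)
    return round(100 * counts[code] / len(status_codes), 2)


# Hypothetical recrawl results: one HTTP status code per linking URL.
sample_statuses = [200, 200, 404, 408, 200, 408, 301, 200]

print(status_share(sample_statuses, 408))
```

Running the same function per data source lets you see whether one export carries noticeably more timeouts than another.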

We have compared the tools against each other on the following criteria:

LinkRisk score

Source | All combined | MajesticSEO | OSE | Ahrefs | WMT
LinkRisk Score | 529 | 613 | 455 | 595 | 576


We see a big percentage difference in the LinkRisk scores produced by the different datasets.

[Images: LinkRisk data for MajesticSEO, Ahrefs, WMT and OSE]


Moz gave us a much lower level of risk than Majestic and Ahrefs, suggesting Moz is a better tool when it comes to finding links of a higher quality. This does make sense, as OSE is just one part of the Moz suite, whereas Ahrefs and Majestic are dedicated link data sources and you would expect them to make a deeper crawl.

Unique domains in the data sets and what % are live at recrawl

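Source | All combined | MajesticSEO | OSE | Ahrefs | WMT
No of Domains | 983 | 666 | 173 | 325 | 602
Active Domains | 636 | 429 | 134 | 269 | 451
% Live | 64.70% | 64.41% | 77.46% | 82.77% | 74.92%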


As you can see, all data combined and de-duplicated gave us a total of 983 referring domains. The difference between the data sources is striking, with Majestic and WMT each giving more than three times as many domains as OSE.
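For anyone who wants to reproduce the combine and de-duplicate step without a tool, here is a rough sketch. It assumes each export has been reduced to a list of full linking URLs (with `http://` or `https://`), and it treats the hostname minus a leading `www.` as the referring domain, which is a simplification and not necessarily how LinkRisk groups domains. The sample URLs and function names are hypothetical.

```python
from urllib.parse import urlsplit


def referring_domain(url):
    """Return the lowercased hostname of a URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def combined_domains(exports):
    """Merge every source's URLs into one de-duplicated set of domains."""
    domains = set()
    for urls in exports.values():
        domains.update(referring_domain(u) for u in urls)
    domains.discard("")  # drop URLs that had no parsable hostname
    return domains


# Hypothetical exports keyed by data source.
sample_exports = {
    "majestic": ["http://www.example.com/a", "https://blog.example.org/post"],
    "wmt": ["http://EXAMPLE.com/b", "http://forum.example.net/t/1"],
}

print(len(combined_domains(sample_exports)))
```

Because the domains live in a set, a site that appears in several exports is only counted once in the combined total.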

Individual URLs in the data sets and what % are live at recrawl

Source | MajesticSEO | OSE | Ahrefs | WMT
No of links | 12,898 | 418 | 1,336 | 2,290
Live links | 2,608 | 294 | 1,160 | 1,593
% Live | 20.22% | 70.33% | 86.83% | 69.56%


When we look at the actual number of linking pages the datasets provide, the differences are quite frankly amazing. MajesticSEO gave us almost 13,000 linking pages, although only 20% of them were live when we recrawled them. This does seem a lot given that this is MajesticSEO’s fresh index.
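The % live figures are simply the number of live links (or domains) divided by the number exported. As a quick worked check, the sketch below recomputes two of the figures quoted in this post: the MajesticSEO fresh index links (2,608 live out of 12,898) and the combined referring domains (636 active out of 983). The `percent_live` helper is just an illustration.

```python
def percent_live(live, total):
    """Return live/total as a percentage rounded to two decimal places."""
    if total == 0:
        raise ValueError("total must be greater than zero")
    return round(100 * live / total, 2)


# Figures quoted in this post.
majestic_fresh_links = percent_live(2608, 12898)
combined_domains_live = percent_live(636, 983)

print(majestic_fresh_links, combined_domains_live)
```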

One of the more random learnings from this was that it is actually Google’s own WMT that shows us the highest % of nofollow links!

Source | MajesticSEO | OSE | Ahrefs | WMT
% NoFollow | 11.96% | 9.18% | 15.34% | 22.41%

Unique links across sources

Given that the data provided by each source was so different we started to look at how many URLs were unique to each dataset – This helps us to help clients understand why we need to gather all datasets available.

  • Total appearing just in OSE = 74
  • Total appearing just in Ahrefs = 278
  • Total appearing just in WMT = 1,257
  • Total appearing just in MajesticSEO = 11,488

As expected, Majestic is showing us by far the most unique URLs. It is interesting to see that although Moz appeared to give us a much smaller dataset, there are still 74 URLs found in the OSE data that were found in no other source.
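The "appearing just in" counts above come down to a set difference: take one source's URLs and remove anything seen in any other source. A minimal sketch follows, assuming URLs are compared as exact strings (so `http` versus `https` or a trailing slash would count as different URLs unless you normalise them first). The sample data and the `unique_to_each` name are hypothetical.

```python
def unique_to_each(sources):
    """Map each source name to the URLs that appear in no other source."""
    result = {}
    for name, urls in sources.items():
        others = set()
        for other_name, other_urls in sources.items():
            if other_name != name:
                others |= set(other_urls)
        result[name] = set(urls) - others
    return result


# Hypothetical exports, one set of linking URLs per source.
sample_sources = {
    "ose": {"http://a.example/1", "http://b.example/2"},
    "ahrefs": {"http://b.example/2", "http://c.example/3"},
    "wmt": {"http://c.example/3", "http://d.example/4", "http://e.example/5"},
}

for source, urls in unique_to_each(sample_sources).items():
    print(source, len(urls))
```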

Sometimes it is still not enough

Whilst it does not happen often, we do occasionally see Google send examples of bad links that cannot be found in any dataset. This is never ideal and the reality is that there is very little we can do here. There are many theories that have been discussed about the WMT links list rotating through a larger dataset.

Unless you have been proactive and have been monitoring your latest links, there is every likelihood that you will not know about these links. In this scenario we would contact the example sites for removal, add them to the disavow file and explain the situation to Google in the next reconsideration request.


One of the main conclusions we would like you to take away is that your link profile is fluid – Links appear and disappear – servers go down and come back online. Scrapers always be scraping, people are always chatting in forums, presentations go online with references to a tool or product. All you can do is take a comprehensive snapshot of what you have available to you.

No one data set can tell you the full picture, not even Google’s own data set, so do gather multiple datasets and use a tool like LinkRisk or similar to de-duplicate and analyse them.

If you are doing penalty recovery, you need to include MajesticSEO’s historic index. We have seen unnatural examples given by Google that were from 2008!

If you are being proactive or you have an algorithmic issue, maybe fresh data will suffice. We do recommend you gather the historic data also, even if it does create more work. Imagine explaining the scenario where you have cleaned up the fresh index and a penalty is given for a link in the historic data.

If we had to pick just one source……

It really does depend on the scenario, and each source gives its own unique value:

  • Looking for the biggest volume of data – Use MajesticSEO
  • Looking for data on “good” links for you or a rival – Use OSE
  • Looking to find the worst of the worst in your link profile – Use MajesticSEO and/or Webmaster Tools.
  • Looking for the data set with the highest % of live links – Use Ahrefs

If you are completing a manual penalty recovery, you need to gather data from all of them, as you need to know the complete picture.

You can get the link data for your own site free of charge by validating with WMT and also MajesticSEO so that is always a good place to start. Paid accounts are available from every data source and we can also provide you with your link data should you allow us to help you recover your site.

Another major point we would like to make is that it is increasingly important to keep on top of your link profile and monitor the new links coming in.

You can do this manually, via LinkRisk and daily link imports – Or let us help you with our link profile management service.

Also – a big thank you to Paul Madden for helping me with the advanced Excel skills needed to work a lot of this out 😉