Duplicate Content: How to Get Rid of It
Oftentimes in SEO discussion communities you come across questions from webmasters asking, "If I do XYZ, will it cause a duplicate content penalty?" The common misconception, ever since Google released its Panda update, is that a duplicate content penalty exists and you risk having your site removed from Google's index if you have the same content on different pages of your site. At some point during your website's content creation you might have thought about duplicate content: using the same images multiple times across the site or, if it is an e-commerce site, worrying about category pages appearing in more than one URL with the same product and description, or about your articles being syndicated word-for-word on other sites. So, how much and what do you really need to worry about in terms of duplicate content? Let's start with the basics.
If you're not careful, you could be inadvertently publishing duplicate content a few different ways:
- Multiple URLs pointing to the same content
- Multilingual versions of the same page
- Paginated content
The good news here is that there are some on-page methods you can use to get rid of duplicate content on your site. They are known as rel="canonical", hreflang and rel="prev"/rel="next" (pagination).
What is Duplicate Content?
Duplicate content is any content that is identical to other content, whether that content exists on the same website or on a different one.
Examples:
- Your blog content syndicated (copied) onto another website.
- If your home page has multiple URLs serving the same content, for example: http://yoursite.com, http://www.yoursite.com and http://www.yoursite.com/index.htm.
- Pages that have been duplicated due to session IDs and URL parameters, such as http://yoursite.com/product and http://yoursite.com/product?sessionid=5486481.
- Pages with sorting options based on time, date, color or other criteria can produce duplicate pages, such as http://yoursite.com/category and http://yoursite.com/category?sort=medium.
- Pages with tracking codes and affiliate codes, such as http://yoursite.com/product and http://yoursite.com/product?ref=name.
- Printer-friendly pages created by your CMS that have exactly the same content as your web pages.
- Pages that are http before login and https after.
What is Not Duplicate Content?
Examples:
- Quotes from other sites, when used in moderation on your page inside quotation marks, preferably accompanied by a link to the source.
- Images from other sites or images repeated on your own site(s). (These are not considered duplicate content, as search engines cannot read the content of an image the way they read text.)
- Infographics shared via embed codes.
There is no such thing as a duplicate content penalty. Google itself has confirmed this, straight from the horse's mouth, on more than one occasion. But that does not mean you should take the issue of duplicate content lightly. The repercussion of having duplicate content on your web pages is a loss of traffic, simply because you are "omitted from search results". That's right, you are not de-indexed or penalized, but the duplicate content is simply not shown to users in search results. On Google, you may find a message similar to this: "In order to show you the most relevant results, we have omitted some entries very similar to those already displayed."
If a user clicks the link to repeat the search, they will come across these missing, duplicate-content pages. The chance of a user actually clicking this link, however, is basically nil, as the message is shown on the last search page (yes, page 8042 or however many pages a search might return). Plus, if a user has already found one version of the content, why would they need a repeat of it? This is one way Google refines the user experience of its search engine, and rightly so. So, how is your site affected by this? There are many ways your site can be affected by the way Google handles duplicate content:
- Lose Your Original Content to Omitted Results: If your original blog post has been syndicated onto many third-party websites without a link back to your content, there is a good chance that your original content will be omitted and replaced by theirs. This is especially true if the third-party site has a higher PageRank, higher influence and/or higher-quality backlinks than your site.
- Waste of Indexing Time for Bots: While indexing your site, search engine bots treat every link as unique and index the content on each of them. If you have duplicate links due to session IDs or any of the reasons mentioned above, the bots waste their time indexing repeat content rather than indexing other unique content on your site.
- Multiple Duplicate Links Means Diluted Link Juice: If you build links pointing to a page that has multiple URLs, the passing link juice is distributed among them. If all the pages are consolidated into one, the link juice will also be consolidated which could increase the search rankings of the web page. For more information, see SEO Guide to The Flow of Link Juice.
- Traffic Loss: It is obvious that if your content is not the version Google chooses to show in search results, you will lose valuable traffic to your site.
How Can You Detect Duplicate Content on Your Site?
The simplest and most logical method is to copy and paste a snippet of your content into Google search and see if any other page shows up with exactly the same content. There are other ways as well, and they are as follows:
1. Google Search Console:
Duplicate content is not limited to content present on a web page; it can also be content seen in search snippets, such as meta titles and meta descriptions. Duplication of such content can be detected easily in Google Search Console under Optimization > HTML Improvements.
2. External Tools:
Copyscape.com is an excellent tool for checking for duplicate content on your site. It is free for basic checks and runs in the browser, so it works on any platform.
3. "Site:" Search Operator:
Search Google using the site: search operator along with part of the content from the page, as follows:
site:www.yoursite.com [a part of the content copied from your site here]
If you see a message from Google about omitted results (as quoted earlier), it is an indication that duplicate content is present on your website or outside of it.
So, the final question is...
How Can You Get Rid of Duplicate Content? Here are 8 ways:
Removing duplicate content from your site is possible, and it is worth the time and effort to make your site as search-engine friendly as possible. Duplicate content on other sites that syndicate your original work can be handled however you prefer: either by sending the site a polite email, or by leaving a comment on their post requesting credit and a link to your original content.
The following are ways to cope with duplicate content generated on your own site:
1. Rel="canonical":
If you use a content management system, syndicate content or have an ecommerce shopping site, it's easy to wind up with multiple URLs or domains all pointing to the same content. To combat this, tell search engines where to find the original using the rel="canonical" tag. When a search engine sees this annotation, it knows the current page is a copy and where to find the canonical content.
How do I do it?
Start by deciding which URL you want to be canonical. In general, you should pick your best-optimized URL as your canonical URL.
To properly tell a search engine that content is copied from your canonical URL, place the rel="canonical" annotation in the <head> of your page. It should look like this:
<link rel="canonical" href="<https://www.example.com>"
If you've got a non-HTML version of a document (like a PDF available for download), you can include the canonical reference in the HTTP header, like this:
Link: <https://www.example.com/document.html>; rel="canonical"
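To make this concrete, here is a minimal sketch for a page that is reachable both at its clean URL and at a tracking-parameter URL; the /product path and the ?ref= parameter are made-up examples:

On the duplicate page at https://www.example.com/product?ref=affiliate, the <head> would contain:
<link rel="canonical" href="https://www.example.com/product" />

Search engines can then consolidate signals from both URLs onto https://www.example.com/product.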
What could go wrong?
While the rel="canonical" tag seems simple enough to implement, getting it wrong can have a major impact on your search performance. There are a few common misapplications of canonicalization that you need to be sure to avoid:
- Paginated content all pointing to page one: When you add the canonical annotation to paginated content, match your page 1 URL to your canonical page 1 URL, page 2 to page 2, etc. We'll cover this in a bit more detail later.
- Canonical URLs that are not 100% exact matches: If your site uses protocol relative links, leaving off http/https will still result in search engines seeing duplicate content at those two addresses. Always make your preferred URLs 100% exact matches.
- Pointing to canonical URLs that return a 404 error: Search engines will ignore tags that point to a dead page.
- Multiple canonical tags: Search engines only support one rel="canonical" annotation per page. You can end up with multiple when a webmaster copies a page template that already includes rel="canonical" or a plugin inserts a rel="canonical" automatically. In cases of multiple canonical tags, Google will simply ignore all of them.
2. Hreflang
Introduced by Google in 2011, the hreflang tag lets you tell a search engine that a page is related to other pages in different languages and/or regions. If your website is https://example.com, and you've got the same page in Spanish at https://example.com/es/, use the hreflang tag to tell search engines to serve that page to Spanish-speaking searchers.
It's important to note that hreflang is a factor, not a directive, in search results. So if you have pages that are too similar (like English pages targeting the US and Canada) you run the risk of the wrong version ranking for a search term. Multilingual sites need to be a part of your overall marketing strategy.
How do I do it?
The hreflang annotation is implemented in the <head> section of an HTML page. For non-HTML pages the tag can be placed in the HTTP header. When done correctly, the hreflang tag should look like this:
- HTML: <link rel="alternate" hreflang="en" href="https://www.example.com/" />
- HTTP: Link: <https://www.example.com/>; rel="alternate"; hreflang="en"
You must include links to every version of your page, including the page itself. If you have English, Spanish and French copies, place links to all three in the page's <head>, as in the sketch below.
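As a minimal sketch, assuming the Spanish and French copies live at hypothetical /es/ and /fr/ paths, each of the three pages would carry this identical block (the first line being the self-reference for the English page):

<link rel="alternate" hreflang="en" href="https://www.example.com/" />
<link rel="alternate" hreflang="es" href="https://www.example.com/es/" />
<link rel="alternate" hreflang="fr" href="https://www.example.com/fr/" />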
If you have two or more pages in the same language but targeted to different geographies (say, the US, Canada and UK), you can extend the hreflang value to include the country code, like this:
<link rel="alternate" hreflang="en-us" href="<https://www.example.com>">
<link rel="alternate" hreflang="en-ca" href="[https://www.example..com/ca](about:blank)">
<link rel="alternate" hreflang="en-gb" href="<https://www.example.com/uk>">
If you've got a non-HTML page in multiple languages, separate each hreflang annotation using commas like this:
Link: <https://www.example.com/>; rel="alternate"; hreflang="en-us",
      <https://www.example.com/ca/>; rel="alternate"; hreflang="en-ca",
      <https://www.example.com/uk/>; rel="alternate"; hreflang="en-gb"
There's also a third option to implement hreflang tags: your XML sitemap. Instead of adding markup to your pages, include the foreign language versions of your URLs in your sitemap. Just like with the other annotations, include a URL for each language.
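As a sketch, a sitemap entry for the US/Canada/UK pages above would look something like this; note the xhtml namespace declaration on the <urlset> element, and that each URL in the set gets its own <url> entry carrying the same block of alternates:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.example.com/</loc>
    <xhtml:link rel="alternate" hreflang="en-us" href="https://www.example.com/" />
    <xhtml:link rel="alternate" hreflang="en-ca" href="https://www.example.com/ca/" />
    <xhtml:link rel="alternate" hreflang="en-gb" href="https://www.example.com/uk/" />
  </url>
  <!-- repeat a <url> entry for /ca/ and /uk/ with the same three alternates -->
</urlset>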
What could go wrong?
A common problem when inserting hreflang annotations is "Return Tag Errors." These errors come from hreflang annotations that don't link to each other. Annotations are a two-way street: if your English page links to your German page, your German page must link back to your English page. Possibly the most common Return Tag Error is omitting the self-reference; your English page needs to link to itself.
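For example, a correct English/German pair would carry identical annotations on both pages, each including the self-reference (the /de/ path here is a made-up example):

On both https://www.example.com/ and https://www.example.com/de/:
<link rel="alternate" hreflang="en" href="https://www.example.com/" />
<link rel="alternate" hreflang="de" href="https://www.example.com/de/" />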
To check for Return Tag Errors, look in Google Search Console's International Targeting data under Search Traffic. This will tell you how many hreflang tags Google found and how many have errors.
Another common problem implementing hreflang annotations is incorrect language or country codes. The hreflang value must be in ISO 639-1 format for language and ISO 3166-1 Alpha 2 format for country. Using 'uk' for the United Kingdom is the most common culprit; in this system the value should be 'gb' for Great Britain. Note that your hreflang value must start with the language code and that region targeting is limited to countries - you can't target the European Union or North America, for example.
3. 301 Redirects:
You can use 301 redirects on duplicate pages that are automatically generated and are not necessary for the user to see. Unlike a rel="canonical" tag, which keeps the duplicate page visible to users, a 301 redirect points both search engine bots and users to the preferred page only. A typical use is redirecting your home page from the www URL to the non-www URL or vice versa, depending on which version is used most. Similarly, if you have duplicate content on multiple websites with different domain names, you could redirect the pages to one URL using a 301 redirect. NOTE: 301 redirects are permanent, so please be careful when you choose your preferred URL.
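However you implement the redirect on your server, the end result should be a response like this when a user or bot requests the non-preferred URL (this sketch assumes the www version is your preferred one):

GET / HTTP/1.1
Host: example.com

HTTP/1.1 301 Moved Permanently
Location: https://www.example.com/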
4. Meta Robots Tag
You can use the meta robots tag with the noindex and/or nofollow values if you need to keep a duplicate page out of a search engine's index. Simply add the following code to the duplicate page:
<meta name="robots" content="noindex">
There is another way of excluding duplicate pages from search engine indexes: disallowing URLs that contain special characters in your robots.txt file. Note: Google advises against blocking duplicate content with robots.txt, because if a URL is completely blocked, search engine bots may still find it outside your website via links and treat it as a unique page. Search engines may then even choose the blocked URL as the preferred page among all the duplicates, even though that was not your intention.
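If you decide to use robots.txt despite that caveat, a minimal sketch for blocking the session-ID URLs from the earlier examples (assuming the parameter is literally named sessionid) would be:

User-agent: *
Disallow: /*?sessionid=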
5. Google Search Console:
You can set URL parameters to keep duplicate pages out of Googlebot's index. This option is available under Configuration, in the URL Parameters sub-section. However, using this option may cause de-indexing of important pages if it is not properly configured, so it is not recommended unless you are entirely sure how to do it. Learn more about URL parameters in our blog on Clean URLs for SEO and Usability.
6. Hash Tag Tracking:
Instead of using tracking parameters in URLs (which creates duplicate pages with the same content), try the hash tag tracking method. Tracking parameters are used to track visits from specific sites to your site, for example, from an affiliate marketer's site. These parameters usually appear after a question mark (?) in the URL. With the hash tag method, we remove the question mark and use a hash tag (#) instead. Why? Well, Google bots tend to ignore anything present after a hash tag. So, for example, you might have duplicate URLs like http://yoursite.com/product/ and http://yoursite.com/product/#utm_source=xyz. When you use the hash tag, Google sees both links as http://yoursite.com/product/. To do this, use Google Analytics' _setAllowAnchor method, as illustrated below.
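As a sketch using the classic ga.js (asynchronous) tracking snippet, where UA-XXXXX-Y is a placeholder for your own account ID, enabling _setAllowAnchor tells Google Analytics to read campaign parameters from the hash fragment instead of the query string:

<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-XXXXX-Y']); // placeholder account ID
  _gaq.push(['_setAllowAnchor', true]);     // read campaign params after # instead of ?
  _gaq.push(['_trackPageview']);
</script>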
7. Content on Country-Specific Top-Level-Domains:
When you have business locations spread all over the world, it is natural to have a separate domain for each location, and it is often not possible to create unique content for each of these sites when the product/service is the same. So how do you handle content duplication across your country-specific domains? To start with, go to Google Search Console > Configuration > Settings for each of the country-specific domains and choose the country of the target audience for each site.
- If possible, use a local server for each country-specific domain.
- Enter local addresses and phone numbers on each of the country-specific sites.
- Use geo meta tags (see the sketch after this list). These tags may not be used by Google, as you have already set the target-country option in Google Search Console, but they may come in handy for letting secondary search engines, such as Bing, know that your site targets a specific country.
- Use rel="alternate" hreflang="x" annotations (documented at https://support.google.com/webmasters/answer/189077) to let Google bots know more about your foreign pages with the same content and to show which page should be returned for which audience in search results.
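As a sketch, the informal geo meta tags (a convention, not an official standard) for a site targeting the United Kingdom might look like this:

<meta name="geo.region" content="GB" />
<meta name="geo.placename" content="London" />
<meta name="geo.position" content="51.5074;-0.1278" />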
Some SEOs may suggest using rel="canonical" to cope with cross-domain duplicates, but it is not yet clear whether using it to consolidate multi-domain pages is the right solution, as geo-targeted sites need to show up in search results for their respective country-specific searches. For now we recommend clarifying that your content is geo-targeted so that search engines know which content to show to which audience, avoiding confusion.
8. Paginated Content:
When you have content with cohesive components spread across multiple pages and you want to send users to specific pages via search results, use rel="next" and rel="prev" to let search engines know that these pages are part of a sequence, as in the sketch below. Learn more about implementing these rel attributes on the Google Webmaster Central blog on Pagination with rel="next" and rel="prev". There is another sort of pagination when it comes to blog comments: disable comment pagination in your CMS, otherwise (on most sites) different URLs with the same content will be created.
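As a sketch, page 2 of a hypothetical three-page article served at /article?page=N would declare its neighbors in the <head> like this:

<link rel="prev" href="https://www.example.com/article?page=1" />
<link rel="next" href="https://www.example.com/article?page=3" />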
Note: Once you have used these strategies to get rid of duplicate content, remember to update your XML Sitemap by removing duplicate URLs and leaving only the canonical URLs, then re-submit the Sitemap to Google Search Console. Read our guide to XML Sitemaps for more information.
There are also a few things you can do to fight duplicate content on your site regularly. For example, improve your internal linking, and link to preferred domains. As more links are found pointing to preferred URLs it becomes easier for search engines to judge which is the preferred page. Also, on e-commerce sites, when you have products that are categorized based on colors, sizes or anything else, every time a user clicks the size or color the URL changes due to a sorting parameter, and this creates duplicate content. In such cases, provide the option to choose selection criteria on the same page, such that the URL does not change.
Let us know in the comments if you have any questions about duplicate content on your site or if you have any suggestions for coping with duplicate content that have not been mentioned in this blog.