What is Duplicate Content?
Duplicate content is content which is available on multiple URLs on the web or on a site. Since more than one URL shows the same content, search engines don’t know which URL to rank higher in the search results. Therefore, in most cases, they rank both URLs lower and give preference to other webpages.
Google defines duplicate content as:
“Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.”
Duplicate content is everywhere, and it can appear in multiple locations on or off your site. For example, your site could be available on both www and non-www, on both HTTP and HTTPS, or on all of these at the same time. Your CMS might use excessive dynamic URL parameters that confuse search engines. Your AMP pages could count as duplicate content if they are not linked properly. Even the pages of a multilingual website can count as duplicate content when their main content is left untranslated.
Duplicate Content Vs. Copied Content
Beginners often find it hard to distinguish duplicate content from copied content. Copied content is any content that someone copied from another domain. It doesn’t matter if you give it a little spin or put in a few keywords; this behaviour is not acceptable to Google.
Google’s guidelines treat the following as copied content:
- Content copied exactly from an identifiable source. Sometimes an entire page is copied, and sometimes just parts of the page are copied. Sometimes multiple pages are copied and then pasted together into a single page. Copied text that exactly matches another website is usually the easiest type of copied content to identify.
- Content which is copied, but changed slightly from the original. This type of copying makes it difficult to find the exact matching original source. Some people change a few words. Other times, people will change whole sentences. For example, someone makes a “find and replace” modification, where they replace one word with another throughout the text. People deliberately make these types of changes so that it is more difficult to find the original source of the content. We call this kind of content “copied with minimal alteration.”
- Content copied from a changing source, such as a search results page or news feed. You often will not be able to find an exact matching original source if it is a copy of “dynamic” content (content which changes frequently). However, we will still consider this to be copied content.
Important: The Lowest rating is appropriate if all or almost all of the MC on the page is copied with little or no time, effort, expertise, manual curation, or added value for users. Google rates such pages as Lowest, even if the page assigns credit for the content to another source.
What are the Main Causes of Duplicate Content?
There are many reasons for duplicate content, and most of them are technical. Duplicate content is mainly a problem because search engine algorithms have to pick one URL out of several that show the same content.
In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate Google rankings and deceive users, Google will make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.
It’s not very often that a human decides to put the same content in two different places without making clear which is the original; it feels unnatural to most of us.
The technical reasons are many, though, and they mostly arise because developers don’t think like a browser or even a user, let alone a search engine spider; they think like a programmer.
Indexing Categories and Tags
Let’s say your article about ‘keyword a’ appears at
https://www.techurdu.net/keyword-a/ and the same content also appears at
https://www.techurdu.net/article-category/keyword-a/. The first URL is your main post URL, while the second is generated when you let search engines index your categories. Both lead to the same post, but they are different URLs.
If you ask the developer, they will say the article only exists once.
Then let’s say your article has been picked up by several bloggers and some of them link to the first URL, while others link to the second. This is when the problem for your site occurs.
Duplicate content is a real problem here because those links promote two different URLs. If they were all linking to the same URL, your chances of ranking for ‘keyword a’ would be higher.
That is the most common cause of duplicate content, and it is not a mistake on the search engines’ part. It is usually an instance of internal duplicate content.
Another mistake, common among beginners, is when the publisher uses the same URL for two different posts. For instance, one post has the URL
https://www.techurdu.net/keyword-a/ and another post has exactly the same URL
https://www.techurdu.net/keyword-a/. Although the two posts contain different content, the search engine finds two identical URLs and doesn’t know which one to rank higher.
Internal search queries are yet another example of internal duplicate content. For example, a search at
https://www.techurdu.net/?s=keyword-a generates a dynamic URL (different from the original one). Because that search query creates a new URL for content that already exists elsewhere, it creates duplicate content.
Not Using Excerpts
When a WordPress blog doesn’t use excerpts, it shows the entire blog post on the blog’s homepage. That means the blog post is available on at least two pages: the homepage and the post itself.
And it’s probably on the category and tag overview pages as well. That’s four versions of the same article on your own website already.
Session IDs are a common cause on shopping sites, and another source of internal duplicate content.
You often want to keep track of your visitors and allow them, for instance, to store items they want to buy in a shopping cart. In order to do that, you have to give them a Session.
A session is the brief history of what the visitor did on your site and can contain things like the items in their shopping cart.
To maintain that session as a visitor clicks from one page to another, the unique identifier for that session – called the Session ID – needs to be stored somewhere. The most common solution is to do that with cookies. However, search engines don’t usually store cookies.
This is where things get interesting: at that point, some systems fall back to using Session IDs in the URL. This means that every internal link on the website gets that Session ID added to its URL, and because that Session ID is unique to that session, it creates a new URL, and therefore duplicate content.
Another cause of duplicate content is using URL parameters that do not change the content of a page, for instance in tracking links. You see, to a search engine,
https://www.example.com/keyword-a/ and
https://www.example.com/keyword-a/?source=rss are not the same URL. The latter might allow you to track what source people came from, but it might also make it harder for you to rank well – very much an unwanted side effect!
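To see how parameters that do not change the content multiply URLs, here is a minimal Python sketch that strips them again. The parameter names in NON_CONTENT_PARAMS are assumptions chosen for illustration; your own site may use different tracking or session parameters.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: parameters that never change what the page shows.
# Adjust this set to the tracking/session parameters your own site uses.
NON_CONTENT_PARAMS = {"source", "utm_source", "utm_medium", "utm_campaign", "sessionid"}


def strip_non_content_params(url: str) -> str:
    """Return the URL without tracking or session parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CONTENT_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_non_content_params("https://www.example.com/keyword-a/?source=rss"))
```

Both the tracked and the untracked URL reduce to the same address, which is the one you would want search engines to treat as the canonical one.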
Content Syndication is when web-based content is re-published by a third-party website. Any kind of digital content can be syndicated, including blog posts, articles, infographics, videos and more. Think of it as a kind of barter arrangement. The third-party website gets free, relevant content.
Most of the reasons for duplicate content are either your own ‘fault’ or that of your website. Sometimes, however, other websites use your content, with or without your consent. They don’t always link to your original article, and therefore the search engine doesn’t ‘get’ it and has to deal with yet another version of the same article.
If you syndicate your content on other sites, Google will always show the version they think is most appropriate for users in each given search, which may or may not be the version you’d prefer. However, it is helpful to ensure that each site on which your content is syndicated includes a link back to your original article.
You can also ask those who use your syndicated material to use the noindex meta tag to prevent search engines from indexing their version of the content.
Remember, the more popular your site gets, the more you face this problem. So, try to resolve this issue on a continuous basis.
Order of Parameters
Another common cause is that a CMS (Content Management System) doesn’t use nice clean URLs, but rather URLs like
/?id=1&tree=2, where ID refers to the article and tree refers to the category. The URL
/?tree=2&id=1 will render the same results in most website systems, but they’re completely different for a search engine.
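A small Python sketch shows why parameter order matters and how a system could neutralise it: sorting the parameters gives both URLs the same form. The function name is an assumption made for this illustration, not part of any CMS.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def sort_query_params(url: str) -> str:
    """Return the URL with its query parameters in alphabetical order."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(params)))


first = sort_query_params("/?id=1&tree=2")
second = sort_query_params("/?tree=2&id=1")
print(first, second, first == second)
```

As plain strings the two URLs differ, which is all a search engine sees; after sorting they are identical.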
In WordPress, there is an option to paginate your comments. This leads to the content being duplicated across the article URL, and the article URL + /comment-page-1/, /comment-page-2/ etc.
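As a rough illustration, the paginated comment URLs can be mapped back to the article URL with a single pattern. This is only a sketch in Python; it assumes the /comment-page-N/ suffix described above and is not WordPress code.

```python
import re

# Matches a trailing /comment-page-<number>/ segment (trailing slash optional).
COMMENT_PAGE = re.compile(r"/comment-page-\d+/?$")


def strip_comment_page(path: str) -> str:
    """Return the article path for a paginated comment path."""
    return COMMENT_PAGE.sub("/", path)


print(strip_comment_page("/keyword-a/comment-page-2/"))
```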
This one is really interesting. If your CMS creates printer-friendly pages and you link to those from your article pages, Google will usually find them, unless you specifically block them.
So, if your site has a “regular” and “printer” version of each article, and neither of these is blocked with a noindex meta tag, Google will choose one of them to list.
Now, ask yourself: Which version do you want Google to show? The one with your ads and peripheral content, or the one that only shows your article?
WWW vs. non-WWW
This is one of the oldest in the book, but sometimes search engines still get it wrong: WWW vs. non-WWW duplicate content, when both versions of your site are accessible.
HTTP vs. HTTPS
Another, less common situation but one I’ve seen as well is HTTP vs. HTTPS duplicate content, where the same content is served out over both.
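The WWW and HTTPS issues combine: one page can be reachable at up to four addresses. The following Python sketch lists them for a given URL so you can check each one by hand; the function name is hypothetical, and the code needs Python 3.9 or later.

```python
from urllib.parse import urlsplit, urlunsplit


def scheme_host_variants(url: str) -> list[str]:
    """List the HTTP/HTTPS and WWW/non-WWW variants of a URL."""
    parts = urlsplit(url)
    bare_host = parts.netloc.removeprefix("www.")
    variants = []
    for scheme in ("http", "https"):
        for host in (bare_host, "www." + bare_host):
            variants.append(urlunsplit(parts._replace(scheme=scheme, netloc=host)))
    return variants


for variant in scheme_host_variants("https://www.techurdu.net/keyword-a/"):
    print(variant)
```

If more than one of these addresses serves the page directly instead of redirecting, the same content is live on several URLs.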
[VIDEO] Duplicate Content – Effects, Causes, Identification and Solutions | Part – 1 [Urdu/Hindi]
Watch this video for a detailed look at what duplicate content is, how it differs from copied content, and what its main causes are. The video language is Hindi/Urdu.
How to Identify Duplicate Content Issues?
Now you know the main causes of duplicate content. Here comes the main question: how do you identify duplicate content issues on your site or with your content?
Using Google to Spot Duplicate Content within Your Content
Using Google is one of the easiest ways to spot duplicate content.
If you want to find all the URLs on your site that contain your keyword A article, type the following search phrase into Google:
site:techurdu.net intitle:"Keyword A"
(Use your own domain instead of techurdu.net.) Google will then show you all pages on techurdu.net whose title contains that keyword.
The more specific you make that intitle part of the query, the easier it is to weed out duplicate content.
Using Google to Spot Duplicate Content across the Web
You can use the same method to identify duplicate content across the web. Let’s say the full title of your article was ‘Keyword A – why trees are important’, you’d search for:
intitle:"Keyword a - why trees are important"
And Google would give you all sites that match that title. Sometimes it’s worth even searching for one or two complete sentences from your article, as some scrapers might change the title a little bit.
Another Simple Way to Use Google
If you have a certain page that you’d like to check, simply go to that page. Copy a text snippet, preferably from a section that you think might be attractive for others to copy (note that Google only takes the first 32 words into account). Paste the exact snippet into Google between double quotation marks and you’ll see all the sites containing the exact text. Visit each site to see how your content has been used. If more than 40% of your content has been copied, you need to contact that site owner on a priority basis.
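The search phrases above are easy to build by hand, but a short Python sketch makes the pattern explicit. Both function names are assumptions for this example; the 32-word cut-off mirrors the limit mentioned above.

```python
def site_title_query(site: str, title: str) -> str:
    """Build a query for pages on one site whose title contains the phrase."""
    return f'site:{site} intitle:"{title}"'


def exact_match_query(snippet: str, max_words: int = 32) -> str:
    """Build an exact-match query from the first max_words words of a snippet."""
    words = snippet.split()[:max_words]
    return '"' + " ".join(words) + '"'


print(site_title_query("techurdu.net", "Keyword A"))
print(exact_match_query("Duplicate content is content which is available on multiple URLs"))
```

Paste the printed phrases into Google’s search box as they are.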
Using Duplicate Content Checker Tools
There are a lot of tools to find duplicate content. One of the best-known duplicate content checkers is probably Copyscape (copyscape.com). It’s one of many tools, but it is free and easy to use. It works simply: insert a link in the box on the homepage, and Copyscape will return a number of results, presented a bit like Google’s search result pages.
You can click the results for more details and to see which parts of your text are duplicated. Copyscape tells you how many words, and what percentage of the post, were copied.
Copyscape clearly highlights the text it found to be duplicated, which gives an idea of how severe the copying is.
If it’s just a small percentage of the page, that is OK with me. If it’s over 40% and makes up quite a large part of the other page, I simply email the site owner and ask them to change the copied text, link to my original post with a dofollow link, or add my URL as the Canonical URL.
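If you want a rough number of your own, the share of copied text can be estimated by comparing overlapping word sequences (“shingles”). The Python sketch below is not how Copyscape works; it is a simple stand-in, and the five-word shingle size is an assumption.

```python
import re


def shingles(text: str, size: int = 5) -> set[tuple[str, ...]]:
    """Return the set of overlapping word sequences of the given size."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def copied_share(original: str, other: str, size: int = 5) -> float:
    """Fraction of the original's shingles that also appear in the other text."""
    mine = shingles(original, size)
    if not mine:
        return 0.0
    return len(mine & shingles(other, size)) / len(mine)


original = "duplicate content is content which is available on multiple urls on the web"
other = "they say duplicate content is content which is available on many sites"
print(f"{copied_share(original, other):.0%} of the original appears in the other text")
```

A result above 0.4 would correspond to the 40% rule of thumb used in this post.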
Keep in mind that you won’t get unlimited scans for one website. If you want to dive a bit deeper into your duplicate content, Copyscape also offers a premium version with more insights.
Siteliner is a duplicate content checker that searches for internal duplicate content (i.e. duplicate content within your own site). It will show you a lot of things, but the free version is limited to 250 pages and one scan every 30 days. Again, there is a premium version, but the free one will work fine.
Once you do the search, you’ll end up on the overview page. You’ll see the percentage of internal duplicate content at the top left.
How to Solve Duplicate Content Issues?
Ex-Googler Matt Cutts said that roughly 25% to 30% of the web consists of duplicate content. There is no denying that duplicate content continues to pop up on every site. The more popular a site gets, the more it faces the problem.
Google is smart at discovering and handling duplicate content. It figures out what to do with most of the duplicate content it finds.
If it finds multiple versions of a URL, it will fold these into the version it finds best; in most cases, this will be the original article/page. To do so, Google needs complete access to these URLs. If you block Googlebot in your robots.txt from crawling these URLs, it cannot figure these things out by itself, and you run the risk of Google treating these pages as separate instances. So, never block crawlers from the duplicate content on your site.
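You can check whether your robots.txt accidentally blocks duplicate URLs with Python’s standard urllib.robotparser module. The rules and URLs below are made up for illustration; in particular, the /print/ path is an assumption.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the URLs that the given robots.txt rules block for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Hypothetical rules that hide printer-friendly pages from all crawlers.
rules = """User-agent: *
Disallow: /print/
"""

urls = [
    "https://www.example.com/keyword-a/",
    "https://www.example.com/print/keyword-a/",
]
print(blocked_urls(rules, urls))
```

Any URL this reports is one that Google cannot crawl and therefore cannot fold into its canonical version.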
So, what do you need to do to fix these issues? Follow the solutions given below to resolve duplicate content issues on and off your site.
Before we dive into the solutions to duplicate content issues, we first need to understand the Canonical URL.
If you have two similar pages, and both are eligible to rank for a certain keyword, the search engine simply doesn’t know which of the two URLs it should send the traffic to or rank higher. To solve this, you can select a preferred URL. This is what we call the Canonical URL.
A canonical URL is a technical solution for duplicate content. Suppose a product on your site is attached to two categories and therefore exists under two URLs, one per category.
If these URLs are both for the same product, choosing one as the canonical URL tells Google and other search engines which one to show and rank in the search results.
Canonicalization also enables you to point search engines to the original version of an article. Let’s say you’ve written a post for another party that is published on their website. If you’d like to post it on your site too, you could agree on posting it with a canonical pointing to the original version. This option is available in all major WordPress SEO plugins, as explained in detail in the video above.
Once you’ve decided which URL is the Canonical URL for your piece of content or keyword, you have to start a process of canonicalization.
There are many ways of solving the problem; we’ll discuss them one by one:
By Avoiding Duplicate Content
Some of the causes of duplicate content mentioned above have really simple solutions:
- Use excerpts instead of showing the entire blog post on the blog’s homepage.
- Disable Session IDs in your URLs. These can often just be disabled in your system’s settings.
- Duplicate printer-friendly pages are completely unnecessary: you should just use a print style sheet.
- Disable the comment pagination feature (under Settings » Discussion) on WordPress sites.
- In most cases, you can use hashtag-based campaign tracking instead of parameter-based campaign tracking to resolve tracking link issues.
- Have you got WWW vs. non-WWW issues? Pick one and stick with it by redirecting the one to the other. You can also set a preference in Google Search Console (formerly Google Webmaster Tools), but you’ll have to claim both versions of the domain name.
301 Redirecting Duplicate Content
In certain cases, it’s impossible to entirely prevent the system you’re using from creating wrong URLs for content, but sometimes it is possible to redirect them.
If you do get rid of some of the duplicate content issues, make sure that you redirect all the old duplicate content URLs to the proper canonical URLs. For this, you can use a popular WordPress plugin like Redirection, or an SEO plugin like Yoast or Rank Math.
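The logic behind such a redirect is simple. The Python sketch below decides whether a URL on your own site should answer with a 301 and where it should point; it is not what any of these plugins run internally, and the canonical scheme and host are assumptions you would replace with your own.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the one scheme and host every page should live on.
CANONICAL_SCHEME = "https"
CANONICAL_HOST = "www.techurdu.net"


def redirect_for(url: str):
    """Return (301, target) if the URL should redirect, otherwise None."""
    parts = urlsplit(url)
    target = urlunsplit(parts._replace(scheme=CANONICAL_SCHEME, netloc=CANONICAL_HOST))
    return (301, target) if target != url else None


print(redirect_for("http://techurdu.net/keyword-a/"))
print(redirect_for("https://www.techurdu.net/keyword-a/"))
```

The first call yields a 301 to the canonical address; the second returns None because that URL is already the canonical one.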
By Adding a Canonical Link Element to the Duplicate Page
In some cases, you don’t want to or can’t get rid of the duplicate version of an article, even when you know that it’s the wrong URL. To solve this particular issue, the search engines have introduced the canonical link element. It’s placed in the <head> section of your site, and it looks like this:
<link rel="canonical" href="https://techurdu.net/wordpress/seo-plugin/" />
Here, in the href attribute of the canonical link, you place the correct Canonical URL for your article. When a search engine finds this link element, it performs what is effectively a soft 301 redirect, transferring most of the link value gathered by that page to your canonical page.
This process is a bit slower than the 301 redirect, so if you can just do a 301 redirect that would be preferable, as mentioned by Google’s John Mueller.
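To verify that a page really carries the canonical link element you expect, you can read it out of the HTML. Here is a minimal Python sketch using the standard html.parser module; the class and function names are assumptions for this example.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html: str):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


page = '<html><head><link rel="canonical" href="https://techurdu.net/wordpress/seo-plugin/" /></head></html>'
print(find_canonical(page))
```

Run it against the HTML of each duplicate page; every duplicate should report the same canonical URL.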
By Linking back to the Original Content
If you can’t do any of the above, possibly because you don’t control the <head> section of the site your content appears on, adding a link back to the original article on top of or below the article is always a good idea.
You might want to do this in your RSS feed by adding a link back to the article in it. Some scrapers will filter that link out, but others might leave it in. If Google encounters several links pointing to your original article, it will figure out soon enough that that’s the actual canonical version.
Many SEO tools/plugins like Yoast allow you to add an extra line to your feed items (check out the Search appearance > RSS section of the Yoast SEO plugin). That line could say “The article (article title) was first published on (your URL)”.
This ensures that, if people copy content from your website via your RSS feed, there will always be a link back to your website. Google will find that link and understand you are the original source.
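If your feed is generated by your own code rather than by a plugin, the same idea takes a few lines. This Python sketch appends a source line to a feed item’s HTML; it is not Yoast’s implementation, and the function name and markup are assumptions.

```python
from html import escape


def add_source_line(item_html: str, title: str, url: str) -> str:
    """Append a 'first published on' line that links back to the original post."""
    link = f'<a href="{escape(url, quote=True)}">{escape(title)}</a>'
    return f"{item_html}\n<p>The article {link} was first published on {escape(url)}.</p>"


print(add_source_line("<p>Post body</p>", "Keyword A", "https://www.techurdu.net/keyword-a/"))
```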
When to Redirect, when to use a Canonical?
Unlike with redirects, users don’t see your canonical. If you can redirect a URL without breaking your site, you should do it right away. But if redirecting makes your site illogical, setting the canonical is a viable solution.
Adding a Preventive Snippet
Plugins like Yoast support this feature. In the ‘Search Appearance’ > ‘RSS’ section of the Yoast SEO plugin, there is a predefined snippet to add to your feed entries, saying “This article first appeared on yourwebsite.com”. The link in this snippet means that scrapers who republish your feed also republish a link to the original article. This already helps against duplicate content, as Google will find that backlink to your website.
Using Excerpts Instead of Entire Post
Using excerpts (rather than showing the entire post) has the advantage that the excerpt always has a proper link to the post. This link will tell Google that the original content is not on that blog/category/tag page but in the post itself. We often recommend the use of excerpts.
What to do when you find Duplicate Content on other sites?
Things are simple from here. Suppose you know that your content has been duplicated on other sites.
If your content has been copied via RSS, we’ve already discussed how to deal with that in the By Linking back to the Original Content section above.
In the first case, since my content gets copied quite often, I visit the sites and check how the content has been copied, and whether they have given me a link back (especially a dofollow link) or added a Canonical URL. In either case, I let it stay.
In the second case, I see my content on another site, but less than 40% of it has been copied. Even though I haven’t been given a link back, I let that site keep the content, especially if it is a new or newly growing site. But if it is a popular, established site, I contact it and ask for a link back.
In the third case, more than 40% of my content has been copied. I insist on total removal of the content, a complete rewrite, or a Canonical URL pointing to my original post, so that my Google ranking and traffic are not affected by that site or post.
Now comes the question: what if the site publisher refuses all of your demands in any of the above cases? There are some ways to solve the problem.
First, you may have to use your copyright as the original author to have that content removed. Google suggests contacting the host of the website and filing a request at Google as well.
Second, you can seek help from Google AdSense (if the site has AdSense enabled). You can also ask the hosting company to take action.
Third, you can file a lawsuit (as per your country’s prevailing laws). But remember that contacting the site owner first is imperative in every case, and you must let them know the consequences if they refuse.
[VIDEO] Duplicate Content – Effects, Causes, Identification and Solutions (in 2020) | Part – 2[Urdu/Hindi]
In this video, we’ll have a very detailed look at how we can find duplicate content within and outside our content. We’ll also discuss how we can solve or fix these problems of duplicate content. The video language is Hindi/Urdu.
Conclusion – Fix Duplicate Content ASAP
Google does not recommend blocking crawler access to duplicate content on your website, whether with a robots.txt file or other methods. If search engines can’t crawl pages with duplicate content, they can’t automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages. A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects, as explained in this post.
Keep in mind that duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don’t follow the strategies listed above, Google will choose a version of the content to show in its search results.
In conclusion, duplicate content happens everywhere. It’s something you need to constantly keep an eye on, but it is fixable, and the rewards can be plentiful. Your quality content could soar in the rankings, just by getting rid of duplicate content from your site!