WordPress is a very search-engine-friendly CMS, but out of the box it also generates a lot of 'duplicate content'. It can come from archive pages, author pages, WordPress pages, trackback URLs, post feed URLs and so on. Duplicate content is frowned upon by all search engines, and by Google in particular. Some SEOs suggest that it can lead to a penalty from Google; others suggest that Google simply filters out the duplicates and ranks only the pages it deems most appropriate in the SERPs. In any case, I think duplicate content, meaning two or more pages on a single domain or on different domains with the same or nearly identical content, adds no value for either the readers of your blog or the search engines.
I am not an SEO, but along the way I have learned some basics for avoiding duplicate content on my WordPress blog using the robots.txt file. The robots.txt file tells search bots what to crawl on your blog and what not to. In my opinion, an ideal robots.txt for a WordPress blog should keep search bots from crawling the directories and URLs below.
User-agent: *
Disallow: /wp-content/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /cgi-bin/
Disallow: /tag/
Disallow: /page/
Allow: /wp-content/uploads/
User-agent: * is a wildcard that matches all robots, so the rules apply to every bot. They stop bots from crawling directories that hold files such as the WordPress configuration and admin files, which should not be indexed. Disallowing feed, trackback, tag and page URLs is also a good idea, as it helps prevent duplicate content. The Allow line for the uploads folder is there so that search engines can still index your images.
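If you want to try rules like these before uploading them, a rough check with Python's standard urllib.robotparser module might look like the sketch below. The domain example.com, the bot name ExampleBot and the sample paths are made up for illustration. One assumption to be aware of: as far as I know, Python's parser applies the first rule that matches, while Google applies the most specific (longest) matching rule, so the Allow line is moved to the top here to get the same result.

```python
from urllib.robotparser import RobotFileParser

# Rules from the post. The Allow line is listed first because Python's
# parser (assumption: first-match behaviour) stops at the first rule that
# applies, whereas Google picks the longest matching rule regardless of order.
RULES = """\
User-agent: *
Allow: /wp-content/uploads/
Disallow: /wp-content/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-
Disallow: /feed/
Disallow: /trackback/
Disallow: /cgi-bin/
Disallow: /tag/
Disallow: /page/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def is_allowed(path, agent="ExampleBot"):
    """Return True if the hypothetical bot may crawl this path."""
    return parser.can_fetch(agent, "https://example.com" + path)


# Hypothetical paths, chosen only to exercise the rules above.
SAMPLE_PATHS = [
    "/my-post/",
    "/wp-admin/",
    "/wp-content/themes/demo/style.css",
    "/wp-content/uploads/photo.jpg",
    "/tag/seo/",
    "/my-post/feed/",
]

for path in SAMPLE_PATHS:
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Note that /my-post/feed/ still comes out as allowed, because Disallow: /feed/ is a plain prefix rule and only matches URLs that start with /feed/.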
To the best of my knowledge, the bots of all three major search engines, Google, Yahoo and Bing, respect wildcards, so here are some examples of using wildcards to keep certain files out of the SERPs.
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.cgi$
Disallow: /*.xhtml$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.txt$
Disallow: /*.php*
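The above directives stop search bots from crawling files ending with these extensions; the $ marks the end of the URL, and the last line also covers .php URLs with anything after the extension. The idea is that search bots have no need to see your CSS and JavaScript files, although Google now renders pages much like a browser and recommends leaving those files crawlable, so treat those two lines as optional.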
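To get a feel for what * and $ do in these patterns, here is a small Python sketch that turns a robots.txt pattern into a regular expression. It assumes the wildcard behaviour the big search engines document (* matches any run of characters, a trailing $ pins the match to the end of the URL). The helper names pattern_to_regex and matches are my own, and this is not how any particular bot is actually implemented.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any sequence of characters and a
    trailing '$' anchors the pattern to the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """Return True if the pattern matches from the start of the path."""
    return pattern_to_regex(pattern).match(path) is not None


# Hypothetical URLs, only to show the difference between '$' and '*'.
for pattern, path in [
    ("/*.php$", "/index.php"),
    ("/*.php$", "/index.php?p=1"),
    ("/*.php*", "/index.php?p=1"),
    ("/*.css$", "/wp-content/themes/demo/style.css"),
]:
    print(pattern, path, matches(pattern, path))
```

The second pair is the interesting one: /*.php$ does not cover /index.php?p=1 because the URL does not end in .php, which is why the list also includes /*.php*.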
Disallow: */feed*
Disallow: */trackback*
Disallow: */comment-page
Disallow: /*?*
Disallow: /wp-*
These directives help eliminate duplicate content issues on your WordPress blog from feeds, trackbacks and comment pages (if you have enabled comment pagination). /*?* means do not crawl any URL containing a ?. Use it only if you are sure that none of your post URLs contain a ? and you have switched to a pretty permalink structure. With pretty permalinks, the remaining URLs containing ? are mostly generated by search queries on your blog. Don't just copy and paste the above directives into your robots.txt file. They are a general guideline, and you can use all or any of them depending on your needs. For a better understanding, you can have a look at my robots.txt file.
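Before adding /*?*, it may be worth checking whether any URL you care about contains a ?. The Python sketch below does that for a list of URLs. The function name would_be_blocked and the sample URLs are invented for illustration; in practice you would feed it the URLs from your own sitemap or post list.

```python
def would_be_blocked(url):
    """Return True if a rule like 'Disallow: /*?*' would match this URL."""
    # Fragments (#...) are not sent to the server, so ignore them.
    without_fragment = url.split("#", 1)[0]
    return "?" in without_fragment


# Hypothetical URLs: one pretty permalink, one default permalink,
# and one on-site search query.
SAMPLE_URLS = [
    "https://example.com/my-post/",
    "https://example.com/?p=123",
    "https://example.com/?s=robots",
]

blocked = [url for url in SAMPLE_URLS if would_be_blocked(url)]
for url in blocked:
    print("would be blocked:", url)
```

If a real post URL shows up in the output, as the ?p=123 style default permalink does here, switch to pretty permalinks first or leave the /*?* line out.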


