A few days ago, Andy Brice did a superb exposé on download site award scams. I’ve long known that the vast majority of download site “awards” are completely bogus, which is why I do not display them on my software pages. However, while the award scam is entertaining, it does minimal damage to authors, except to those foolish enough to wear those fake stars as badges of honor.
Well, here’s another “scam,” and it can cause Micro-ISVs actual damage, hurting their search engine rankings and even preventing their own websites from being indexed by Google.
Softpedia: Content Scraping at its Worst
Let’s start with the thievery. Here is my index page content next to Softpedia’s stolen copy (you can click the images for the full-size screenshots):
(The Softpedia page is currently at http://www.softpedia.com/get/Office-tools/Diary-Organizers-Calendar/Get-em-Done-ToDo.shtml but won’t be there for long. My page is at http://www.getemdone.com)
Basically, you can see that Softpedia just copied and pasted the copyrighted content from my main index page onto their page, starting with “Just a to-do list.” They even kept the first-person comments, as if I were writing for Softpedia! Ah, but they were diligent enough to remove my byline at the end.
This is not PAD file content. Softpedia did not get the text from my software description PAD file — text that they are allowed to copy. My PAD description is here (XML). Even the longest description in the PAD file is only one paragraph.
The damage: Softpedia prevents my own website from being properly indexed by Google!
This is the kicker. Waiting patiently for one’s site to be indexed by Google is something we all have to do. But when I went to Google to see how it was going, imagine my surprise when I saw this:
This is what Google shows when I search for “site:getemdone.com”, which tells Google to list the pages it has indexed from my site. Notice something? The main index page is not indexed! In all the years I have built websites, and I have built many, I have never seen Google completely ignore the main index page of one of my sites while indexing other pages on it. It is always the other way around: Google indexes the main index/home page and eventually gets around to indexing the rest.
Why? Two words: Duplicate content. Google’s efforts to provide good search results for its users include avoiding duplicate content. If its algorithm determines that a page is just a copy of a page elsewhere on the net, it is not going to give it a high priority.
Softpedia’s website has been around for years, and is continually crawled by Google. GetEmDone.Com has only been on the web for a few days. So Google favors the Softpedia text and assumes my site is duplicate content, deferring the indexing of my own page. I am optimistic that Google will eventually determine that Softpedia is the duplicator of the content and that mine is the canonical site.
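For the curious, here is a rough sketch of how a duplicate-content check could work in principle. To be clear, this is not Google’s algorithm, which is not public. It is just a common textbook approach: break each page into overlapping runs of words (“shingles”) and measure how many the two pages share (Jaccard similarity). The function names and the shingle size of 3 are my own choices for illustration.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = text.lower().split()
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, size=3):
    """Score from 0.0 (no shared shingles) to 1.0 (identical shingle sets)."""
    return jaccard(shingles(text_a, size), shingles(text_b, size))
```

A score near 1.0 means the two texts are almost word-for-word the same. Note that a score like this says nothing about which page was the original; a search engine has to decide that some other way, which is exactly where a brand-new site can lose out to an established one.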
The Damage Done: As you can see, at the moment, Google is assuming my own website is nothing more than duplicated content. I took care to make different description text for download sites to use, but the Softpedia “editors” decided they didn’t want the same text as hundreds of other download sites and just copied my main index page.
By the way, you see what that is at the bottom of my original page? A copyright notice. Softpedia has no excuse. They could have simply written their own review, or quoted a paragraph or two. Instead, the thieving scum at Softpedia simply stole my content for their own.
Why do they do this? Simple: Softpedia needs visitors so they can serve Google AdSense ads and get clicks. Notice that the text I wrote is surrounded by Softpedia’s advertising content in an effort to make money off of the plagiarized text.
Download sites are a dime a dozen. Most of them are put online by webmasters trying to earn a quick buck from content ads, and just use text from PAD files. But there’s nothing wrong with that. Authors like myself put descriptions in PAD files exactly for that purpose.
But my website index page is not a PAD file. The text pirates at Softpedia did not request permission from me to copy text from my site onto theirs. They took a copyrighted web page and just scraped the content, harming me in the process. Shame on them.
Are you a software author? Better check and see if Softpedia is stealing your web text just so they can stay on top of the other download sites.
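If you want to do that check with something better than eyeballing, here is a minimal sketch using Python’s standard difflib module. It assumes you have already saved the visible text of your own page and of the download site’s page as plain strings (fetching the pages and stripping the HTML is left out). The function name is hypothetical, and the approach is only one simple way to spot a verbatim copy.

```python
from difflib import SequenceMatcher


def longest_shared_run(original, suspect):
    """Return the longest run of consecutive words found in both texts."""
    a = original.split()
    b = suspect.split()
    # autojunk=False stops difflib from ignoring very common words in long texts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])
```

A shared run of a few words means nothing. A shared run of several whole paragraphs that is much longer than anything in your PAD file is a good sign your page text was lifted.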
Oh, and in case you think I’m upset over nothing because it’s just more advertising for my software: look carefully at Softpedia’s pages. Below my text (which they stole from me) are links to competing products under “related downloads”. They ripped off my text and are using it to promote other people’s software!
Update: Google’s index catches up.
As I optimistically expected, Google’s algorithm has caught up and is now indexing the main page of GetEmDone.com. That’s good.
But does this still have a negative effect on authors who have their website content scraped and stolen by Softpedia? Consider this from Google:
“Google tries hard to index and show pages with distinct information… This filtering means, for instance, that if your site has a “regular” and “printer” version of each article, and neither of these is blocked in robots.txt or with a noindex meta tag, we’ll choose one of them to list. In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.” (http://www.google.com/support/webmasters/bin/answer.py?answer=66359)
I think Google’s algorithm is smart enough to realize which content is the duplicate — after a while. But if you are an author, do you want Softpedia scraping content off of your web pages and reducing the uniqueness of that text?