Penguin Cases of Interest: Twin Studies

Like a lot of folks in the SEO world, I’ve been looking at sites which were “hit” by Google’s Penguin update.

At this point, I have done a detailed analysis of the first 7 cases (involving a total of 9 sites) on my list. In six of the seven, I have found what I would call “obvious” on-site issues. Things like:

  1. Hidden text. Not necessarily “manipulative,” but still, hidden text. 3 of the 6 positive cases have hidden text that was easy for me to find; in no case was it clearly “manipulative,” but it was clearly hidden.
  2. Alt and title attribute abuse – keywords stuffed into the alt and title attributes of images, and into the title attribute of anchor tags (links).
  3. “Holy cow” copy at the bottom of the home page and/or other pages, up to and including all pages. By “holy cow” I mean lengthy blocks of copy, stuffed with keywords and links, often in smaller type, gray text, etc.
  4. Hidden or disguised links – as in, the only way to know it’s a link is to mouse over the text.
  5. Blogs and “article” sections full of terrible content, stuffed with keyword links – where the links may or may not be disguised.

Since all of these things should be cleaned up regardless of whether the Penguin even cares about them, it’s been easy for me to advise the owners of those sites to clean that mess up. If you’re spamming, don’t wait for Google to tell you about it.

This leaves me (so far) with 3 very interesting cases…

  1. Two cases of “twin” sites – same owner, one site is “hit” by Penguin, and the other is not.
  2. One case where the site appears, on the surface, to be a “false positive” in terms of on site issues.

“Twin studies” are interesting in science and medicine, because studying twins (where you know the DNA is the same) helps distinguish between what is caused by genetics (nature) and what is caused by environmental factors (nurture).

In the case of the Penguin update, “twin sites” may tell us a lot about what’s going on, and what has changed in the environment. So let’s take these one at a time…

Case #1: The Hidden Text Case
Pretty simple – they have two sites in slightly different niches of the same larger market. The sites are linked together.

One of the sites has a significant amount of hidden text in the template, the other does not. The site with the hidden text lost significant traffic and rankings on April 24/25, the site without hidden text did not. The hidden text in this case involves the use of CSS to drop an image on top of the text.

This hints that hidden text is a bad idea. As if we needed a hint.
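Detecting that particular trick programmatically is hard – you need a rendering engine to know the image actually covers the text. Cruder hidden-text patterns can be flagged with a quick scan of inline styles, though. Here’s a minimal sketch in Python, assuming the requests and BeautifulSoup (bs4) libraries are installed and using a placeholder URL; the patterns are illustrative, not exhaustive:

```python
# Heuristic scan for crude hidden-text patterns in inline styles.
# This will NOT catch the CSS image-overlay trick described above --
# that requires a rendering engine -- but it flags common offenders.
import re
import requests
from bs4 import BeautifulSoup

SUSPICIOUS = [
    r"display\s*:\s*none",
    r"visibility\s*:\s*hidden",
    r"text-indent\s*:\s*-\d{3,}",       # e.g. text-indent: -9999px
    r"font-size\s*:\s*0(px)?\s*(;|$)",  # zero-size text
]

def flag_hidden_text(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Only inline styles are checked; external stylesheets need more work.
    for tag in soup.find_all(style=True):
        style = tag["style"].lower()
        for pattern in SUSPICIOUS:
            if re.search(pattern, style):
                snippet = tag.get_text(" ", strip=True)[:60]
                print(f"{pattern} on <{tag.name}>: {snippet!r}")

flag_hidden_text("http://www.example.com/")
```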

Case #2: Gray-on-Gray vs. Black, White, and Blue all over.
Similar to case #1 – two sites, same market, different niches. The sites are not linked together, but the domain registration info is identical, as is the physical address on each site, so it’s no secret that they are owned by the same company.

One of the two sites has “holy cow” copy at the bottom of the home page, with the text in gray on a (lighter shade of) gray background – #666666 on #cccccc, if you care. The links stuffed into the holy cow copy are the same color as the rest of the text and not underlined; the only way a normal human visitor would know it was a link is by mousing over it. This site lost just under 40% of its referral traffic from Google at the Penguin update.

The second site has very similar “holy cow” copy, but the text is in black (#000000) on a white background (#ffffff), and the links are unstyled – that is to say, the links are blue and underlined, and obviously links. This site is up 3% since Penguin, but a sampling of rankings indicates no change, so this is simply a natural seasonal increase that both sites would probably have experienced had Penguin not occurred.

This hints that disguising links is bad. As if we needed that hint.
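For what it’s worth, there is a standard way to put a number on “gray on gray”: the WCAG contrast-ratio formula. I make no claim that Google computes anything like this – it’s just a way to quantify how readable that holy cow copy actually was. A minimal sketch in Python:

```python
# WCAG 2.0 contrast ratio between two hex colors -- one way to put a
# number on "gray on gray". (No claim that Google uses this formula.)
def luminance(hex_color):
    """Relative luminance per WCAG 2.0."""
    rgb = [int(hex_color.lstrip("#")[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    lin = [c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
           for c in rgb]
    return 0.2126 * lin[0] + 0.7152 * lin[1] + 0.0722 * lin[2]

def contrast(fg, bg):
    lighter, darker = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

print(round(contrast("#666666", "#cccccc"), 2))  # ~3.58, below the 4.5:1 WCAG AA minimum
print(round(contrast("#000000", "#ffffff"), 2))  # 21.0, the maximum possible contrast
```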

Case #3: The prolific link spammer…
I mention this one because I simply haven’t been able to find any big issues with the site itself. It is possible that they are cloaking, and not telling me about it, because people sometimes do stupid things and lie to their doctor about what they have done. This site’s Google organic referrals are down by about 26% in the post-Penguin week vs. the week prior*. It’s very likely that I’ll find something “on site” once I dig a little deeper.

This site does have a pretty aggressive inbound linking profile – that is to say, a ridiculously implausible number of links coming in with anchor text precisely matching the queries where they were most affected. Whether this means that they have been “penalized” for over-optimizing their inbound links, or simply that part of Penguin is giving such links less weight in ranking, we can’t say. The latter seems more likely.

The truth is that we don’t even know if inbound links have anything to do with the Penguin thing. I know what you are reading out there. I read it too. Most of what’s being written is simply one person parroting what another person said.

The best bit of evidence about inbound links is inconclusive but highly informative, once you understand who the data was collected from, and how. Those who spam aggressively off-site are likely spamming aggressively on-site as well; correlation is not causation, and tails do not wag dogs.

That doesn’t mean the data from Micro Site Masters is useless – far from it, it’s a strong hint that we should look at more cases like case #3 – and it gives us better ideas on hypotheses to test.

Sorry, no “miracle cure” today…

Anyway – sorry I don’t have a “7 steps to fix your problem” ready just yet. I don’t think it would be particularly responsible to post something like that, when we are all still trying to work out what’s happening.

However, given the number of positive cases where there was pretty obvious spam on the site, those who are affected might do well to consider whether their sites really represent “best practices,” or “what we thought we could get away with.”

We’ll have more for you soon – thanks for reading!

– Dan Thies

* At any change event, you need 7 days’ worth of data to have sufficient confidence about which search terms have been affected. We often see little bumps and bounces in rankings and search volume throughout the week, and it’s far more productive to analyze SERPs that we know have changed.
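To make that week-over-week comparison concrete, here’s a minimal sketch in Python – the daily figures are fabricated, chosen to mirror case #3’s roughly 26% drop:

```python
# Week-over-week comparison around a change event: the 7 days starting
# at the event vs. the 7 days before it. All numbers are made up.
from datetime import date, timedelta

def week_over_week(daily_visits, event_day):
    """daily_visits: dict mapping date -> organic search visits."""
    before = sum(daily_visits.get(event_day - timedelta(days=d), 0)
                 for d in range(1, 8))
    after = sum(daily_visits.get(event_day + timedelta(days=d), 0)
                for d in range(7))
    return (after - before) / before * 100

penguin = date(2012, 4, 24)
visits = {penguin + timedelta(days=d): (1000 if d < 0 else 740)
          for d in range(-7, 7)}
print(round(week_over_week(visits, penguin), 1))  # -26.0
```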

UPDATE: Just to be very clear – although I’m not convinced that inbound link spam is a factor in Penguin, it’s almost certainly a good idea to clean it up. If you’re going to clean up a bunch of link spam, document what you do – the links you were able to remove, the links you couldn’t remove and the reason why, etc. If you end up submitting a reconsideration request, it will go a lot better, and a lot faster, if you can provide a spreadsheet detailing what you’ve done.
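If a full spreadsheet feels like overkill, even a simple CSV does the job. A sketch – the column names and example rows are just a suggestion, not anything Google prescribes:

```python
# A simple cleanup log of the kind described above. The columns and
# example rows are only a suggestion; adapt them to your own records.
import csv

FIELDS = ["linking_url", "anchor_text", "status", "notes"]
rows = [
    {"linking_url": "http://spammydirectory.example/page1",
     "anchor_text": "cheap blue widgets",
     "status": "removed",
     "notes": "Webmaster took it down after an email request"},
    {"linking_url": "http://deadforum.example/thread42",
     "anchor_text": "buy widgets online",
     "status": "could not remove",
     "notes": "No contact info; site appears abandoned"},
]

with open("link_cleanup_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```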

Comments

  1. If these sites were doing such obvious spamming on-site, imagine what they were doing off-site.

  2. Dan,
    Publish those 7 sites that were affected by penguin here and I will add them to my case study and let you know what I find by tomorrow. I don’t publish results so this would be private info I would share.

    • Hi Josh – confidentiality means that we don’t share the identifying information with others. Even others who promise to keep a secret. I will probably have some non-confidential examples in the next round that I could share with you.

  3. Dan, I’m totally with you on that.
    Had a chance to look at a couple of “innocent” sites – nothing to be proud of.
    If people don’t engage with their audience and just build to receive – I wouldn’t want them to rank. Why would Google?

  4. Thanks for posting this…and thanks for the color codes as well. :) I went and had a look…and I think that if using #666666 text on a #cccccc background is going to be considered spam, then designers might be out of business soon…don’t you think?

    I made a couple of images using those codes:
    http://www.golod.com/images/color_test.jpg <- Big
    http://www.golod.com/images/color_test2.jpg <- Little

    If that is true, it seems like a pretty slippery and subjective slope.

    • I agree Jason – Google may be on a very slippery slope, although they likely don’t care, since they’re only about making the top ten results a little bit better than before, on average.

      I reported on the color, but I don’t think that’s the issue. It’s more likely that the issue (if there even is one) would be with the styling of the links to be identical to the surrounding text. That is, the links are also #666666, not underlined, etc.

      Look at the way email spam scoring systems like SpamAssassin work. A whole bunch of factors, which individually don’t indicate spam in every case, are combined to make a score, based on which an email can be flagged as spam. You have a link that says “click here,” well, sometimes spam says that, so you get a point against you. What? You’re using HTML in your email? Spammers do that – 2 points!
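      To make that concrete, here’s a toy version of that kind of additive scoring in Python. The rules and point values are invented for illustration – SpamAssassin’s real rule set is far larger – but the mechanism is the same: no single test condemns a message, the sum does.

      ```python
      # Toy additive spam scoring. Rules and weights are made up for
      # illustration -- the point is that no single rule decides.
      RULES = [
          ("says 'click here'", lambda msg: "click here" in msg.lower(), 1.0),
          ("is HTML email",     lambda msg: "<html" in msg.lower(),      2.0),
          ("all-caps opener",   lambda msg: msg.split("\n")[0].isupper(), 1.5),
      ]

      def spam_score(msg, threshold=3.0):
          score = sum(points for _, test, points in RULES if test(msg))
          return score, score >= threshold

      msg = "BUY NOW\n<html><body>Click here for a great deal!</body></html>"
      print(spam_score(msg))  # (4.5, True) -- no single rule did it; the sum did
      ```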

      I doubt what Google is doing is quite as crude as SpamAssassin, but clearly it doesn’t spot all the spam. Just look at some SERPs to see that some completely horrible sites have bounced up into the top 10. Likewise, it’s clear that some very good web sites are dropping out of the top 10 results.

      It’s hard to argue that the SERPs at Google have gotten objectively better. I’m sure that the “quality rater” system they use gave this version of the algorithm a better score than the old algorithm, but that may simply mean that their rating system is flawed.

      I’m just trying to get at “what they’re doing.”

    • Here’s how the conversation went with the owner of that site, btw…

      Owner: Google took away our rankings with Penguin. Clearly this is not our fault, and it is a grave injustice.
      Me: I agree that it is a grave injustice. Were you doing any keyword-link-stuffing on the site? That seems to be a recurring theme so far.
      Owner: No, all of the keyword links that I use are for navigation, to help users.
      Me: Okay, what about all this text at the bottom of the page, why is it in gray on gray, and the rest of the site is black on white?
      Owner: Well, we didn’t think anyone would want to read it, so we put it in gray at the bottom of the page, and made the text smaller.
      Me: So… why are the links styled to look exactly like the rest of the text that you say nobody is going to read?
      Owner: Because we didn’t think anyone would want to click them.
      Me: Um… So… Are these links there for search engines, or for users?
      Owner: They’re navigation for users.
      Me: But the users can’t see them.
      Owner: Oh, right… so you think that might be the problem?
      Me: Yes, it might be the problem. We don’t know for sure. But it won’t hurt anything to at least make the links visible to users.

      We will see what happens.

      • LOL. That is good stuff.

        I am going to give it another week to see what, if anything, shakes out of this. Kinda like the 3 to 4 year Google storm.

        • Yeah – it’s way too early to know, and they’re clearly going to have to make some changes. If I had to guess, I’d say this one will be easier to understand than Panda. There’s a lot going on right now, but I don’t think Google can manage this level of change for an extended period.

          We have some cases where the pattern looked really weird – big “week over week” loss on 4/25, smaller increase on 4/28 and on. Then we heard there was a new Panda version rolled out on 4/27.

  5. Pete Morris says:

    Nice to finally read a level-headed article about this!

    This all rings true in my case. I had 2 sites hit, both of which were built before I had any idea about SEO, and both optimised according to advice I received at the Warrior Forum! Both sites were keyword stuffed: too many keywords in the content and alt tags, hidden links, a stupid amount of site-wide anchor text etc etc etc.

    I hadn’t actually touched these sites in a long time. I discovered SEO Fast Start some time ago, and realised that I’d have to forget everything I’d learned about SEO and start from scratch! After they got hit, I had a good look at them and I’d forgotten just how spammy they were… Rather embarrassed!

    Looking at the first page of SERPs, it seems that a lot of sites in those niches were over-optimized, as they’ve all gone. A few smaller sites are now ranking, along with relevant articles from major publications. I have to say, overall, I think anyone searching within that niche will be better served by the content now being displayed by Google.

    Incidentally, zero change in traffic on my newer, better, and squeaky-clean sites.

  6. Dan,

    Great article. You mentioned, though, that you are “not convinced that inbound link spam is a factor”.

    But in their statement they said:

    “Tweaks to handling of anchor text. [launch codename "PC"] This month we turned off a classifier related to anchor text (the visible text appearing in links). Our experimental data suggested that other methods of anchor processing had greater success, so turning off this component made our scoring cleaner and more robust.”

    Do you mean, then, that you feel they are not penalizing for inbound links and the anchor text within them, but that it is more accurate to say they simply devalue those links?

    Our studies show that penalization based on inbound links does happen, and that they are not simply turning off a “classifier related to anchor text”.

    We have also observed these penalties happening with sites that have not done any unnatural link building – but have done offline promotions that have caused lots of viral activity and lots of incoming links.

    Carlos

    • Carlos,

      Almost everything they say in these monthly updates can be interpreted however you like.

      They could say those exact words if they had turned off a classifier that called some links spam based solely on reading the anchor text, after discovering that another method (say, ignoring the anchor text of low-value links) was doing a better job. They could also say those exact words if the opposite were true.

      There have been a lot of problems with the way Google handles anchor text, both within a site, and from other sites. Some of these problems we would call bugs, really. We know that they have been aware of these issues, but it’s a complex system – changing one piece can affect things in other areas. A lot of our tests around anchor text are returning different results the past few months, so it’s been an area of focus for them. Some of the bugs are still alive and well.

      I’m not convinced that inbound links are a factor in Penguin. They could be used as one of many signals, but you can find in the SERPs today:

      1. Sites that have insanely out-of-balance inbound link profiles which did poorly in Penguin.
      2. Sites that have perfectly natural inbound link profiles which did poorly in Penguin.
      3. Sites that have insanely out-of-balance inbound link profiles which did very well in Penguin.
      4. Sites that have perfectly natural inbound link profiles which did very well in Penguin.

      The data that Rob & MicroSiteMasters collected is interesting, and we got a little more info from him on that a couple of days ago. It appears that the majority of the sites in the sample are their own, which is about what I expected to hear. Even in that data, there were sites with ridiculously out-of-balance inbound links which were not affected.

      It might be helpful to think of this more like an email spam classifier such as SpamAssassin – no single factor marks an email as spam; it takes an accumulation of factors. So far, we’re well past 20 cases worked, and on-site factors are in play with all of them (yes, we did resolve that “mystery case,” and keywords were indeed being “stuffed”).

  7. Pete Morris says:

    Something just occurred to me. I’m using breadcrumbs at the top of each post – my WordPress theme links back to the Category page with the Category name as anchor text. It’s there to make navigation easier for the user, but I’m wondering if this is now dangerous post-Penguin?

    Might Google see it as an attempt to get a lot of keyword-rich anchor text?

    • Probably not, Pete, but it depends on what you’re doing. We’ve seen some pretty crazy cases.

      One blog had over 100 categories, all variations of a handful of keywords they were targeting. Each category had hundreds of little bitty posts. This was working just super for them before Penguin, even though the site is a complete and utter waste of electricity.

      In that case, the breadcrumb navigation contributes to the keyword stuffing because the entire site is basically designed to stuff keywords into links. But even without the breadcrumbs, the sitewide list of category links would be just as transparent in terms of keyword stuffing and keyword-link-stuffing. They’re removing the sitewide category links to see what happens, but making no improvements as to site quality since that would involve “actually making the effort to build a useful website.”

      In the “normal” case, you’d have a few categories, those categories would already have a sitewide link, the links to the categories contain the name of the category, and the TITLE on the category pages is either the category name, or the category name + the site name. All of which is pretty normal stuff. It’s possible to stuff keywords into the category name, but that’s not “normal” at all.

      The other issue that may crop up with breadcrumb navigation is using the “home” breadcrumb to stuff in keywords targeted for the home page. Again, not “normal” and requires some effort.

      • Pete Morris says:

        Thanks, Dan – sounds like the breadcrumbs on my site are the “normal” case, and they only go one category deep. There are no sub-categories, and each category is a completely different keyword, so I won’t worry about it.

        It’s a bit crazy out there at the moment! They’ve got me thinking twice about everything, which I suppose in the long run is probably a good thing in terms of creating well thought out sites.

  8. From what I’ve been reading it seems like so many people are up in a storm over Google’s recent updates.
    Yet it also seems that the only ones who need to worry are the ones who’ve been engaging in unscrupulous and manipulative techniques. When we look at the individual factors, we get nervous because the language used can be a bit ambiguous. For example, “Paid Links”. Many people participate as publishers in Advertising networks, where people pay for links…I don’t think they all need to run and hide (at least from my understanding).
    This write up really helped me to understand this a bit better, as the updates sure can be confusing!

  9. Dan, I use WordPress for my blog, and I’ve seen articles out there discussing plugins that add hidden text and/or links to the code. Just from looking at the code on my blog, I find it difficult to work out if I have anything installed that might be causing issues. Do you have any advice on how to work out if a plugin might be causing a problem? Thanks.

    • The easiest thing to do, Leigh, is to examine a text-only version of some representative pages from the site. It’s best to look at Google’s cached version, in case they’re cloaking. With WordPress, representative pages would be the home page, a blog post, a Page, and category & tag pages. You’re looking for any text or links that show up in the “text-only” version but don’t show up when you are looking at the site in a browser.
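      If you’re comfortable with a little Python, you can approximate the “text-only” view with a script. A minimal sketch, assuming the requests and BeautifulSoup (bs4) libraries and a placeholder URL – note it reads the live page, so you’d still want to eyeball Google’s cached copy by hand to rule out cloaking:

      ```python
      # Dump the text and links a plain HTML parser sees, to compare
      # against what the page shows in a browser. Reads the live page.
      import requests
      from bs4 import BeautifulSoup

      def dump_text_and_links(url):
          soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
          for tag in soup(["script", "style"]):
              tag.decompose()                       # drop non-visible code
          print(soup.get_text(" ", strip=True)[:2000])  # the "text-only" view
          print("--- links ---")
          for a in soup.find_all("a", href=True):
              print(a.get_text(strip=True) or "(no text)", "->", a["href"])

      dump_text_and_links("http://www.example.com/")
      ```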

      • Thanks Dan. So, I’ve gone and tried this, and I can’t see any links on the cached, text-only version that don’t appear on the normal version. However, there are plugins that display on the normal version but not on the text-only version. For example, I use the Shareaholic social sharing plugin. It produces a ton of code which includes links, but those links don’t show up in the text-only version. Should that be a concern?