Removing Spam from Google Analytics

By
sntc
2016-03-28
Google Analytics

To be honest, we tend to focus more on the marketing side of growth, and less on the other half – analytics. We can plan and execute dozens of campaigns but we won’t learn anything from that work if we’re not ready to track our results accurately. Ultimately, this means that our decisions can potentially be very wrong.

Here at ShuttleCloud, before we even push a new design into production, we first have to dedicate time to review our metrics and data collection for validity. Without the correct configurations to filter out unnecessary or incorrect information, our data will obviously be skewed.

After my colleague Eric Guo, who works primarily on our other product Gmail Meter, shared some articles on how to clean up Google Analytics data, I realized just how few resources exist on the subject. As we do this cleanup on our end, I thought it might be valuable to share some of what we’re learning here at ShuttleCloud.

A. Creating Multiple Views for our Google Analytics Data

We always want a backup plan, because everyone screws up at some point. So one of the first things that we do once we set up a new Google Analytics account is to duplicate the default view, creating a second view of the data.

By creating this second view, we ensure data retention before we start playing with filters and exclusion rules. This provides us with one view that has all of the fine-grained rules that we’ve added and another view that simply has the unfiltered, raw data. At any time, data between the two views can be compared to determine validity.

One additional step that we suggest is to even create a third view of the data and to use this third view as a testing environment for any new rules or filters you plan on implementing into the filtered view.

The three views of the data at ShuttleCloud’s Google Analytics account

Remember, though, any new view will only be able to collect data starting from the time it was created. Furthermore, any data that is filtered out will not be collected by that specific view. Therefore, it’s impossible to retroactively get that data even if the filter is removed from that view. That’s why it’s so important to set these additional views up immediately after account creation.

One last note about views: our Google AdWords and Google Analytics accounts are linked, but the AdWords data will not be properly imported into a new Google Analytics view unless we manually link the data to that view through the AdWords Linking settings found in the Admin panel.

B. Excluding internal traffic from Google Analytics

From here, our next goal is to exclude website visits from our own team members. These visits are irrelevant and they’re one of the main contributors to data skew. There are a couple ways to address this problem, and it may very well be worth it to implement both.

1. Exclude IP Addresses

One tactic is to use a filter that excludes any IP addresses of our employees, specifically:

  • The IP address of our Madrid office
  • The IP address of our Chicago office
  • And the IP addresses of any remote employees

Management of Filters can be found in the Admin settings for any View, and filters must be applied separately to each View. When adding a Filter, we always choose to Create new Filter with a Custom filter type. While Predefined filters can work for specific use cases, it’s better practice to get familiar with Custom filters. With Custom Filters, there are a number of choices but we primarily work with the Exclude or Include options.

In this example, we’re choosing Exclude and selecting IP Address as a Filter Field. The next step is to enter all the relevant IP addresses for our company (sorry, can’t share those!), separating each with a vertical bar “|”. The pattern is a regular expression, so you also have to add a “\” before each dot; otherwise the dot matches any character rather than a literal period. Remember that these IP addresses will have to be monitored regularly, as IP addresses are generally dynamic.

Example of Filter Pattern: 12\.23\.34\.45|12\.234\.45\.5
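Typing the backslashes by hand gets error-prone once the list grows, so here is a minimal Python sketch of how such a pattern could be generated. The function name is our own invention for illustration, and the addresses are the placeholder values from the example above, not real office IPs.

```python
import re


def build_ip_filter_pattern(ip_addresses):
    """Join IP addresses into one regex alternation with literal dots."""
    # re.escape puts a backslash before each dot, so the dot only
    # matches a real period instead of any character.
    return "|".join(re.escape(ip) for ip in ip_addresses)


# Placeholder addresses from the example pattern (assumption, not real IPs).
office_ips = ["12.23.34.45", "12.234.45.5"]
pattern = build_ip_filter_pattern(office_ips)
print(pattern)

# Why escaping matters: the unescaped pattern also accepts a different
# address, because "." stands for any single character.
unescaped_hit = re.fullmatch("12.23.34.45", "12.23.34145") is not None
escaped_hit = re.fullmatch(pattern, "12.23.34145") is not None
print(unescaped_hit, escaped_hit)
```

The resulting string can be pasted into the Filter Pattern field. Whenever a remote employee's address changes, we would only need to update the list and regenerate the pattern.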

2. Install the Google Analytics Opt-out Add-on

The other tactic, which is much simpler if the entire team uses Chrome, is to have everyone install the Google Analytics Opt-out Add-on, created and maintained by Google. This browser extension blocks the Google Analytics JavaScript from sending data back to Google Analytics. This is a great option, as either an alternative or a complement to the above strategy, but it only covers browsers where the add-on is installed, so visits from any other browser or device will still be counted.

C. Removing Ghost Spam and Other Irrelevant Traffic

In addition to excluding internal web visits, it’s also important to exclude external sources of irrelevant website traffic. There are two main types of spam that we want to remove from our Google Analytics data – Ghost spam and Crawler spam.

For today’s purposes, simply understand that these are both sources of irrelevant external traffic. Ghost spam is traffic from fake referrals created by people trying to get us to visit their site, while Crawler spam covers all of the traffic resulting from analytics and known web crawlers.

The latter, Crawler spam, is relatively easy to manage. There is an option to Exclude all hits from known bots and spiders found in the View Settings of the Admin panel.

While Crawler spam is actually fairly innocuous, Ghost spam, on the other hand, is actively aggressive and intentionally malicious. The trick with excluding them is that the referral traffic from Ghost spam is completely fake – the traffic never hits our site.

The easiest way to tell is to go to the Referrals report and apply a Secondary Dimension of Hostname, found under Behavior in the dropdown menu. This adds another column to our data, and from that column we’re able to easily identify any fake referrals because the hostname is not a valid one that we use.
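If the report is exported for review, the same check can be done offline. The sketch below is only an illustration: the rows, sources and hostnames are made up, and the three-column layout (referral source, hostname, sessions) is an assumption about how one might arrange the export.

```python
from collections import Counter

# Hypothetical rows from the Referrals report with Hostname added as the
# secondary dimension: (referral source, hostname, sessions).
rows = [
    ("partner-blog.example", "www.example.com", 120),
    ("news.example", "www.example.com", 45),
    ("free-traffic.example", "(not set)", 310),
    ("seo-offers.example", "fake-host.example", 95),
]


def sessions_by_hostname(report_rows):
    """Total the sessions recorded against each hostname."""
    totals = Counter()
    for _source, hostname, sessions in report_rows:
        totals[hostname] += sessions
    return totals


totals = sessions_by_hostname(rows)

# List hostnames from most to fewest sessions so that unfamiliar ones
# stand out during a manual review.
for hostname, sessions in totals.most_common():
    print(f"{hostname}: {sessions}")
```

Any hostname in the output that we do not recognise as one of our own is a candidate for Ghost spam.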

The process to create a filter to exclude this traffic uses the same basic premise. We selected a long date range in the report and recorded all of our valid hostnames, including translator sites and our own referral tools.

Once we had a comprehensive list of those, we applied a Custom Filter to Include only the traffic from these Valid Hostnames. While there are a few options for excluding Ghost spam, this is the simplest and also most effective way to do so. Detailed instructions for how to do this can get very involved, so please refer to the same guide we referenced above.
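To give a rough idea of what the Include pattern looks like, here is a small Python sketch. The hostnames are hypothetical stand-ins for a real list, the helper names are ours, and the sketch assumes the filter treats the pattern as a regular expression that may match anywhere in the hostname.

```python
import re

# Hypothetical valid hostnames; a real list comes from reviewing the
# Hostname column over a long date range.
valid_hostnames = ["example.com", "translate.googleusercontent.com"]


def build_hostname_pattern(hostnames):
    """Join hostnames into one regex alternation with escaped dots."""
    return "|".join(re.escape(name) for name in hostnames)


hostname_pattern = build_hostname_pattern(valid_hostnames)
print(hostname_pattern)


def is_valid_hostname(hostname):
    """Return True when the hostname contains one of the valid names."""
    # re.search allows a match anywhere, so "www.example.com" is accepted
    # through the "example.com" entry (assumed partial-match behaviour).
    return re.search(hostname_pattern, hostname) is not None


for candidate in ["www.example.com", "(not set)", "fake-host.test"]:
    print(candidate, is_valid_hostname(candidate))
```

Testing the pattern against a handful of known good and bad hostnames before saving the filter is worthwhile, because an Include filter that is too strict permanently drops real traffic from that view.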

D. Cleaning Spam out from our historical data

With Google Analytics, it’s impossible to remove traffic and web hits once that data has been collected. So to clean up our historical data, we opted to create a segment within our Google Analytics account that contains only the relevant data we need. In order to do so, we built a Segment using two Conditions:

  • One to Include only traffic to valid hostnames
  • Another to Exclude known crawler referral spam

Settings for Segments can be found under the View column within the Admin panel. We made sure to create a New Segment and added the above two filters as Conditions to the segment. The first condition is a Filter on Sessions to Include Hostnames that match the regex we used to filter out Ghost spam, as explained previously.

The second condition is a little more complex and actually requires two parts connected by an AND conjunction. For the first half, we selected Medium (found under Acquisition in the dropdown menu) and Excluded traffic that Exactly matches the term Referral. From there, we clicked the AND button to add the second parameter of the condition, which calls for Referral Paths that Matches the following regex:

(best|dollar|ess|top1)\-seo|(videos|buttons)\-for|anticrawler|^scripted\.|\-gratis|semalt|forum69|7make|sharebutton|ranksonic|sitevaluation|dailyrank|vitaly|video\-|profit\.xyz|rankings\-|dbutton|\-crew|uptime(bot|check|\.com)|responsive\-

This regular expression covers the best-known crawlers and bots, and can also be found in Escalera’s guide. If Crawler spam persists in spite of the View setting changes, we always have the option of using an expression like the above to manually filter out Crawler spam in a way similar to how we filtered out Ghost spam.
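To make the logic of the two conditions concrete, here is a hedged Python sketch of the decision the segment makes for each session. The crawler expression is the one from this section, written with balanced parentheses so that Python's re module accepts it. The hostname pattern, the function name and the sample values are hypothetical, and the case-insensitive comparison of the medium is our assumption.

```python
import re

# Crawler expression from this section, split across lines for readability.
SPAM_REFERRAL = re.compile(
    r"(best|dollar|ess|top1)\-seo|(videos|buttons)\-for|anticrawler"
    r"|^scripted\.|\-gratis|semalt|forum69|7make|sharebutton|ranksonic"
    r"|sitevaluation|dailyrank|vitaly|video\-|profit\.xyz|rankings\-"
    r"|dbutton|\-crew|uptime(bot|check|\.com)|responsive\-"
)

# Hypothetical stand-in for the valid hostname expression.
VALID_HOSTNAME = re.compile(r"example\.com")


def keep_session(hostname, medium, referral):
    """Return True when a session would remain in the cleaned segment."""
    # Condition 1: include only sessions recorded on a valid hostname.
    if VALID_HOSTNAME.search(hostname) is None:
        return False
    # Condition 2: exclude sessions whose medium is referral AND whose
    # referral value matches the crawler expression.
    is_referral = medium.lower() == "referral"
    is_spam = SPAM_REFERRAL.search(referral) is not None
    return not (is_referral and is_spam)


samples = [
    ("www.example.com", "referral", "semalt.com"),
    ("www.example.com", "referral", "news.ycombinator.com"),
    ("www.example.com", "organic", "google"),
    ("fake-host.test", "referral", "news.example"),
]
for hostname, medium, referral in samples:
    print(hostname, medium, referral, keep_session(hostname, medium, referral))
```

Both halves of the second condition have to be true for a session to be dropped, which is why the AND matters: a legitimate referral that happens not to match the expression is kept.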

Ultimately, this is simply a starting point for us as we need to regularly check our Analytics account for new instances of fake traffic. This way, we can identify whenever there are new sources of spam and stay on top of filtering this spam out accordingly.

In addition, this also coincides well with our practice of regularly reviewing IP addresses to exclude new internal traffic. Managing spam in Google Analytics is really a continuous process, requiring a bit of patience and a lot of specificity.
