Google Analytics + R = FUN!

The scope of this post is to show how simple it is to get data out of Google Analytics and create your own reports (which you hope can be at least semi-automated) and your favourite statistical graphs (those that GA is currently missing). As you already know, R is a favourite tool of mine, so it will be the main tool to get the data, reshape them and plot them. You will need elementary knowledge of the R language, and you’ll soon realize that a Google search covers more than 40% of the polishing your code will ever need…

R packages

There are two packages (or libraries) that connect to the Google Analytics API and return data to you: the older one is RGoogleAnalytics and the new champion is rga. Both are excellent and I’ve used both on many occasions. rga seems a bit nicer, but RGoogleAnalytics is certainly more robust and works in all situations. Apart from these, I will use ProjectTemplate for my personal organisation (you won’t see it, however) and ggplot2 for graphics.

Google Authentication

First of all, go to Google’s API Console and, after making sure that the Google Analytics service is enabled, create a new API application.

<Pause>

Because the lengthy scripts would ruin the flow of the post, I have created a GitHub repo where all the scripts reside [zip]

Ready,Set,Goooo!

Now that API access is set up on one side, we should make the connection to R. You should already know that RCurl is a bit tricky, as I have outlined in the A tiny RCurl headache note. The solution proposed there is applied here as well. Note that this issue will be solved in the next release of rga. RGoogleAnalytics, on the other hand, seems to have this sorted already. Keep in mind that using

ssl.verifypeer = FALSE

isn’t the most secure way to do network communication in R. You can use the following to create a connection to the API: [rga_initiate_API_connection.R]. This is heavily copy-pasted from Randy Zwitch’s post (not provided): Using R and the Google Analytics API.
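On the ssl.verifypeer point, a commonly used alternative is to keep verification on and point RCurl at a certificate bundle instead. The following is only a sketch: it assumes that RCurl is installed and ships its CA bundle under CurlSSL/cacert.pem, and it sets the RCurlOptions option, which RCurl reads for its default curl settings.

```r
# Assumption: RCurl is installed and ships a CA bundle at CurlSSL/cacert.pem.
# system.file() returns "" when the file (or the package) cannot be found.
ca_bundle <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

if (nzchar(ca_bundle)) {
  # Keep peer verification on and tell curl which certificates to trust.
  options(RCurlOptions = list(cainfo = ca_bundle, ssl.verifypeer = TRUE))
} else {
  message("No CA bundle found; RCurl options left unchanged.")
}
```

If the bundle is found, every later RCurl call in the session picks up these defaults, so the connection script itself does not need to change.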

One issue is how to get the Profile IDs. The hard way would be to go to the Query Explorer, cycle through all profiles and write down the IDs that you are interested in. However, you are in luck, as there is a function that returns all the profiles that the account tied to your API access can reach. (BTW, it is excellent that the rga package provides access to the Management API.)
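Once you have such a listing, picking the IDs is plain data frame work. The sketch below uses a made-up data frame in place of the real listing; the column names (id, name), the profile IDs and the call mentioned in the comment are assumptions for illustration, so check them against what the package actually returns for your account.

```r
# Stand-in for the profile listing. With a live connection this would come
# from the package, e.g. something like ga$getProfiles() (assumption, not run).
profiles <- data.frame(
  id = c("11111111", "22222222", "33333333"),  # made-up profile IDs
  name = c("Blog - all traffic", "Blog - test", "Shop - all traffic"),
  stringsAsFactors = FALSE
)

# Keep only the profiles whose name matches the site of interest.
ids <- profiles$id[grepl("^Blog", profiles$name)]
ids
```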

In what follows I will assume that you have defined the ids that you are interested in.

The main hypothesis that I want to get a taste of is whether the different post categories (e.g. measure, statistics, music etc.) have different load times. This is interesting given that the categories don’t all carry the same burden when loading (images vary, YouTube video scripts). To achieve this you will need a filters vector to loop over. Give appropriate names in the page.group vector and you are done. Note that we have created some extra metrics:

  • e-commerce rate : this is not meaningful in the case of a blog, but if you are advanced in analytics you might have implemented goals as e-commerce events as B. Clifton suggests.
  • bounce rate : the bounce rate should be correlated to the page load
  • buckets of page load time : we use a 4-second range for each bucket to be consistent with the Apdex standard.
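The loop over the filters vector and the derived metrics can be sketched as follows. This is not the script from the repo: the filter expressions, the column names and fetch_stub() (which stands in for the real API call so the example runs without credentials) are all assumptions.

```r
# Hypothetical filters and labels; the real ones depend on your URL scheme.
filters <- c("ga:pagePath=~^/measure/",
             "ga:pagePath=~^/statistics/",
             "ga:pagePath=~^/music/")
page.group <- c("measure", "statistics", "music")

# Stub standing in for the API call; returns the same fake rows for any filter.
fetch_stub <- function(filter) {
  data.frame(
    date = as.Date("2013-06-01") + 0:2,
    visits = c(40, 55, 30),
    bounces = c(22, 30, 12),
    avgPageLoadTime = c(2.5, 7.9, 13.2)
  )
}

# One query per filter, each labelled with its page group.
results <- vector("list", length(filters))
for (i in seq_along(filters)) {
  chunk <- fetch_stub(filters[i])
  chunk$page.group <- page.group[i]
  results[[i]] <- chunk
}
final_dataset <- do.call(rbind, results)

# Derived metrics: bounce rate and 4-second load-time buckets.
final_dataset$bounce.rate <- final_dataset$bounces / final_dataset$visits
final_dataset$load.bucket <- cut(
  final_dataset$avgPageLoadTime,
  breaks = c(0, 4, 8, 12, 16, Inf),
  labels = c("0-4", "4-8", "8-12", "12-16", "16+"),
  right = FALSE  # intervals are [0,4), [4,8), ...
)
head(final_dataset)
```

Using cut() keeps the bucket edges in one place, so changing the bucket width later is a one-line edit.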

Because I want to get more metrics than a single query allows (11), I use another query in the loop to get the rest and then merge them. Now, if you run all these scripts you will have a data frame like this, extracted at the end of the script using the head() function.
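The merge step is worth a small illustration. In this sketch the two data frames are fake and their column names are only illustrative; the point is that both queries share the same dimensions, so those can serve as the join keys.

```r
# Two fake result sets sharing the same dimensions (date and page group).
first_query <- data.frame(
  date = as.Date("2013-06-01") + 0:2,
  page.group = "statistics",
  visits = c(40, 55, 30),
  bounces = c(22, 30, 12)
)
second_query <- data.frame(
  date = as.Date("2013-06-01") + 0:2,
  page.group = "statistics",
  pageLoadSample = c(5, 8, 3),
  avgPageLoadTime = c(2.5, 7.9, 13.2)
)

# Join on the shared dimensions so each row carries the metrics of both queries.
merged <- merge(first_query, second_query, by = c("date", "page.group"))
head(merged)
```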

Enough with scripting!

Now that the data are in our console we can finally get some graphs. The following histogram is the aggregated page load speed histogram of this blog. You should note that a significant volume of sample units belongs to the 12-16 bucket. I suspect that they also belong to a specific country group, as the host provides good page load timings in the US and Western Europe. (Note to myself: I should add the ga:country dimension in the second query run.)
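A chart of this kind can be sketched with ggplot2 along these lines. The data here are randomly generated rather than taken from GA, and the column names are made up; the plot is only drawn if ggplot2 is installed.

```r
# Fake load times in seconds, bucketed in 4-second ranges.
set.seed(1)
load_times <- data.frame(seconds = runif(200, min = 0, max = 20))
load_times$bucket <- cut(load_times$seconds,
                         breaks = seq(0, 20, by = 4),
                         include.lowest = TRUE)

# Counts per bucket (base R), handy as a check on what the plot shows.
bucket_counts <- table(load_times$bucket)
bucket_counts

# Draw the bar chart only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(load_times, ggplot2::aes(x = bucket)) +
    ggplot2::geom_bar() +
    ggplot2::labs(x = "Page load time (seconds)", y = "Sample units")
  print(p)
}
```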

[Figure: Rplot, aggregated page load speed histogram]

OK. This is not a nice picture at all! I know that I have experimented with various analytics scripts in the last months, plus in the first 3 months of 2013 I was using a significantly heavier WordPress theme, but I still think the sample is skewed by the geographic distribution of the readers (a new post will come soon on this!).

[Figure: Rplot01]

Extend the script to your needs

In a modification of the script above I can loop over the web properties that I have access to, so I use R to store data and create a roll-up report quickly. If you look at the comments in the scripts you will notice the following.

# In the future we should only get data for increment dates. Don't we?
get.start.date<-min(final_dataset$date)

I use this to incrementally query and store data in the final_dataset data frame (this helps with the sampling that I would run into when running the script over a long period of time for the first time). I am pretty sure a cron job could be set up here, however I have no idea about cron jobs…
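One possible shape for the incremental update is sketched below. Note that the quoted line takes min() of the stored dates, i.e. the earliest one; this sketch assumes instead that a new run should start the day after the latest stored date, which is my reading of the comment and not necessarily what the repo does. fetch_stub() is a hypothetical stand-in for the API call.

```r
# Data already stored from earlier runs (fake values).
final_dataset <- data.frame(
  date = as.Date("2013-06-01") + 0:4,
  visits = c(40, 55, 30, 61, 47)
)

# Assumption: new queries start the day after the last stored date.
next.start.date <- max(final_dataset$date) + 1
end.date <- as.Date("2013-06-08")  # in practice, e.g. Sys.Date() - 1

# Stub for the API call; returns one fake row per requested day.
fetch_stub <- function(start, end) {
  days <- seq(start, end, by = "day")
  data.frame(date = days, visits = rep(50, length(days)))
}

# Append only the days that are not stored yet.
if (next.start.date <= end.date) {
  final_dataset <- rbind(final_dataset, fetch_stub(next.start.date, end.date))
}
range(final_dataset$date)
```

Because each run asks only for the missing days, the queries stay short, which should also keep them further away from sampling.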

Head now to https://github.com/IronistM/R_Google_Analytics !

5 thoughts on “Google Analytics + R = FUN!”

  1. How about sampling? Is there an easy way to eliminate it (using sub-queries as in Analytics Canvas, for example)?

    1. Hi Petr!

      You can use the walk attribute, which will ‘walk’ through the data set day-by-day. The result will be unsampled data (set batch to TRUE to require ALL data).

      So the query will be

      ga$getData(ids[1], batch = TRUE, walk = TRUE,...)
