Monday, November 27, 2017

Optimize your marketing efforts with the power of raw data

If you're in marketing (I'm not), you want to know how much bang for your buck your campaigns are producing. Now, I'm no Google Analytics expert, but I'll make a claim: you too probably have an easy time measuring conversions across your campaigns, but not so much how your sign-ups differ between them. Is one campaign bringing in higher-quality people than the others?

If you are running a web shop or some other transaction-based business, then sure, you can set up Google Analytics Ecommerce to follow up on customer lifetime value. But what if you are running a SaaS and bill monthly? Or you have a freemium model and hope to monetize high-volume users through additional features: how would you compare your campaigns then? With only Google Analytics, it is tough to analyze how the users coming in through different marketing campaigns differ from each other.

What you need is raw data: the path each of your users has taken as they sign up to your service, whether paid, direct, organic or other. But Google doesn't give out this information easily. You could invest in Google Analytics 360, which costs perhaps $150 000 a year. I love the quote they've selected to highlight, by the way:
“Companies spend a lot of money on marketing, it’s difficult to understand which dollars are working and which dollars aren’t. The Google Analytics 360 Suite allows you to actually understand which half is working.”
(Shouldn't Google Analytics be able to do that already? No? Okay, my bad.) If you're unwilling to spend the $150 000 on 360, there is also a hack for querying raw data from Google Analytics. I don't particularly like this approach, as it still depends on Google allowing it.

You can, however, simulate Google Analytics behavior with your own code and do whatever you wish with the data. Especially if your landing pages are on a different subdomain than your service, you might be more successful using a cookie instead of local storage.

The workflow is simple:
  1. Store your data in the user's browser as they first arrive at your site. At that point you have access to data such as the referring URL. This allows you to calculate almost everything, except e.g. auto-tagged AdWords campaigns (as of November 2017).
  2. Once your user successfully signs up to your service, you will need to fetch the marketing traffic data from the user's browser and
  3. Send the data off to somewhere, along with the ID of the new user.
With a simulated GA cookie script along these lines, you can deduce quite a bit of information from the referrer URL, such as source and medium, and, if the URL is UTM-tagged, also campaign, term and content. For debugging purposes, it might be helpful to store the full referrer URL as well.
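As a concrete sketch of those three steps, here is what the capture logic could look like in plain JavaScript. Everything here is illustrative: parseCampaignData, buildAttributionPayload and the field names are made up for this example (not part of any library), and the referrer classification is deliberately simplistic.

```javascript
// Step 1: on first arrival, derive campaign data from the landing URL
// and the referrer. UTM parameters win; otherwise classify the referrer.
function parseCampaignData(landingUrl, referrer) {
  var params = new URL(landingUrl).searchParams;
  var data = {
    source: params.get('utm_source'),
    medium: params.get('utm_medium'),
    campaign: params.get('utm_campaign'),
    term: params.get('utm_term'),
    content: params.get('utm_content'),
    referrer: referrer || null // keep the raw referrer for debugging
  };
  // No UTM tags? Fall back to classifying the referrer.
  if (!data.source) {
    if (!referrer) {
      data.source = '(direct)';
      data.medium = '(none)';
    } else {
      data.source = new URL(referrer).hostname;
      data.medium = 'referral';
    }
  }
  return data;
}

// Steps 2 and 3: at sign-up, read the stored value back and combine it
// with the new user's id. In the browser you would first decode the
// cookie with JSON.parse(atob(cookieValue)).
function buildAttributionPayload(userId, campaignData) {
  return Object.assign({ userId: userId }, campaignData);
}
```

On the landing page you would persist parseCampaignData(location.href, document.referrer) into the cookie, and on sign-up POST the payload to your own backend.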

We ended up storing the cookie value as a base-64 JSON string. In the below example our valueArray would contain information such as source, medium, campaign, term and content.
// when saving the array to the cookie
cookieValue = btoa(JSON.stringify(valueArray));

// when reading the cookie
valueArray = JSON.parse(atob(cookieValue));
Remember to save the cookie with the domain prefixed by a dot for it to be accessible on all your subdomains (also note that an expiry date in the past would delete the cookie):
// replace .yourdomain.com with your own root domain
document.cookie = "cookiename=" + cookieValue +
    ";expires=Sat, 17 Nov 2018 00:00:00 GMT;domain=.yourdomain.com;path=/";
If you prefer something more modern than cookies, this library from Zendesk should help you with local storage across subdomains (I haven't tried it, though).

Once the data is in your analytics database, you will be able to analyze what kind of users you are getting from the different marketing sources and double down on those that bring in higher-quality sign-ups.

It's probably easiest to implement this logic in Google Tag Manager. Have fun! Let me know if you're planning to try this and how it went!

Calculating active users and churn in a database

Have you ever struggled with calculating your active users per day, or month? How about your new activations or churned users? It's a common problem, but often the first solution you find will not scale.

You should avoid subqueries and use a window function instead. But that's only part of the solution.

Assume your definition of an active user is "someone who has signed in within 7 days" and you have the login data in a table with the columns user_id and timestamp. The first thing you want to do is calculate when a user activates for the first time, when they churn and when they resurrect.

-- compare each user's logins in timestamp order;
-- saved as a view so the later queries can build on it
CREATE VIEW "vw_login" AS
SELECT "user_id"
     , "timestamp"
     , LEAST(LEAD("timestamp") OVER w, "timestamp" + INTERVAL '7 days') AS "activeExpiry"
     , CASE WHEN LAG("timestamp") OVER w IS NULL THEN true ELSE false END AS "activated"
     , CASE
         WHEN LEAD("timestamp") OVER w IS NULL THEN true
         WHEN LEAD("timestamp") OVER w - "timestamp" > INTERVAL '7 days' THEN true
         ELSE false
       END AS "churned"
     , CASE
         WHEN LAG("timestamp") OVER w IS NULL THEN false
         WHEN "timestamp" - LAG("timestamp") OVER w <= INTERVAL '7 days' THEN false
         WHEN row_number() OVER w > 1 THEN true
         ELSE false
       END AS "resurrected"
FROM "login"
WINDOW w AS (PARTITION BY "user_id" ORDER BY "timestamp");
With this, you've added three boolean columns to your login data: activated, churned and resurrected. They tell you when a user activates for the first time, after which login they churn and which login makes them resurrect. Now your data is in a format from which you can calculate active users for any date.
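To make the three flags concrete, consider a hypothetical user with three logins (timestamps shortened to dates for readability). The query above would flag the rows like this:

```sql
-- Hypothetical login data for a single user:
INSERT INTO "login" ("user_id", "timestamp") VALUES
  (1, '2017-11-01'), (1, '2017-11-03'), (1, '2017-11-20');

-- Resulting flags, with a 7-day activity window:
--
--   timestamp    activated | churned | resurrected
--   2017-11-01   true      | false   | false   -- first ever login
--   2017-11-03   false     | true    | false   -- next login is 17 days out
--   2017-11-20   false     | true    | true    -- resurrects after the gap,
--                                              -- and churns as the last login
```

Note that a user's most recent login is always flagged as churned; they stop counting as active once the 7 days run out.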

To calculate your daily active users you can apply the following query:
SELECT SUM(SUM(activated::int - churned::int + resurrected::int)) OVER w
 , "timestamp"::date FROM vw_login
 GROUP BY "timestamp"::date
 WINDOW w AS (ORDER BY "timestamp"::date)
 ORDER BY "timestamp"::date;
Hold on. No you can't.

For churn, you also need to take the time shift into account: a row indicates that the user churns after that login. The day the user becomes inactive is 7 days after the login, not the day of the login itself. To accomplish this efficiently, it's best to first create a view that aggregates the data to a daily level and then do a join.

-- Aggregate daily
CREATE VIEW "vw_activity" AS
SELECT SUM("activated"::int) "activated"
  , SUM("churned"::int) "churned"
  , SUM("resurrected"::int) "resurrected"
  , "timestamp"::date "date"
FROM "vw_login"
GROUP BY "timestamp"::date
For the running total, you should first generate a time series to ensure that every date you join the churn figures to actually exists. Also take care to wrap the joined values in COALESCE: if any part of the expression is NULL (a day with no activations, say), the whole row's contribution becomes NULL and is silently dropped from the sum.
-- Calculate a running total
SELECT SUM(COALESCE(a."activated"::int, 0)
   - COALESCE(a2."churned"::int, 0)
   + COALESCE(a."resurrected"::int, 0)) OVER w
 , d."date", a."activated", a2."churned", a."resurrected"
 FROM generate_series('2004-03-07'::date, CURRENT_DATE, '1 day'::interval) AS d("date")
 LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
 LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '7 days')::date
 WINDOW w AS (ORDER BY d."date") ORDER BY d."date";
That's it. That will give you the number of active users per day and the number of users who activated, churned or were resurrected on that day. It's now up to you to figure out what your definition for an active user is and if there are dimensions you need to add. But the basic logic is clear enough.

Let me know if you found this useful!

Wednesday, November 22, 2017

Power BI, Tableau or QlikView

Ok ok I went along with Power BI but hear me out.

A year ago I joined Zervant to be responsible for the company's BI and analytics. Pretty soon I had to make a choice: which BI package to take into use. Before this, Zervant didn't use anything apart from Excel and a Geckoboard. From my previous positions at Nokia and Microsoft, I have experience with Tableau, QlikView and Power BI, in that chronological order. It quickly came down to a choice between these three. Or to be honest - between Tableau and Power BI.

Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime.

I'm a big believer in self-service BI. I would rather build an analytics team that enables others to solve their decision-making needs than one that hands out the right conclusions. Therefore, building something anyone in the company could use was one of the goals. Also, joining a start-up meant cost was a concern.

QlikView is, in my experience, complex to set up and maintain. I might be wrong, but I've always found QlikView a bit unapproachable. Powerful, yes, but too much of a commitment and an investment. I wanted something that was easy to take into use, and possible to change if we didn't get along. QlikSense Cloud has fortunately changed this, but I still get an allergic reaction from Qlik* (sorry). Maybe it's also the fact that I haven't seen great examples of QlikSense in actual use. Also, when searching for tips on QlikSense it becomes clear that its user community must be significantly smaller than those of the other two, which means less help when you run into issues. I believe I'll eventually end up giving QlikSense an honest try, but I'll need some very convincing reasons to switch. And I wouldn't do it for $25 / user per month.

I like Tableau: the way it looks, integration with R, scripting and the rest of the features. I actually prefer it to Power BI. But what I don't like is its price: $112 / month per user, when you need to connect to a PostgreSQL server and have someone else host the dashboards for you. I also often struggle with the way Tableau assigns my data to dimensions and measures. Back when I was making this decision Tableau would've meant shelling out 2500 € for a year's license, and I wasn't ready to make a commitment of that size just yet. $112 / month per user might be a different story.

I ended up choosing Power BI, and while I sometimes get dissatisfied comments on this decision from Tableau fans, I am completely happy with it. Power BI is on par feature-wise with Tableau. You don't have as much power over design, the DAX language is often infuriating, and I have only recently learned to tolerate being forced to use the on-premises data gateway even though our databases are also in the cloud. All these shortcomings are offset by what I am able to accomplish for the price: free now and $10 / month per user later. I also really appreciate the custom visuals, despite them often being quite buggy. But keep in mind that Mac users will not be able to develop new reports.

So, if you are a price-conscious Windows user and need a BI tool that anyone can use, have a look at Microsoft's Power BI.

Thursday, November 16, 2017

Why WINDOW functions are changing the game for MySQL

Back when we were deciding on an SQL flavor at Zervant, the fact that MySQL didn't have WINDOW functions was a deal breaker. We had plenty of MySQL experience and not that much in PostgreSQL but I had to go with PostgreSQL anyway - just because of WINDOW functions. But that's about to change, as MySQL is getting this support soon, too.

Imagine you need to calculate the number of active users you have on any given day. Now imagine that you have defined an active user as someone who has done something in your service within e.g. 30 days. The something might be uploading a file, downloading something, making a purchase, editing their profile - it doesn't matter. And imagine you want to split this by dimensions.

Without WINDOW functions you don't have good options. You can accomplish this either with subqueries, which get very expensive:
         -- assuming t has date and count columns
         SELECT t.date
              , (SELECT SUM(t2.count) FROM t AS t2 WHERE t2.date <= t.date) AS cumulative_sum
         FROM t;
or MySQL user variables, which cannot be used in a VIEW:
         SELECT (@running_total := @running_total + t.count) AS cumulative_sum
         FROM t
         JOIN (SELECT @running_total := 0) r;
There might be a way to use a stored procedure and a table, but this gets complicated to implement.
Code credits: MagePsycho.

Window functions to the rescue

With WINDOW functions, MySQL takes a step towards being a viable alternative for an analytics database. With a WINDOW function you can easily calculate a running total over any dimensions. Placing this logic in a VIEW makes it even easier to report on:
    -- assuming a table active with date and count columns
    SELECT date
         , SUM(count) OVER (ORDER BY date) AS cumulative_sum
    FROM active;
Simple and effective - and enables the kind of calculations you have to do daily for analytics.
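And if you need the split by dimensions, PARTITION BY handles that too. A sketch, assuming a hypothetical active table with date, country and count columns:

```sql
-- One running total per country (MySQL 8.0+)
SELECT date
     , country
     , SUM(count) OVER (PARTITION BY country ORDER BY date) AS cumulative_sum
FROM active;
```

Each country gets its own independent running total, ordered by date within the partition.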

If I had to make the choice between MySQL or something like PostgreSQL for an analytics database, this change would make me reconsider. How about you?