Monday, December 18, 2017

Spot your churning users sooner with statistics

The girl (or guy) you're dating has called you on average five times a week lately. As a neurotic analytics enthusiast/stalker you happen to know this. Now it's been three days without a call. Should you be worried?

Monday, December 11, 2017

Becoming a data-driven company

How does a company become data-driven? Here's what I've learned at Zervant.

Radiate information

If you want to turn the culture more data-driven, radiate information. You'll never know what will stick. Radiating information also has the side-effect of evangelizing the work of analytics: people will remember you exist and they will take analytics into consideration more easily.
Oh right, when we're building this we need to talk to analytics about what to measure! 
Write an internal blog, give short presentations about an information nugget, etc. Frequency is more important than the perfect quality of content.


People love to learn new things. They will first tell you they're in no way analytical-minded and far from technical, but they will eagerly learn. De-mystifying analytics will help everyone in your organization to speak the same language and be on the same page about what analytics can do for them.

As I've held workshops on analytics, I've seen people light up and apply their newly discovered analytical mind in a sophisticated manner on their own fields including HR, branding, marketing and customer support. It doesn't take much, but it gives back a lot.

Be exemplary in prioritizing projects by their business potential

Who doesn't want to develop a fancy neural network that predicts accurately the next user to convert? But if you ask yourself what you can really do with that information and you struggle to answer, it's a sign that you should perhaps reconsider your priorities.

Everyone in the organization will have opinions on what the most important thing is. Amazingly, the most important thing tends to depend on the person and astonishingly often is related to their own field. This applies to us in data science too. Break free from that, be objective, realistic and show the right kind of example to the rest of the organization. If you have good, concrete arguments of a projects potential, it will be hard to oppose it with only opinions.

Create a data strategy: start from what your business needs, identify the points that have the largest impact, but equally importantly brainstorm new opportunities for use of data. Make the business case clear and be honest about it.

Experiment lightweight

It doesn't matter that you don't have a fancy A/B testing suite, you can still do lightweight experimentation. Are you planning to add a new feature? What if before implementing any of it you just create a mock up with a button to check who clicks it, who's interested? If you do this, don't be a douche. Do it with respect to your users.

The point is not to get it right the first time, but rather to show how easy it is to test something out. This will help people understand the power of lightweight experimentation and they will have lots of new ideas to try out.

Celebrate even the small wins

Everyone loves a good story, even if the story is about how the definition of your active user got improved - as long as you make it relevant to your audience. Celebrating even the small wins is not just good for morale, but it also highlights the impact of this work. In order to continue enjoying the trust of your top management, this is critical.

You won't make it alone

Data-driven doesn't mean you shove data down everyone's throats. Rather, try to find approaches that involves others. Analytics is woven into everything a company does. When it is done as an afterthought, it becomes an isolated activity - something to sprinkle on top by the analytics team before things go live. Preventing this from happening unfortunately means having good personal relationships with the key influencers in your organization. So work on your social networking skills.

Monday, December 4, 2017

Dramatically improve your conversions with session recording

Are you satisfied with your sign-up rate? Or activation rate? Do you think it could be higher? What do you think you can do to improve them? How do you know that?

One of the most powerful things that you can do to improve your activation rate is to watch what your users are doing: what confuses them, what path is logical to them and where they fail. When you do that, you will find out that your call-to-action is too mild, your layout is confusing, your responsive design isn't adapting well in all cases or that the navigation path you thought everyone takes isn't common at all. You cannot solve what you do not know. Luckily most of your issues are likely low-hanging fruits you can find easy remedy to.

There are a gazillion services you can use, HotJar being perhaps the biggest name. But it wasn't the right choice, at least for me. HotJar lets you record towards a quota and you can define pretty flexibly who you want to start recording and when. But HotJar doesn't work with scrollable DIVs and if you would like to see only sessions that include purchasing your product, HotJar will fail you.

If you're not too concerned about GDPR or privacy matters in general, then Yandex Metrica might be a great choice: it's free and from what I can tell fairly powerful. But I care about privacy and you should too.

Then there are decent looking services such as Fullstory, Inspectlet and Mouseflow, but honestly when comparing the bang-for-the-buck they came second to Smartlook. I'm especially fond of the fact that they are taking GDPR seriously.

If you end up going with Smartlook, I recommend taking a look at their tagging. With that, you can add identifier tags to the sessions, e.g.
        // tag a session with a user's email
 smartlook('tag', 'email', ''); 
This is pretty handy if you get feedback from a user about a bug and you just can't reproduce it. This lets you see what really happened.

Or you can tag a session that includes a specific action
 smartlook('tag', 'checkout', 'true'); 
 smartlook('tag', 'contacted', 'true');
Keep in mind that you cannot override a tag you've already used. The following will not work
 smartlook('tag', 'action', 'checkout'); 
 smartlook('tag', 'action', 'contact');
Also, I've found it quite handy to tag with a session number. If you otherwise won't have access to a session number, you can easily create something that would track it in the browser using the local storage for example.

This only gets you so far though. After the easy problems have been fixed, it becomes increasingly harder to find the signal from the noise. By that time though you will have tremendously improved your conversions.

Monday, November 27, 2017

Optimize your marketing efforts with the power of raw data

If you're in marketing (I'm not), you'd want to know how much bang-for-buck your campaigns are producing. Now I'm no Google Analytics expert, but I'm going to make a claim that you too might have an easy time measuring conversions across your campaigns, but not so much how your sign ups differ between them? Is one bringing in more high quality people than others?

If you are running a web shop or some other transaction-based business then sure, you can set up Google Analytics Ecommerce to follow up on customer life time value. But what if you are running a SaaS and you bill monthly? Or you have a freemium model and you hope to monetize the high volume users through some additional features, how would you compare your campaigns then? With only Google Analytics, it is tough to analyze how the users coming through different marketing campaigns differ from each other.

What you need is raw data and to know the path each of your user has taken as they sign up to your service whether paid, direct, organic or other. But Google doesn't give out this information easily. You could invest in Google Analytics 360, which means perhaps $150 000 a year. I love the quote they've selected to highlight by the way:
“Companies spend a lot of money on marketing, it’s difficult to understand which dollars are working and which dollars aren’t. The Google Analytics 360 Suite allows you to actually understand which half is working.”
(Shouldn't Google Analytics be able to do that already? No? Okay, my bad.) If you're unwilling to spend the $150 000 on 360, there is also a hack for querying raw data from Google Analytics. I don't particularly like this approach as it is still dependent on Google to allow this.

You can, however, simulate Google Analytics behavior with your own code and do whatever you wish with the data. Especially if your landing pages are in a different sub domain than your service, you'll might be more successful using a cookie instead of the local storage.

The workflow is simple:
  1. Store your data to the user's browser as they first arrive to your site. At that point you have access to data such as referring URL. This allows you to calculate almost everything, except e.g. auto-tagged adword campaigns (as of November 2017).
  2. Once your user successfully signs up to your service, you will need to fetch the marketing traffic data from the user's browser and
  3. Send the data off to somewhere, along with the ID of the new user.
With the simulated GA cookie script above, you can deduce quite a bit of information from the referrer URL, such as source and medium, and if the URL is UTM tagged, also campaignterm and content. For debugging purposes, it might be helpful to store the full referrer url as well.

We ended up storing the cookie value as a base-64 JSON string. In the below example our valueArray would contain information such as source, medium, campaign, term and content.
// when saving the array to cookie
cookieValue= btoa(JSON.stringify(valueArray));

// when reading the cookie
valueArray = JSON.parse(atob(cookieValue));
Remember to save the cookie domain with the dot in front for it to be accessible in all your sub domains.
document.cookie = 
"cookiename=cookieValue;expires=Sat, 18 Nov 2017 00:00:00 GMT;";
If you prefer to use something more modern than cookies, this library from Zendesk should help you with local storage across subdomains (haven't tried it though).

Once the data is in your analytics database, you will be able to analyze what kind of users you are getting from the different marketing sources and double down on those that bring in more quality sign ups.

It's probably easiest to implement this logic in Google Tag Manager. Have fun! Let me know if you're planning to try this and how it went!

Calculating active users and churn in a database

Have you ever struggled with calculating your active users per day, or month? How about your new activations or churned users? It's a common problem, but often the first solution you find will not scale.

You should avoid sub queries, and use a window function instead. But that's only a part of the solution.

Assuming your active user definition is "someone who has signed in within 7 days" and you have this data in a table with columns user_id, timestamp. The first thing you want to do is calculate when a user activates for the first time, when they churn and when they resurrect.

    , LEAST (LEAD("timestamp") OVER w, "timestamp" + 7) AS "activeExpiry"
    , CASE WHEN LAG("timestamp") OVER w IS NULL THEN true ELSE false AS "activated"
    , CASE
 WHEN LEAD("timestamp") OVER w IS NULL THEN true
 WHEN LEAD("timestamp") OVER w - "timestamp" > 7 THEN true
 ELSE false
 END AS "churned"
    , CASE
 WHEN LAG("timestamp") OVER w IS NULL THEN false
 WHEN "timestamp" - LAG("timestamp") OVER w <= 7 THEN false
 WHEN row_number() OVER w > 1 THEN true
 ELSE false
 END AS "resurrected"
   FROM "login"
   WINDOW w AS (PARTITION BY "user_id" ORDER BY "timestamp")
With this, you've added three boolean columns to your login data: activated, churned and resurrected, which tell you when a user activates for the first time, after which login they churn and which login makes them resurrect. Now your data is in such format that you can calculate active users over any date.

To calculate your daily active users you can apply the following query:
SUM(SUM(activated::int - churned::int + resurrected::int)) OVER w
, "timestamp"::date FROM vw_login
 GROUP BY "timestamp"::date
 WINDOW w AS (ORDER BY "timestamp"::date)
 ORDER BY "timestamp"::date;
Hold on. No you can't.

For churn, you also need to take into account the time-shift: your row indicates that the user churns after that login. The day your user becomes inactive is after 7 days of the login, not on the day of the login. To accomplish this efficiently it's best to first create a view that aggregates the data to a daily level and then do a join.

-- Aggregate daily
    SUM("activated"::int) "activated"
  , SUM("churned"::int) "churned"
  , SUM("resurrected"::int) "resurrected"
  , "timestamp"::date "date"
  FROM "vw_login"
  GROUP BY "timestamp"::date "date"
For the running total you should generate a time series first to ensure that all the dates you are joining the churn to exist. Also, take note on wrapping your values with a COALESCE function as if your SUM includes NULL values, it will return null.
-- Calculate a running total
 , SUM(COALESCE(a.activated::int,0)
   - COALESCE(a2.churned::int,0)
   + COALESCE(a.resurrected::int,0)) OVER w
 , d."date", a."activated", a2."churned", a."resurrected" FROM
 generate_series('2004-03-07'::date, CURRENT_DATE, '1 day'::interval) d
 LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
 LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '7 days')::date
 WINDOW w AS (ORDER BY d."date") ORDER BY d."date";
That's it. That will give you the number of active users per day and the number of users who activated, churned or were resurrected on that day. It's now up to you to figure out what your definition for an active user is and if there are dimensions you need to add. But the basic logic is clear enough.

Let me know if you found this useful!

Wednesday, November 22, 2017

Power BI, Tableau or QlikView

Ok ok I went along with Power BI but hear me out.

A year ago I joined Zervant to be responsible for the company's BI and analytics. Pretty soon I had to make a choice: which BI package should we take into use. Before this, Zervant didn't use anything apart from Excel and a Geckoboard. From my previous positions at Nokia and Microsoft, I have experience in Tableau, QlikView  and Power BI, in that chronological order. It quickly came down to a choice between these three. Or to be honest - between Tableau and Power BI.

Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime.

I'm a big believer of self-service BI. I would rather build an analytics team that enables others solve their decision needs rather than giving the right conclusions. Therefore, building something that can be used by anyone in the company was one of the goals. Also, joining a start-up meant cost was a concern.

QlikView is in my experience complex to set up and maintain. I might be wrong, but I've always found QlikView a bit unapproachable. Powerful, yes, but too much a commitment and an investment. I wanted something that was easy to take into use, and possibly change if we didn't get along. QlikSense Cloud has changed this fortunately, but I'm still getting an allergic reaction from Qlik* (sorry). Maybe it's also the fact that I haven't seen great examples of QlikSense in actual use. Also, when searching for tips on QlikSense it becomes clear that the user community must be significantly smaller when comparing to the other two, which means less help when you run into issues. I believe I'll eventually end up giving QlikSense an honest try, but I will need to have some very convincing reasons to switch. And I wouldn't do it for $25 / user per month.

I like Tableau: the way it looks, integration with R, scripting and the rest of the features. I actually prefer it to Power BI. But what I don't like is its price: $112 / month per user, when you need to connect to a PostgreSQL server and have someone else host the dashboards for you. I also often struggle with the way Tableau assigns my data to dimensions and measures. Back when I was making this decision Tableau would've meant shelling out 2500 € for a year's license, and I wasn't ready to make a commitment of that size just yet. $112 / month per user might be a different story.

I ended up choosing Power BI and while I sometimes get some dissatisfied comments on this decision from Tableau fans, I am completely satisfied with the decision. Power BI is on-par feature-wise with Tableau. You don't have as much power over design, the DAX language is often infuriating, and I have only recently learned to tolerate the way you are forced to use on-premise data gateway despite the fact that your databases are also in the cloud. All these shortcoming are offset with what I am able to accomplish with it for the price: free now and $10 / month per user later. I also really appreciate the custom visuals, despite them often being quite buggy. But keep in mind that Mac users will not be able to develop new reports.

So, if you are a price-conscious Windows user and need a BI tool that anyone can use, have a look at Microsoft's Power BI.

Thursday, November 16, 2017

Why WINDOW functions are changing the game for MySQL

Back when we were deciding on an SQL flavor at Zervant, the fact that MySQL didn't have WINDOW functions was a deal breaker. We had plenty of MySQL experience and not that much in PostgreSQL but I had to go with PostgreSQL anyway - just because of WINDOW functions. But that's about to change, as MySQL is getting this support soon, too.

Imagine you need to calculate the number of active users you have on any given day. Now imagine that the way you have defined an active user is as someone who has done something on your service within e.g. 30 days. The something might be e.g. uploaded something, downloaded something, made a purchase, edited their profile - doesn't matter. And imagine you want to split by dimensions.

Without WINDOW functions you don't have good options. You can accomplish this by either sub queries, which gets very expensive:
         (SELECT SUM(t2.count) FROM t as t2 WHERE  <= AS cumulative_sum
or MySQL variables, which cannot be used in a VIEW:
         (@running_total := @running_total + t.count) AS cumulative_sum
    JOIN (SELECT @running_total := 0) r
There might be a way to use a stored procedure and a table, but this gets complicated to implement.
Code credits: MagePsycho.

Window functions to the rescue

With WINDOW functions, MySQL takes a step towards being a viable alternative for an analytics database. With a WINDOW function you can easily calculate a running total over any dimensions. Placing this logic in a VIEW makes it even easier to do report
FROM active
Simple and effective - and enables the kind of calculations you have to do daily for analytics.

If I had to make the choice between MySQL or something like PostgreSQL for an analytics database, this change would make me reconsider. How about you?