tag:blogger.com,1999:blog-1347005765055550882024-03-23T12:13:50.264+02:00Ilkka PeltolaBusiness, Data and TechnologyIlkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.comBlogger25125tag:blogger.com,1999:blog-134700576505555088.post-64335698774775821932023-12-30T01:18:00.008+02:002024-02-08T18:52:50.453+02:00ThumbmarkJS: A free, open source device fingerprinting JavaScript library for the web<p>I needed a decent JavaScript fingerprinting library. I wanted something that was 'good enough': not crappy, but it didn't need to be perfect. I noticed the great <a href="https://github.com/fingerprintjs/fingerprintjs" target="_blank">FingerprintJS</a>, but sadly, they changed their license to a paid one. Boo! What is a good alternative to FingerprintJS?</p><p>There are alternatives out there too, but to be honest, they all have faults.</p><p></p><ul style="text-align: left;"><li><a href="https://github.com/fingerprintjs/fingerprintjs" target="_blank">FingerprintJS</a> is great, but they're monetizing their product in a way that I don't like. I might need hundreds of thousands of requests per month, but I can't pay thousands of dollars. It doesn't need to be perfect either, so I don't want to pay such a high premium.</li><li><a href="https://github.com/mattbrailsford/imprintjs/" target="_blank">ImprintJS</a> used to be a thing, but it has been archived for a few years now.</li><li><a href="https://github.com/JackSpirou/ClientJS" target="_blank">ClientJS</a> hasn't been updated for a few years either. It is promising, but I find it a little too complicated to extend, and I haven't seen any statistics anywhere on how good it is.</li><li><a href="https://github.com/Rajesh-Royal/Broprint.js" target="_blank">BroprintJS</a> is the new kid on the block and hats off for trying, but it's very limited in its capabilities. And I have my doubts about how much will happen with it.</li></ul><p style="text-align: left;">So, why not create my own, right? Here we go: a free, open-source, easily extendable, easy-to-use fingerprinting library, <b><a href="https://github.com/thumbmarkjs/thumbmarkjs" target="_blank">ThumbmarkJS</a></b>.🎉🎉🎉</p><p style="text-align: left;">Yes, please check out the <a href="https://github.com/thumbmarkjs/thumbmarkjs" target="_blank">GitHub repository</a> and star it if you could 😊. Feel free to fork it too. It's open source. You can also find the npm package at <a href="https://www.npmjs.com/package/@thumbmarkjs/thumbmarkjs" target="_blank">@thumbmarkjs/thumbmarkjs</a>.</p><p style="text-align: left;"><br /></p><p style="text-align: left;">I'm at the point where it is meaningful for me to get some field data from others. If you want to help, please go to the following page:</p>
<p style="text-align: center;"><a href="https://thumbmark.s3.eu-central-1.amazonaws.com/index.html"><button>Get and log your fingerprint</button></a></p>
<p style="text-align: left;">The page is a demo of the free fingerprinting library. It will give you your fingerprint as well as show the components it consists of. It will log your fingerprint for analysis. The data is anonymous and is <u>only used to improve the library</u>.</p><h2 style="text-align: left;">How to use it</h2><p style="text-align: left;">Usage is simple. just include the library from the CDN and call it, like so:</p>
<pre class="javascript"><code><script src="https://cdn.thumbmarkjs.com/latest/Thumbmark.js"></script>
<script>
ThumbmarkJS.getFingerprint().then(
    function(fp) {
        console.log(fp);
    }
);
</script></code></pre>
<h2 style="text-align: left;">How does it work?</h2><p style="text-align: left;">Like so many other fingerprinting libraries, this one relies on common things such as</p><p></p><ul style="text-align: left;"><li>audio fingerprint</li><li>canvas fingerprint</li><li>webgl fingerprint</li><li>available fonts and how they render</li><li>videocard</li><li>browser languages and time zone</li><li>browser permissions</li><li>available plugins</li><li>a ton of screen details including media queries</li></ul>... and a bunch of smaller things<p></p><p>If you are interested in a more technical explanation of how the library works, head over to the more <a href="https://github.com/ilkkapeltola/thumbmarkjs/blob/main/technical_details.md" target="_blank">technical details document</a>.</p><h2 style="text-align: left;">What makes ThumbmarkJS better than the rest?</h2><p style="text-align: left;">Well, it's certainly not yet the technical implementation. It doesn't have unit tests and it hasn't been field-tested very well. But it's very well structured and clean. Anyone who knows Typescript will have no issue understanding how to work with it or improve on it. Also, my field is in data analytics and I've put more effort in setting up ways I can easily analyze the data friendly viewers have logged through the <a href="https://thumbmark.s3.eu-central-1.amazonaws.com/index.html" target="_blank">Log your Fingerprint</a> page. (Please do me a favor and pay that page a visit would you?)</p><p></p>Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-18936524505697492332022-03-16T14:59:00.005+02:002024-03-07T11:18:28.735+02:00How to get source and medium programmatically with JavaScript - just like Google Analytics<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhiuqQGBfp0qDEx_LnF-LOw_mCBbKzmIajz-Jo6mS9kt4indHCOCvSTIbC9zkkrYYmQA6NZmZCA0_NYPUx0Zk7EtfDiDvZoTTeRLxJ4BjdZFSaBWHCODOYGov0pF-mEd6JUWesRpwsYuT-VlBzjDsshVla_4N2FH8UO-8ZoY5E9ocioPwXml45dmpVG=s1332" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="800" data-original-width="1332" height="384" src="https://blogger.googleusercontent.com/img/a/AVvXsEhiuqQGBfp0qDEx_LnF-LOw_mCBbKzmIajz-Jo6mS9kt4indHCOCvSTIbC9zkkrYYmQA6NZmZCA0_NYPUx0Zk7EtfDiDvZoTTeRLxJ4BjdZFSaBWHCODOYGov0pF-mEd6JUWesRpwsYuT-VlBzjDsshVla_4N2FH8UO-8ZoY5E9ocioPwXml45dmpVG=w640-h384" width="640" /></a></div><br /><p>To make smart marketing decisions, you need to know what the Return On Marketing Investment (ROMI) is. When you're a webshop that does immediate transactions, it's easier to set up Google Analytics to serve you. However, if you need to understand longer customer relationships, Lifetime Values, etc, you need to get your hands on raw data.</p><p>The obvious idea that comes to mind is: "I just need to get the source, medium, campaign etc. data per each visitor" and you quickly realize that Google Analytics doesn't allow you to do that.</p><p>You need a custom solution. 
I have found two solutions online that solve the problem:<br /></p><p></p><ol style="text-align: left;"><li><a href="https://stackoverflow.com/questions/60416761/how-can-we-get-traffic-source-data-without-the-utmz-cookie" target="_blank">This one</a> with FirstSession and ReturningSession cookies set</li><li>The <a href="https://measureschool.com/track-initial-traffic-source/" target="_blank">Lunametrics one</a>, which is a bit more verbose, but has a more extensive list of search engines</li></ol><p style="text-align: left;">These solutions are just fine, but no one's maintaining them. They're old. They don't evolve. And they don't support any paid channels other than Google search. What about Bing? Facebook?</p><p style="text-align: left;">Also, both of these solutions combine two very separate problems into one:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>Deciphering what the traffic data is (source, medium, campaign etc.), and</li><li>Storing it for later retrieval.</li></ol><p style="text-align: left;">And the second part happens through cookies. It's 2022 and cookies could soon be dead. Then what?</p><h2 style="text-align: left;">VisitDataJS: the open source Google Analytics data emulator</h2><p style="text-align: left;">I've written a small open-source JavaScript library to tackle the first problem - getting Google Analytics data such as source and medium programmatically through JavaScript - called <a href="https://github.com/ilkkapeltola/visitdata" target="_blank">VisitDataJS</a>.</p><p style="text-align: left;">Usage is simple and you should refer to the <a href="https://github.com/ilkkapeltola/visitdata#readme" target="_blank">readme</a>, but simply include the JavaScript and run it, like so:</p>
<pre><code class="html hljs"><script src="https://cdn.jsdelivr.net/npm/visitdata/dist/visitdata.umd.js"></script>
<script>
console.log( visitData.get() );
</script>
</code></pre>
<p style="text-align: left;">Here's the results for you personally, how you arrived to this page:</p><p style="text-align: left;">
</p><div>
<pre> <code class="javascript hljs" id="visitdatajs-output">
</code>
</pre>
</div>
<script src="https://cdn.jsdelivr.net/npm/visitdata/dist/visitdata.umd.js"></script>
<script>
document.getElementById('visitdatajs-output').innerHTML = JSON.stringify(visitData.get(), {}, 2);
</script>
<h2 style="text-align: left;">GlobalStorage: the open source, cross domain browser storage that doesn't run on cookies</h2><p style="text-align: left;"><b>GlobalStorage</b> is an open source javascript library you can include from anywhere, and use it just like you use the browser's <b>localStorage</b>, except that you can access the same data from any website! Here's how it works:</p>
<pre><code class="html hljs"><script src="https://cdn.ilkkapeltola.com/global-storage/latest/globalStorage.js"></script>
<script>
globalStorage.init().then(() => {
    globalStorage.setItem("your-key-name", { foo: "bar" } );
});
</script>
</code></pre>
<p style="text-align: left;">The above code will store the object { foo: "bar" } to the key 'your-key-name', which can then be fetched from another web page by calling getItem. You should check the <a href="https://github.com/ilkkapeltola/global-storage" target="_blank">readme</a> for the library as there are some default behavior of this you need to be aware of, such as how do you control who has the right to read the data.</p><p style="text-align: left;">Years ago already, Zendesk did a marvelous job with <a href="https://github.com/zendesk/cross-storage/" target="_blank">Cross Storage</a>, but there was one thing I think they got wrong: forcing you to host your own 'Hub' file to have control over who has access to your data. So I took a lot from Zendesk's cross storage but made a completely new version that achieves something similar, but more easily. While Zendesk's Cross Storage requires you to <b>also</b> host a hub.html file and configure it for yourself, my solution works easily without the need for you to host your own file. You can just use the one I hosted - but you still have control over your data.</p><h2 style="text-align: left;">Incredible value having access to this data</h2><p style="text-align: left;">We've gained an incredible amount of value by having access to this data. It's not about being able to replicate GA:s reports, but instead, enriching our user base with new dimensions that we can use as a lens into business metrics. For example:</p><p style="text-align: left;"></p><ul style="text-align: left;"><li>Are users acquired by paid acquisition more likely to churn than those who we acquire by organic means? </li><li>Which campaigns are most likely to convert people into active and paying customers?</li><li>What is the lifetime value of a new conversion, split by cpc vs organic?</li></ul><div>What we've found out for example is that CPC traffic can convert, often better than organic, but the bottom of the funnel often performs worse. We've been able to identify Google Ads campaigns that bring in signups, but never bring paying customers, and cancel them. We've seen how affiliate networks we've tried have lead to fraudulent accounts and no business. These things are difficult to prove without granular, per account data.</div><p></p><p></p><p></p><p></p>Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-20861910101067464212021-04-14T09:55:00.008+03:002022-03-24T09:44:33.402+02:00How to access AWS S3 with pyspark locally using AWS profiles tutorial<p>At <a href="https://www.zervant.com" target="_blank">Zervant</a>, we currently use databricks for our ETL processes, and it's quite great. However, there's been some difficulty in setting up scripts that work both locally and on the databricks cloud. Specifically, databricks uses their own prorpietary libraries to connect to AWS S3 based on AWS hadoop 2.7. That version does not support accessing using AWS profiles.</p><p>Internally, we use SSO to create temporary credentials for an AWS profile that then assumes a role. Therefore, reading the ACCESS_ID and ACCESS_SECRET from the .credentials file is something we don't want to do.</p><p>In order to accomplish this, we need to set two hadoop configurations to the Spark Context</p><p></p><ul style="text-align: left;"><li>fs.s3a.aws.credentials.provider</li><li>com.amazonaws.auth.profile.ProfileCredentialsProvider</li></ul><p></p><p>This is done by running this line of code:</p>
<pre><code class="python">sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
"com.amazonaws.auth.profile.ProfileCredentialsProvider")</code></pre>
<p>Note! You need to set your environment variable AWS_PROFILE to the profile name you want to use.</p><p>Databricks uses Hadoop 2.7, and that version doesn't support the above-mentioned credentials provider. Instead, we need a newer Hadoop version 3 locally.</p><p>The full list of files in our current setup is as follows:</p><p></p><ul style="text-align: left;"><li>spark-3.0.2-bin-hadoop3.2</li><li>hadoop-aws-3.2.2</li><li>aws-java-sdk-bundle-1.11.563</li></ul><p></p><p>Picking the correct hadoop-aws version (3.2.2) is straightforward, since the Spark distribution above bundles Hadoop 3.2. Picking the right aws-java-sdk-bundle version requires that you look into the <a href="https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.2.2" target="_blank">hadoop-aws-3.2.2 dependencies</a>.</p><p>Our setup is compatible with the following MySQL, PostgreSQL and Snowflake libraries:</p><p></p><ul style="text-align: left;"><li>snowflake-jdbc-3.12.17</li><li>spark-snowflake_2.12-2.8.4-spark_3.0</li><li>mysql-connector-java-8.0.23</li><li>postgresql-42.2.19</li></ul><p style="text-align: left;">When everything is done, running this will work:</p>
<pre><code class="python">
import findspark, os
os.environ['AWS_PROFILE'] = "your-aws-profile" # In case you haven't already set the ENV variable
# The jars must be on the submit args before the SparkContext (and its JVM) is created
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] --jars spark-snowflake_2.12-2.8.4-spark_3.0.jar,postgresql-42.2.19.jar,mysql-connector-java-8.0.23.jar,hadoop-aws-3.2.2.jar,aws-java-sdk-bundle-1.11.563.jar'
findspark.init()
import pyspark
from pyspark.sql.session import SparkSession
sc = pyspark.SparkContext(appName="Pi")
# Read credentials from the AWS profile named in AWS_PROFILE
sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
spark = SparkSession(sc)
s3File = sc.textFile("s3a://bucket/file")
s3File.count()</code></pre>Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-17352596951280201872021-01-08T17:28:00.011+02:002021-06-02T12:43:58.254+03:00Don't use gross churn to compare SaaS businesses<p style="text-align: left;">Whether you're optimizing your ad spend or in talks with investors about what your SaaS business is worth, you need to calculate your customer lifetime value right. And while there are good shortcuts, using them <b>can easily lead to a completely wrong number</b>, resulting in really bad decisions. In my role I've come to realize that the shortcuts rarely work well enough.</p><p style="text-align: left;">In this article I'll cover the following challenges in calculating churn:</p><div><ol style="text-align: left;"><li>Fluctuating / seasonal churn</li><li>Churn-and-return customers</li><li>Churn of different price tiers is asymmetric</li><li>Churn is non-linear over time</li><li>Saturated markets</li></ol><div>And finally the best way to calculate lifetime value that account for these challenges.</div></div><h2 style="text-align: left;">Calculating lifetime from retention</h2><div><p>Typically lifetime value is calculated by dividing your average monthly revenue per account (ARPA) by your monthly churn. You can get your ARPA by dividing your monthly recurring revenue (MRR) by the number of paying customers you have. And so, if your ARPA is 10 € and your monthly churn is 2%, your lifetime value would be 10/0.02 = 500.</p><p>Have you thought about this rule of thumb, or why it would work? E.g. if you start with 100 customers and your monthly churn is 2% this rule of thumb would give you a lifetime of 50 months (1/2% = 50). Your intuition might tell you that indeed, if you have 100 customers and you lose 2 of them every month, after 50 months you've reached zero.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhn2-8eQX5ZbxJTbPgfV1o3aRey23jLu4zSk1c8x2farmz4r-zSbUEY2UVA9Yckjlw4oDgi6YaeAqe01oHl7Lc4YwO5aGgb95nauTC1AzGPl_2n_vLM5mzSZ4K_2fkBDyRNsHe47uCjmpc/s600/Accounts+left+vs.+Month.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="371" data-original-width="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhn2-8eQX5ZbxJTbPgfV1o3aRey23jLu4zSk1c8x2farmz4r-zSbUEY2UVA9Yckjlw4oDgi6YaeAqe01oHl7Lc4YwO5aGgb95nauTC1AzGPl_2n_vLM5mzSZ4K_2fkBDyRNsHe47uCjmpc/s320/Accounts+left+vs.+Month.png" width="320" /></a></div><p>But this doesn't make any sense.</p><p>The point of the exercise is to figure out the expected lifetime of a single customer, right? In the above example, only 2 customers out of 100 reached a lifetime of 50 and the rest churned earlier, therefore the average lifetime is nowhere near 50. In fact, if the above chart was the correct intuition, the lifetime would be only 25 months, not 50.</p><p>So why do we use the shortcut of LT = 1 / Churn? Does it work even? It does.</p><div>If you're a math person, you'll realize that the remaining population in percentages follows this formula:</div></div><div><pre style="text-align: left;">pop(m) = 0.98^m</pre>where m is the month. 
Every month 98% remains from the previous month.<br />At month 50, pop(50) = 0.98^50 = 0.36.</div><p style="text-align: left;">Plotted over months it looks like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtRvcEiAq8sJ2q_j2FtaGNILuJEspeGVMET3EAUunw2bOZLBpS6pHP3aHHtqjGwac72FJQwfOm885VLFzu1YVLbZJkbtg9GSmX7CJfH9_RN_kajfIgeb6xty5s_HpxfUFh-gvraTK3z3k/" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="493" data-original-width="879" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtRvcEiAq8sJ2q_j2FtaGNILuJEspeGVMET3EAUunw2bOZLBpS6pHP3aHHtqjGwac72FJQwfOm885VLFzu1YVLbZJkbtg9GSmX7CJfH9_RN_kajfIgeb6xty5s_HpxfUFh-gvraTK3z3k/s320/Survival+of+1000+customers+with+2%2525+churn.png" width="320" /></a></div><p style="text-align: left;">The average lifetime is actually the area below the blue line, when m approaches infinity. If you take the integral of 0.98^m, you'll eventually arrive at -1/ln(0.98), which is about 49.5. Not far from the 50 the shorthand gives.</p><p style="text-align: left;"></p><div style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"></div><p></p><div><p style="text-align: left;">So the shortcut works in this scenario, but what if your churn fluctuates a lot month-over-month? Which percentage is correct to use?</p><h2 style="text-align: left;">Traps to avoid</h2><h3 style="text-align: left;">Fluctuating churn</h3><p style="text-align: left;">If your business is seasonal, you might see your churn go up and down on a monthly basis. Can you just take the long term average? Can you use an arithmetic average, or should you rather use a geometric one? What lifetime value should you be using when trying to figure out <b>what you can afford to pay</b> for acquiring a new customer?</p><p style="text-align: left;">Fluctuating churn has fairly little impact in itself. Consider the following churn sequence: 0%, 4%, 0%, 4%, 0%, 4%, 0%, 4%. It's obvious the arithmetic mean is 2%. What is the geometric mean? It's 2.0204%. So hardly any effect.</p><p style="text-align: left;">But fluctuation like this tells about a phenomenon. Is your business seasonal? Are customers coming back after subscribing? Are you using gross churn to try to estimate lifetime value? If so, you're horribly undervaluing your business.</p><h3 style="text-align: left;">Churn-and-return customers</h3></div><p style="text-align: left;">Whether your business is seasonal or not, you might have a lot of customers who pay for the subscription for a while, churn and after a while re-subscribe again. Typically, investors estimate the LTV of a SaaS business by its gross churn, in order to have a "comparable benchmark". This assumes that the behavior of subscriptions is similar, and that very insignificant amount of customers come back after they've churned once.</p><p style="text-align: left;">Especially when comparing freemium-model businesses to pure SaaS businesses with only free trials, this benchmarking fails miserably. Consider <a href="https://www.zervant.com/en/pricing/" target="_blank">Zervant</a> vs. 
<a href="https://invoice.2go.com/plans-pricing/" target="_blank">Invoice2Go</a> for example. A customer on the "Starter" plan in Zervant might need to send an e-invoice every now and then, but not continuously throughout the year. When they churn, they will fluently continue to use the free tier of Zervant and will subscribe again once they need e-invoicing.</p><p style="text-align: left;">Someone on Invoice2Go "Lite" plan will unlikely cancel their subscription unless they are closing their business completely or switching their invoicing platform.</p><p style="text-align: left;">On Zervant, returning customers will offset the <i>easier churn</i> due to the free tier. If the <b>goal is still to estimate lifetime value</b>, then using gross churn is <i>grossly wrong</i> and even as a benchmark does not work.</p><p style="text-align: left;">A good way to combat this is using <b>net churn</b> instead, where you substract returning customers from those who have churned. This is a great step forward, but still not ideal.</p><h3 style="text-align: left;">Churn of different price tiers is asymmetric</h3><div><div><p style="text-align: left;">It's easy to assume that churn is the same over different price tiers, unless you've studied them. What I expect to often happen though is that customers on higher price plans tend to be more committed to the service and therefore less likely to churn. If this is the case, what you will see is your ARPA keeps climbing a little bit month-over-month.</p><p style="text-align: left;">If every time you update your LTV calculations you notice your ARPA to have again increased a little bit, you might start to wonder what long-term ARPA should you use then.</p><p style="text-align: left;">A simple solution for this is to calculate the LTVs separately for the different price tiers and then take a weighted average.</p><p style="text-align: left;">So now we've realized that both your churn and ARPA are not quite exact values, despite what many popular articles would have led you to believe. So what to do then?</p><h3 style="text-align: left;">Churn is non-linear over time</h3><p style="text-align: left;">A customer who's been with you for two years has a much lower likelihood of churn, than someone who just started to use your service. If you simply look at a month's churn, how much of that is due to "young customers" and how much of that describes your long term churn?</p><p style="text-align: left;">Ceteris paribus, you will notice is that churn will decrease overtime. You'll keep adding new customers to your product, the hasty ones will churn and the loyal will stay, growing in numbers. 
Taking a churn rate today doesn't represent the long-term churn that could be used to calculate an accurate lifetime value.</p><h3 style="text-align: left;">Saturated market illusion</h3><p style="text-align: left;">The saturated market illusion is an extreme example of the previous topic: non-linear churn over time.</p><p style="text-align: left;">Imagine you're comparing different markets you're operating in and see these two alternatives:</p><p style="text-align: left;"></p><ul style="text-align: left;"><li>Market A:</li><ul><li>500 € CAC</li><li>50 € ARPA</li><li>2% Churn</li></ul><li>Market B:</li><ul><li>800 € CAC</li><li>50 € ARPA</li><li>1% Churn</li></ul></ul>With a quick calculation, Market B has an LTV over CAC of 6.25, compared to 5 for Market A, so it looks like the better place to spend. But if Market B is simply a more saturated market, its customer base is older and more loyal, and its measured churn is lower for that reason alone - the new customers you acquire there may churn just as fast as anywhere else.<p></p><h2 style="text-align: left;">Calculate lifetime through survival</h2></div><p style="text-align: left;">You need to look at your customers through survival, where the time axis is not a calendar month, but relative to when they became a customer.</p></div><p style="text-align: left;">When you observe your users like this, you'll notice something interesting. If you place your paying customers on the same duration axis, starting from month 0, and observe how long they stay as paying customers, you'll probably see that there's very high churn in the beginning, but it starts to level off.</p><p style="text-align: left;">A while back I wrote about the Kaplan-Meier survival analysis that helps you understand churn through survival. When you run that analysis, you can see what percentage of your customers are left at months one, two, ten etc., all the way that you have history for. It's exactly like a cohort analysis, but it summarizes all the cohorts together.</p><p style="text-align: left;">What you will notice is that churn is not uniform over time. Instead, the longer someone has been your customer, the less likely they are to churn the next month, but in the beginning, churn is very high.</p><p style="text-align: left;">Survival looks like the blue line below.</p><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyEPqNdBdAjtIp2j_gK9j6Q5Qj4KCvzMsplvTiV125LsvvA4tlfsSoisRyFJY0f2M6okceNaVB8ExdutvyyG2DttuuHYzVY4tIeK1NDWp7ghYQuMpIGWYt3jtCP0F3LBmy1THddHeaVoE/s600/chart+%25281%2529.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="371" data-original-width="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyEPqNdBdAjtIp2j_gK9j6Q5Qj4KCvzMsplvTiV125LsvvA4tlfsSoisRyFJY0f2M6okceNaVB8ExdutvyyG2DttuuHYzVY4tIeK1NDWp7ghYQuMpIGWYt3jtCP0F3LBmy1THddHeaVoE/s320/chart+%25281%2529.png" width="320" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><br /></div><div>If survival looks like this for you too, you need to find a more accurate way to calculate the area under the red line. If at month 12 your survival shows 59%, then you get your average churn by taking the 12th root of 59% and subtracting the result from 100%. You should arrive at a monthly churn of 4.3%. The quickly calculated lifetime for that is 1/4.3% = 23 months.</div><p style="text-align: left;">But that is pessimistic for you, as with that, you are essentially calculating the area under the red line above, not the blue. 
To get a more accurate number, you can observe survival for a longer period, or you can try to see at what point monthly survival starts to stabilize and calculate the area under the curve before that and after separately.</p><p style="text-align: left;">What I've noticed in our customer base is that observing through survival analysis, the retention from month to the next improves (i.e. churn decreases) slowly, but exponentially. For the math people:</p><p style="text-align: left;"><code>churn(m) = b * a^m</code></p><p style="text-align: left;">Where b is the base starting churn, a is the rate at which churn decreases and m is the month.</p><h3 style="text-align: left;">Revenue-based survival</h3><p style="text-align: left;">A long-term customer is more likely to continue next month, that much is certain, but when taking revenue into the picture the same applies: a customer who's already paid a lot is more likely to pay even further. The reason for this is unlikely the sunk cost fallacy, but rather that a more committed customer is likely to pay more.</p><p style="text-align: left;">If your SaaS business has different value tiers for example, this applies to you. Someone who is willing to pay more for your premium packages likely has a very good fit to your offering, and thus is again less likely to churn.</p><p style="text-align: left;">When inspecting from a lifetime value point of view, someone who pays 5x your typical ARPA in a month will reach your average LTV much sooner than normal and even overshoot it.</p><p style="text-align: left;">Thus, one way to look at average lifetime value is to do a "revenue survival" plot. Instead of calculating the survival from 0 months to 1,2...etc, you calculate survival from 0€ to 10€,20€ and so on. Someone on a higher value tier will much more quickly advance this axis than someone on a lower tier, which in a more fair way gives weight to the high-value customers.</p><p style="text-align: left;">If you plot your customers this way, you might notice that indeed, customers who have paid more have less churn. The revenue survival chart might look like this:</p><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhf8LeWFqy81Otc5YMDow__spEEsbuyO39Jbqkjyqg0BXzFgRe5-9gspupBV6j5Mv0g8GJ7g3fXLpqXua53wNN7rTvMXluB5Z1XRPhACmy-3KU4Nw2MlzW50yDU0xsMYuEJNyuPZB2x18A/s600/Survival+by+paid+revenue+%25282%2529.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="371" data-original-width="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhf8LeWFqy81Otc5YMDow__spEEsbuyO39Jbqkjyqg0BXzFgRe5-9gspupBV6j5Mv0g8GJ7g3fXLpqXua53wNN7rTvMXluB5Z1XRPhACmy-3KU4Nw2MlzW50yDU0xsMYuEJNyuPZB2x18A/s320/Survival+by+paid+revenue+%25282%2529.png" width="320" /></a></div><p style="text-align: left;">To get the customer lifetime value from this, you simply need to figure out the area under the curve for one customer. You might notice it is higher than by multiplying LT with ARPA.</p><h2 style="text-align: left;">Conclusions</h2><p style="text-align: left;">Calculating lifetime value is much more than ARPA over gross churn. 
Whether you're an investor or a business owner, using that shorthand is dangerous as it will, depending on your circumstances, lead to over- or undervaluing the business.</p>Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-37313324556174546732021-01-08T14:08:00.004+02:002021-01-08T15:07:41.922+02:00Snowflake UPSERT operation (aka MERGE)<p>You want to insert data into a table, but if a corresponding row already exists (by some rule, e.g. a unique key) you want to update that row instead of adding a new one, keeping the dataset's uniqueness requirements intact. That's an "update or insert" operation, or UPSERT. Some SQL dialects have native support for it: <a href="https://wiki.postgresql.org/wiki/UPSERT" target="_blank">PostgreSQL has UPSERT</a> natively, and MySQL supports the operation with INSERT ... ON DUPLICATE KEY UPDATE. How do you do UPSERT on Snowflake?</p><p>Here's how:</p><h2 style="text-align: left;">Snowflake UPSERT i.e. MERGE operation</h2><div>Snowflake's UPSERT is called <a href="https://docs.snowflake.com/en/sql-reference/sql/merge.html" target="_blank">MERGE</a> and it works just as conveniently. It just has a different name. Here's the simple usage:</div><div><br /></div><div></div>
<div><code class="hljs sql"><pre>MERGE INTO workspace.destination_table d
USING workspace.source_table s ON d.id = s.id AND d.val1 = s.val1
WHEN MATCHED THEN UPDATE SET d.val2 = s.val2, d.val3 = s.val3
WHEN NOT MATCHED THEN INSERT (id, val1, val2, val3)
VALUES (s.id, s.val1, s.val2, s.val3);</pre></code>
</div>
<div><br/></div>
<div>Here the destination_table and source_table are of similar form, but they don't have to be. The above query uses both <b>id</b> and <b>val1</b> as the unique combination and updates values to val2 and val3, or if the combination of id and val1 doesn't exist, will insert that combination as a new row.</div>Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-65598254142520609562019-04-27T16:12:00.001+03:002019-04-27T16:29:23.591+03:00AWS Glue python ApplyMapping / apply_mapping example<div class="separator" style="clear: both; text-align: center;">
<img border="0" data-original-height="600" data-original-width="1200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEvD3ebK97HBgUUkftgH5hmaz8S3aDxUcDnVdTwGlPOhjtuCBvHODaRKdX3AeOexpeaiPDOgI6lEiMF1ys5KMf94ZHXwRwAFsEATc77MQDs8lGtQKfjYKsdpFQ-Lw4me25WtxYOoFiaHk/s1600/Apply+Mapping+with+example.png" /></div>
<br />
The <a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-ApplyMapping.html" target="_blank">ApplyMapping</a> class is a type conversion and field renaming function for your data. To apply the map, you need two things:<br />
<ol>
<li>A dataframe</li>
<li>The mapping list</li>
</ol>
<div>
<a name='more'></a>The <b>mapping list</b> is a list of tuples that describe how you want to convert your types. For example, if you have a data frame like this:<br />
<br />
<pre> | old_column1 | old_column2 | old_column3 |
| "1286" | 29 | "foo" |
| "38613390539" | 386 | "bar" |</pre>
And you apply the following mapping to it:<br />
<br /></div>
<pre><code class="python">your_map = [
    ('old_column1', 'string', 'new_column1', 'bigint' ),
    ('old_column2', 'int', 'new_column2', 'float' )
]</code></pre>
This would rename <code>old_column1</code> to <code>new_column1</code> and cast its string contents to bigint. Similarly it would rename <code>old_column2</code> to <code>new_column2</code> and cast it from int to float. <code>old_column3</code> will be omitted from the results. Rows that <b>cannot be mapped</b> in the way you instruct will be filtered.<br />
<br />
<pre> | new_column1 | new_column2 |
| 1286 | 29.0 |
| 38613390539 | 386.0 |</pre>
to apply:<br />
<br />
<pre><code class="python"># you need to have aws glue transforms imported
from awsglue.transforms import *
# the following lines are identical
new_df = df.apply_mapping(mappings = your_map)
new_df = ApplyMapping.apply(frame = df, mappings = your_map)</code></pre>
If your columns have nested data, then use dots to refer to nested columns in your mapping. If your column names have dots in them (e.g. you have relationalized your data), then escape column names with back-ticks.<br />
<br />
For example:<br />
<br />
<pre><code class="python">
your_map = [
    ('old.nested.column1', 'string', 'new.nested.column1', 'bigint' ),
    ('`old.column.with.dots1`', 'int', 'new_column2', 'float' )
]</code></pre>
<br />
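Putting it together, here is a minimal sketch of how ApplyMapping might sit inside a Glue job (the database, table and S3 path below are placeholders, not from the original example):<br />
<br />
<pre><code class="python">from awsglue.transforms import ApplyMapping
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read a table from the Glue Data Catalog (placeholder names)
df = glueContext.create_dynamic_frame.from_catalog(
    database = "your_database",
    table_name = "your_table")

# Rename and cast columns as described above
mapped = ApplyMapping.apply(frame = df, mappings = [
    ('old_column1', 'string', 'new_column1', 'bigint'),
    ('old_column2', 'int', 'new_column2', 'float')
])

# Write the result back to S3 as parquet (placeholder path)
glueContext.write_dynamic_frame.from_options(
    frame = mapped,
    connection_type = "s3",
    connection_options = {"path": "s3://your-bucket/your-prefix/"},
    format = "parquet")</code></pre>
<br />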
ApplyMapping returns <b>only mapped columns</b>. Columns that aren't in your mapping list <b>will be omitted from the result.</b> So you need to include in the mapping all fields that you want in the result, even if no conversion is made.Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-83871806718371183512019-04-05T09:49:00.004+03:002019-04-05T09:57:44.436+03:00The Glue code that runs on AWS Glue and on Dev Endpoint<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6lxOqUcMI3-xTfWdRl-1iA7SDSeyw3Vy2g7tcsuLVgqpATTZ-FDhnqZFx6k6VXWMGRbPsZOlexAE9pvugbfSZf3ES_DM390FhP7YyaM8w7dBr1k0hI9sXaOD85tX6Uu3FZ8o8XbyZtZA/s1600/glue.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="383" data-original-width="830" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6lxOqUcMI3-xTfWdRl-1iA7SDSeyw3Vy2g7tcsuLVgqpATTZ-FDhnqZFx6k6VXWMGRbPsZOlexAE9pvugbfSZf3ES_DM390FhP7YyaM8w7dBr1k0hI9sXaOD85tX6Uu3FZ8o8XbyZtZA/s1600/glue.PNG" /></a></div>
<br />
When you develop code for Glue with the <a href="https://www.ilkkapeltola.fi/2019/03/develop-your-first-aws-glue-job-with.html" target="_blank">Dev Endpoint</a>, you soon get annoyed with the fact that the code is different on Glue vs. on the Dev Endpoint:<br />
<ul>
<li>glueContext is created in a different manner</li>
<li>there's no concept of 'job' on dev endpoint, and therefore</li>
<li>no arguments for the job, either</li>
</ul>
<div>
So Mike from <a href="https://mistheorist.blogspot.com/" target="_blank">The MIS Theorist</a> asked if there was a simpler way. And sure there is!<br />
<br />
<a name='more'></a><br />
<h3>
Template boilerplate Glue code</h3>
</div>
<div>
<br /></div>
<pre><code class="python">import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# Some common stuff not needed in this boilerplate
#from pyspark.sql.functions import *
#from awsglue.dynamicframe import DynamicFrame
#from awsglue.transforms import *
_dev_ep = False
try:
    ## There's no JOB_NAME in args, so code will raise an exception here
    args = getResolvedOptions(sys.argv, ['JOB_NAME', 'DAYS'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
except Exception as e:
    print("Exception:", e)
    _dev_ep = True
    args = {'JOB_NAME': 'Your Glue Job', 'DAYS': '1'}
    glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

## Do your thing after this line
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "your_database_name",
    table_name = "your_table_name")
datasource0.printSchema()

## Don't change the rest
if not _dev_ep:
    job.commit()</code></pre>
<br />
This makes developing Glue code easier, since you can copy-paste your development code directly into Glue and it still works.<br />
<br />
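As a side note (not part of the original post): when the same script runs as a managed Glue job, the DAYS parameter is passed as a job argument, for example via boto3 - the job name below is just a placeholder.<br />
<br />
<pre><code class="python">import boto3

glue = boto3.client("glue")
# '--DAYS' ends up in sys.argv and is picked up by getResolvedOptions as 'DAYS'
glue.start_job_run(JobName="Your Glue Job", Arguments={"--DAYS": "1"})</code></pre>
<br />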
Let me know if you found this helpful 👇🏻. Cheers! 🙃Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-75867357449446064122019-03-22T21:54:00.001+02:002019-04-27T16:36:01.787+03:00AWS Glue, Dev Endpoint and Zeppelin Notebook<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6lxOqUcMI3-xTfWdRl-1iA7SDSeyw3Vy2g7tcsuLVgqpATTZ-FDhnqZFx6k6VXWMGRbPsZOlexAE9pvugbfSZf3ES_DM390FhP7YyaM8w7dBr1k0hI9sXaOD85tX6Uu3FZ8o8XbyZtZA/s1600/glue.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="383" data-original-width="830" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6lxOqUcMI3-xTfWdRl-1iA7SDSeyw3Vy2g7tcsuLVgqpATTZ-FDhnqZFx6k6VXWMGRbPsZOlexAE9pvugbfSZf3ES_DM390FhP7YyaM8w7dBr1k0hI9sXaOD85tX6Uu3FZ8o8XbyZtZA/s1600/glue.PNG" /></a></div>
<br />
<br />
AWS Glue is quite a powerful tool. What I like about it is that it's <b>managed</b>: you don't need to take care of infrastructure yourself, but instead AWS hosts it for you. You can schedule scripts to run in the morning and your data will be in its right place by the time you get to work.<br />
<br />
The downside is that <b>developing scripts for AWS Glue is cumbersome</b>, a real pain in the butt. I first tried to code the scripts through the console, but you end up waiting a lot only to realize you had a syntax error in your code.<br />
<br />
<a name='more'></a><br />
The documentation of glue programming is not great in my opinion. Also, the script on AWS Glue console differs slightly from the one you would run on the Dev Endpoint (e.g. Dev Endpoint doesn't know about Job objects or parameters).<br />
<br />
So, it takes a bit of trial and error to get going, but once you do, knowing how to launch a Glue Dev Endpoint <b>radically increases the speed </b>at which you can iterate on your Glue ETL scripts.<br />
<br />
<h3>
Starting point for this post</h3>
<br />
Firstly, the point of a Glue dev endpoint is that you get a <b>dedicated Glue instance</b>, just for you, and you don't need to wait. What you need to know about a AWS Glue Dev Endpoint is:<br />
<ul>
<li>It's a dedicated instance for you; it costs money while it's up</li>
<li>It runs Spark. You can develop with scala or python (pyspark)</li>
<li>You need to use SSH keys and SSH tunneling to connect to your Dev Endpoint</li>
</ul>
Also note that we'll use a <b>local Zeppelin Notebook</b> to develop our Glue scripts against the development endpoint. You could host your Zeppelin Notebook on AWS, neatly attached to your Dev Endpoint, but running the notebooks locally also persists them, so I like it better this way.<br />
<br />
Pre-requisites:<br />
<ol>
<li>An IAM role for the Glue Dev Endpoint with the necessary policies. E.g. <b>AWSGlueServiceRole</b>. There are more <a href="https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html">instructions here</a><i>.</i></li>
<li>You have a <b>table in Glue Data Catalog</b> and the necessary connections.</li>
<li>I assume you know your way around VPC networking, security groups etc. E.g. the Dev Endpoint requires a security group that allows inbound port 22, since we need that for the SSH tunneling.</li>
</ol>
<div>
<br /></div>
<h3>
Have an SSH key pair at hand</h3>
<div>
<br /></div>
<ul>
<li>If you don't have a pair, you can create one under EC2 -> Network & Security -> Key Pairs</li>
<li>You need the <b>contents of the public key</b>. The contents look something like this:<br /><pre><code>ssh-rsa AAAAB3NzaC1qph6lpY7D6MSTRIlnU/kc2EAAR4g5QK44bwTuNXPcEFijps1jy6yt243llWn3oScxfNQ/4+tR9m2LxAxkmJRVNcCLTnBTTd0uRxd5nLdGGDYB0xEGNT3VMjxCa2X19BH3gAABJQAAAQEAn6RSOBt/spb6J8HAza6sxjxBXRjAPr7Q66TB/xxA81/oNHWu8WT2eiI/XoVGLCzRPHzreoOytjuqd+u4gqXQM8tpJkAyxXGUm8dzNGIWCgHCvgS7TkcE3eZlQV6oTtK2Q7miJhDk58vqOEiUOHLnkqsgRQ7cVoHA27M+Ng7KlstGbMq1N6cLsOgTTuoioz6Y1V3rVDz2p73kvaYeDlLJg+eApw== rsa-key-20180821</code></pre>
</li>
<li>You need the <b>private key </b>(either .pem file when using ssh or .ppk when using PuTTY)</li>
<li>If using PuTTY and you only have the .pem, <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html">convert .pem to .ppk</a> (Putty Private Key)</li>
</ul>
<br />
Got the contents of your public key, and the .pem (ssh) or .ppk (PuTTY) at hand? Cool. <b>Let's move forward</b>.<br />
<br />
<h3>
Spin up the development endpoint</h3>
Create by going to AWS Glue -> Dev endpoints -> Add endpoint and you should see this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZRp7iXwmylHq53IkfJakdrgd8KW1RiPr66_uOf8aC7WvtaCH_rv1_sXf4wytWBuDIreFTYxNbLLz0WpE6XaNts4k5k1P70PI9ENBuFH5ORFGlJedYPFXe04gM0v0ZbApm2OkVhc_QOs4/s1600/glueDevEp.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="651" data-original-width="748" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZRp7iXwmylHq53IkfJakdrgd8KW1RiPr66_uOf8aC7WvtaCH_rv1_sXf4wytWBuDIreFTYxNbLLz0WpE6XaNts4k5k1P70PI9ENBuFH5ORFGlJedYPFXe04gM0v0ZbApm2OkVhc_QOs4/s1600/glueDevEp.PNG" /></a></div>
<br />
<br />
Give it any name<br />
Choose an IAM role that has e.g. the <b>AWSGlueServiceRole</b> policy attached.<br />
I'm only using 2 DPUs, since that's the minimum and you get charged in proportion to them.<br />
<br />
Click <b>Next</b>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVeqTLtp3t0hDnoiNt3zjYJ5d0jxIkFirXpLF5oK3_S7gWXZqJLC_tkIhkEEl4z8jtqqhIW2KxXBhWWY6wFgKhCgEY7Zt8hIWZviXIfyuOOmQMe7euVxlc5LLMaREl4NmvnvsT0dSC0gc/s1600/glue_vpc_setup.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="569" data-original-width="691" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVeqTLtp3t0hDnoiNt3zjYJ5d0jxIkFirXpLF5oK3_S7gWXZqJLC_tkIhkEEl4z8jtqqhIW2KxXBhWWY6wFgKhCgEY7Zt8hIWZviXIfyuOOmQMe7euVxlc5LLMaREl4NmvnvsT0dSC0gc/s1600/glue_vpc_setup.PNG" /></a></div>
<br />
If your Glue tables are not in S3, then you need to also add the VPC & security group info. Easiest is to choose the connection attached to the table you will be using, and let Glue figure it out for you.<br />
When choosing your subnet, consider your VPC settings: <b>your Dev Endpoint will need a public address</b>.<br />
<br />
Click <b>Next</b>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGy8TPFUwD8gnzy5QDYTeDIqucKXZdc1tVb8E0QeGHxbwWQeP3w5npq8F81wPC6Z5RyANr9jzYO5xeYgxD9k6xv_-3-ADde2Ojv_yW5H3kBY2vvKHBP2kNBF0pMEIA7FSrW1y7SsPRIjY/s1600/glue_ep_add_ssh_key.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="693" data-original-width="725" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGy8TPFUwD8gnzy5QDYTeDIqucKXZdc1tVb8E0QeGHxbwWQeP3w5npq8F81wPC6Z5RyANr9jzYO5xeYgxD9k6xv_-3-ADde2Ojv_yW5H3kBY2vvKHBP2kNBF0pMEIA7FSrW1y7SsPRIjY/s1600/glue_ep_add_ssh_key.PNG" /></a></div>
<br />
Since we are going to use a local Zeppelin Notebook server, then you will need to <b>provide the public SSH key</b> for the Dev Endpoint. This is needed for establishing the SSH tunnel to the Dev Endpoint.<br />
<br />
Go ahead, <b>review and launch</b>. It often takes close to 10 minutes for my Dev Endpoint to provision.<br />
<br />
<h3>
SSH tunnel for Glue Dev Endpoint</h3>
<br />
When your dev endpoint is provisioned, check that its network interface has a <b>public address</b> attached to it and take note of it (e.g. <b>ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com</b>). If it doesn't, you will need to create an elastic IP and attach it to the network interface of the dev endpoint.<br />
<br />
create SSH tunnel, using PuTTY:<br />
<pre><code class="shell">putty.exe -ssh glue@ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com `
-L 9007:169.254.76.1:9007 `
-i your_certificate.ppk</code></pre>
using ssh:<br />
<pre><code class="shell">ssh glue@ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com \
-NTL 9007:169.254.76.1:9007 \
-i ./your_certificate.pem</code></pre>
<br />
<h3>
Zeppelin Notebook</h3>
<a href="https://zeppelin.apache.org/download.html">Download Zeppelin</a> the <b>newest version might not work</b> (<a href="https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-local-notebook.html">check this page</a>), but the 0.7.3 does.<br />
Unpack to a folder.<br />
Launch Zeppelin under bin. This can take a few minutes. When ready, it will output:<br />
<pre><code class="shell">Done, zeppelin server started</code></pre>
<br />
The original instructions are <a href="https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-local-notebook.html">here</a>, but I'm paraphrasing:<br />
Go to <a href="http://localhost:8080/" target="_blank">localhost:8080</a><br />
Top right, click anonymous -> interpreter. Search for <b>spark</b>. Click Edit.<br />
Have <b>Connect to existing process</b> checked<br />
set <b>host</b> to <code>localhost</code> and <b>port</b> to <code>9007</code><br />
Under properties, set <b>master</b> to <code>yarn-client</code><br />
<b>Remove</b> <code>spark.executor.memory</code> and <code>spark.driver.memory</code> properties if they exist.<br />
<br />
<b>Save</b>, <b>Ok</b>.<br />
<br />
<h3>
Write your first Glue script with Dev Endpoint</h3>
<br />
Under <b>Notebook</b>, click <b>+ Create new note</b> and copy-paste below code.<br />
Replace your database and the table name with your own (The ones in your Glue data catalog).<br />
<br />
<b>Notice</b> the way I am creating the glueContext. Also notice how there's no <b>Job</b> object or arguments.<br />
<br />
<pre><code class="python">%pyspark
import sys
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
glueContext = GlueContext(SparkContext.getOrCreate())
# Load data and print its schema
datasource = glueContext.create_dynamic_frame.from_catalog(
    database = "yourGlueDatabase",
    table_name = "yourGlueTable")
datasource.printSchema()</code></pre>
<br />
<br />
<b>Shift + Enter</b> and your code will be shipped off to your Glue Dev Endpoint and in a while, you'll see the schema of your table below the cell.<br />
<br />
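From here you can keep iterating in the same notebook. For example, a small follow-up cell (just a sketch; toDF() converts the Glue DynamicFrame into a regular Spark DataFrame):<br />
<br />
<pre><code class="python">%pyspark
# Peek at the data behind the schema
df = datasource.toDF()
df.show(5)
print(df.count())</code></pre>
<br />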
<b>Voilà</b>. 🎉<br />
<br />
As per Mike's comment below, what if you'd like to use code that works on both the Dev Endpoint and the "managed" Glue? Solved: <a href="https://www.ilkkapeltola.fi/2019/04/the-glue-code-that-runs-on-aws-glue-and.html" target="_blank">The Glue code that runs on AWS Glue and on Dev Endpoint</a><br />
<div>
<br /></div>
Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com5tag:blogger.com,1999:blog-134700576505555088.post-19054507943223900462018-10-02T16:47:00.002+03:002020-11-10T15:12:59.244+02:00Using survival plot to analyze churn in Power BI<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6Z9UKH6WXbLMgpQE9Th0FRoFa7BCX82a-hSOu7NwXDFfmb4fDFcmHtsxAPCBnKPK9JJyJVTghdmvvMUs0ADy-WJsNjzQYikIKThfJiDAYxrqPudjrId9rsoAeUhqW6eC3Qow0H37TyLQ/s1600/km.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="413" data-original-width="703" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6Z9UKH6WXbLMgpQE9Th0FRoFa7BCX82a-hSOu7NwXDFfmb4fDFcmHtsxAPCBnKPK9JJyJVTghdmvvMUs0ADy-WJsNjzQYikIKThfJiDAYxrqPudjrId9rsoAeUhqW6eC3Qow0H37TyLQ/s1600/km.PNG" /></a></div>
<br />
<br />
I did <i>not</i> guess I'd be working with Kaplan-Meier survival plots so soon.<br />
<br />
Analyzing churn and trying to figure out what kinds of users are more likely to churn is not so easy. In order to calculate churn, you need a good volume of users that you can follow month over month: the number of active users each month and what share of them is lost every month. But what if you want to select a different set of users, another segment? Producing an analytics cube with the necessary dimensions takes time. And if you end up with a segment that doesn't have high volumes every month, interpreting the results can be quite tricky.<br />
<br />
<a name='more'></a><br />
So needless to say, using churn to analyze segments or micro-segments in your user base is not very easy. What if there was a tool you could use to quickly analyze churn in any arbitrarily selected group of accounts? For this, retention is a great proxy for churn. And even better, a survival plot.<br />
<br />
A survival plot will show you what share of signups (y) will survive a certain number of days (x). Compared to the most obvious calculation <i>(out of users who have signed up more than x days ago, what share was still active at x days)</i> the Kaplan-Meier method uses all data and can therefore be more accurate.<br />
<br />
With the Kaplan-Meier method, a daily survival rate is calculated for each individual day, and the "to-date" number is the product of all the preceding days' rates. This way, even if only a fraction of users are 100 days old, they still get to contribute to the day 0, 1, 2, 3, etc. survival rates.<br />
<br />
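The same idea in a few lines of Python, just to illustrate the mechanics (the column names and toy data are made up for this sketch; the Power BI solution below uses R):<br />
<br />
<pre><code class="python">import pandas as pd

# Each row is a customer: days observed so far, and whether they churned
df = pd.DataFrame({
    "days":    [1, 3, 3, 5, 8, 8, 8, 10],
    "churned": [True, False, True, True, False, True, False, False],
})

survival = 1.0
km = []
for day in range(df["days"].max() + 1):
    at_risk = (df["days"] >= day).sum()                     # still observed on this day
    events  = ((df["days"] == day) & df["churned"]).sum()   # churned exactly on this day
    if at_risk > 0:
        survival *= 1 - events / at_risk                    # daily rate, compounded
    km.append((day, survival))

print(pd.DataFrame(km, columns=["day", "survival"]))</code></pre>
<br />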
This said, I'm sure you can already imagine how the Kaplan-Meier survival rate is calculated, but what if you have an urgent need right now to make this happen in Power BI? It's very simple. You need to have the following columns in your customer dataset:<br />
<ol>
<li>Have they churned</li>
<li>How many days they (have) survived</li>
</ol>
<div>
You <i>could</i> do a pure DAX solution <a href="https://community.powerbi.com/t5/Community-Blog/Kaplan-Meier-Survival-Curves-with-Power-BI-Part-1/ba-p/332819">as described in this post</a>, but there is a problem I was not able to solve: what if you want to use filters? See, calculated tables are evaluated at data load and they are not affected by filters. If you find a way of doing this with a table variable, let me know. What I eventually ended up using is an R script, this:</div>
<div>
<br /></div>
<div>
<pre><code class="R">
library(plyr)
library(reshape2)
library(ggplot2)
km_data <- function(this_data) {
  date_range = data.frame(seq(0, max(this_data$days), by=1))
  colnames(date_range) <- c('days')
  summary <- ddply(this_data, .(days), summarize,
                   count = length(id),
                   dns = sum(dns_int)
  )
  patients = sum(summary$count)
  merged = merge(x = date_range, y = summary, by = "days", all.x = TRUE)
  merged[is.na(merged)] <- 0
  merged$started <- patients - cumsum(merged$count) + merged$count
  merged$survival_rate <- 1 - merged$dns / merged$started
  merged$km <- cumprod(merged$survival_rate)
  merged
}
my_dataset <- data.frame(dataset$days, dataset$dns_int, dataset$id)
my_dataset = my_dataset[complete.cases(my_dataset),]
names(my_dataset) <- c("days", "dns_int", "id")
km1 <- km_data(my_dataset)
km1 = km1[,c("days", "km")]
ggplot(data=km1,
aes(x=days, y=km)) + geom_line() + scale_y_continuous(name="survival", limits=c(0,1))
</code></pre>
</div>
<div>
<br />
That's the simple solution. Things you might still want to consider implementing additionally are:<br />
<br />
<ul>
<li>Comparing to a base line</li>
<li>Confidence intervals</li>
</ul>
<div>
Let me know if you have an elegant solution to these.</div>
</div>
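For reference (not something from the original post): the Python lifelines package gives you Kaplan-Meier curves with confidence intervals out of the box. A hedged sketch, assuming the same dataset with days / dns_int columns (Power BI's Python script visual also exposes it as dataset):<br />
<br />
<pre><code class="python">from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
kmf.fit(durations=dataset["days"], event_observed=dataset["dns_int"])
print(kmf.survival_function_)     # the KM curve
print(kmf.confidence_interval_)   # 95% confidence band</code></pre>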
Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-16298348463287867732018-09-13T08:36:00.000+03:002019-03-22T21:57:19.539+02:00Create a funnel analysis tool with Redshift and Power BI in 5 minutesIf you're not collecting events from your product, get started right away!<br />
<br />
Events are a great way to collect behavioral data on how your users use your product: what paths they take, what errors they encounter, how long something takes etc. When you have events, there isn't a lot you <u>cannot</u> analyze.<br />
<br />
<a name='more'></a><br />
But once you start collecting events, you might get a little overwhelmed with analyzing them. How about creating a nice funnel analysis tool for everyone in your company?<br />
<br />
I like simple solutions. I also like self-service analytics. This tool takes 5 minutes to set up, is quite powerful and can be used by anyone in your company. The way this funnel tool works is very simple:<br />
<ul>
<li>Choose a mandatory event</li>
<li>Choose from a list of events to compare the first event to</li>
</ul>
<div>
Disclaimer: this simple tool will not consider the order of the events, but I did say it's a simple tool, didn't I?</div>
<div>
<br /></div>
<div>
We'll use Redshift with Power BI, as it allows for DirectQuery (which is especially useful when you might have millions of event records). Let's assume your events are in a table in Redshift with (at least) the following columns:</div>
<div>
<ol>
<li>A name of the event (e.g. "Logged In")</li>
<li>An owner for the event (e.g. a user identifier)</li>
</ol>
<div>
That's it. That's all you need. Sure, if you have stuff such as timestamps, you can take this further. But we don't need that for now. Let's create a couple of views:<br />
<br /></div>
</div>
<div>
<pre><code class="sql">CREATE VIEW vw_event_name_owner AS SELECT DISTINCT event_name, owner FROM event;
CREATE VIEW vw_event_owner AS SELECT DISTINCT owner FROM event;</code></pre>
<br /></div>
<div>
Be sure to pull these in with the <mark>Redshift</mark> connector, using <mark>DirectQuery</mark>.<br />
<br />
The first, <code>vw_event_name_owner</code>, we will use to select the mandatory event. We will map that to the second view, <code>vw_event_owner</code>, with a many-to-one two-way filter. We will then use <code>vw_event_owner</code> to filter our event data, leaving only those events that have been produced by owners who have also produced the event we have selected in <code>vw_event_name_owner</code>. Here's the relationship diagram.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiuxhtnmg1YKIQW6Hdmo4iGdgLiFhXHjiJ3EvR58L__pDHuxwhCiZvzoQBsEspZGEO5U9ChJ-jj0yYsozctLZaWhXokg5U9RdrkDgVUE2OwuWY56xfX2_CNY0oMdSbrPqm9I3j_G2qRjnE/s1600/event_funnel.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="557" data-original-width="1226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiuxhtnmg1YKIQW6Hdmo4iGdgLiFhXHjiJ3EvR58L__pDHuxwhCiZvzoQBsEspZGEO5U9ChJ-jj0yYsozctLZaWhXokg5U9RdrkDgVUE2OwuWY56xfX2_CNY0oMdSbrPqm9I3j_G2qRjnE/s1600/event_funnel.PNG" /></a></div>
<br />
Now, put a single-choice drop-down filter from the <code>vw_event_name_owner</code> -view and a multi-select filter from your event table. With any luck you should have something like this:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4i-LGQw3rWGWbR0TkbWyIbtv5nF_l7uxN6IBzjUYReHJ0nv8iLKHSQxlGeMWZ0dSADnkeYPS6pzMG7MZudUuDP9MMzYXvAQd36jZH5Z865kvs2DyFNO_qatQLQ__SbxizQG0ZF7gBOnA/s1600/funnel.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="808" data-original-width="1600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4i-LGQw3rWGWbR0TkbWyIbtv5nF_l7uxN6IBzjUYReHJ0nv8iLKHSQxlGeMWZ0dSADnkeYPS6pzMG7MZudUuDP9MMzYXvAQd36jZH5Z865kvs2DyFNO_qatQLQ__SbxizQG0ZF7gBOnA/s1600/funnel.PNG" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
Voilà! You can publish it for others to use on PowerBI cloud. Once PowerBI cloud is provided with Redshift credentials and assuming your enterprise gateway is set up, the report will have a live connection to Redshift.<br />
<br />
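<div>
If you ever want to sanity-check the numbers the report shows, the same logic is easy to express outside Power BI. A rough Python sketch (the connection details are placeholders, and the <code>event</code> table and its columns are the ones described above):</div>
<pre><code class="python"># Rough sanity check of what the funnel report computes.
# Connection details below are placeholders; adjust to your own Redshift cluster.
import pandas as pd
import psycopg2

conn = psycopg2.connect(host="your-cluster.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="user", password="password")

mandatory_event = "Logged In"  # the event you picked in the drop-down

query = """
    SELECT e.event_name, COUNT(DISTINCT e.owner) AS owners
    FROM event e
    WHERE e.owner IN (SELECT owner FROM event WHERE event_name = %s)
    GROUP BY e.event_name
    ORDER BY owners DESC;
"""

funnel = pd.read_sql(query, conn, params=(mandatory_event,))
print(funnel)</code></pre>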
<br /></div>
Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-43216640260978585462018-09-06T18:33:00.001+03:002019-04-27T16:16:30.808+03:00How to ETL in Amazon AWS? AWS Glue for dummies<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6lxOqUcMI3-xTfWdRl-1iA7SDSeyw3Vy2g7tcsuLVgqpATTZ-FDhnqZFx6k6VXWMGRbPsZOlexAE9pvugbfSZf3ES_DM390FhP7YyaM8w7dBr1k0hI9sXaOD85tX6Uu3FZ8o8XbyZtZA/s1600/glue.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="383" data-original-width="830" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6lxOqUcMI3-xTfWdRl-1iA7SDSeyw3Vy2g7tcsuLVgqpATTZ-FDhnqZFx6k6VXWMGRbPsZOlexAE9pvugbfSZf3ES_DM390FhP7YyaM8w7dBr1k0hI9sXaOD85tX6Uu3FZ8o8XbyZtZA/s1600/glue.PNG" /></a></div>
<br />
You can do ETL in AWS in a few different ways:<br />
<br />
<ol>
<li>Glue</li>
<li>DataPipeline</li>
<li>A custom solution, e.g. a Docker</li>
</ol>
<div>
<br />
<a name='more'></a>I've used a custom solution for a while, but recently decided to move to Glue, gradually. Why? Because when it is set up, you have <b>so much less</b> to worry about. Glue is the preferred choice when you need to move data around. If you're unsure what route to take, <b>stick to Glue</b>. If you find it doesn't fit your needs well, only then look elsewhere.</div>
<div>
<br /></div>
<div>
The building blocks you need to understand</div>
<div>
<ol>
<li>Glue Data Catalog</li>
<li>Crawlers</li>
<li>Jobs</li>
</ol>
<div>
<br /></div>
<h3>
Glue Data Catalog</h3>
</div>
<div>
<br />
The <b>tables</b> in your Glue Data Catalog hold the metadata for your data (where it is stored, what format it is in). Tables are grouped by <b>databases</b>, which is simply a way to group your tables. Lastly, <b>connections</b> describe how Glue can connect to your data, e.g. through a JDBC connection.<br />
<br />
You need the tables in your data catalog in order to run ETL jobs.<br />
<br />
<h3>
Crawlers</h3>
</div>
<div>
<b><br /></b></div>
<div>
<b>Crawlers</b> help you catalog your data into the Data Catalog. Your data needs to be in S3 or DynamoDB, or you need to have specified a JDBC connection for it. A crawler will have a look at your data and generate the <b>tables</b> in your Data Catalog - interpreting the schema from the data.</div>
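<div>
You don't have to click crawlers together in the console either; they can be created and started from code. A minimal boto3 sketch (the crawler name, role, database and S3 path below are placeholders, not something Glue provides):</div>
<pre><code class="python">import boto3

glue = boto3.client('glue', region_name='eu-central-1')

# Create a crawler that scans an S3 path and writes tables into a catalog database.
glue.create_crawler(
    Name='events-crawler',
    Role='MyGlueServiceRole',
    DatabaseName='analytics',
    Targets={'S3Targets': [{'Path': 's3://my-bucket/events/'}]}
)

# Run it; the crawler infers the schema and creates or updates the table.
glue.start_crawler(Name='events-crawler')</code></pre>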
<div>
<br /></div>
<h3>
Jobs</h3>
<div>
<b><br /></b></div>
<div>
<b>Jobs</b> do the ETL work and they are essentially python or scala scripts. When using the wizard for creating a Glue job, the source needs to be a table in your Data Catalog. You can load the output to another table in your data catalog, or you can choose a connection and tell Glue to create/update any tables it may find in the target data store. You can schedule jobs with <b>triggers</b>.</div>
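<div>
Jobs can also be kicked off and monitored from code, which is handy if triggers don't fit your scheduling needs. A minimal boto3 sketch (the job name is a placeholder):</div>
<pre><code class="python">import boto3

glue = boto3.client('glue', region_name='eu-central-1')

# Start a run of an existing job; 'my-etl-job' is a placeholder name.
run = glue.start_job_run(JobName='my-etl-job')

# Check how the run is doing (RUNNING, SUCCEEDED, FAILED, ...).
status = glue.get_job_run(JobName='my-etl-job', RunId=run['JobRunId'])
print(status['JobRun']['JobRunState'])</code></pre>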
<div>
<br /></div>
<h2>
Advanced stuff</h2>
<div>
<br /></div>
<h3>
A role for Glue</h3>
<div>
<br /></div>
<div>
You'll obviously need a role for Glue. The policies I have added are <b>AmazonS3FullAccess</b>, <b>AWSGlueServiceRole</b>, <b>AWSGlueConsoleFullAccess</b> and <b>AmazonAthenaFullAccess</b>. If you are a stickler for security, I'm sure you can narrow down the rights.</div>
<div>
<br /></div>
<h3>
Script</h3>
<div>
<br /></div>
<div>
When you create a job with the wizard, it creates a bunch of code for you. Here's what it looks like in python. It includes annotations (lines starting with ##). These are only necessary for drawing the diagram on the left. </div>
<div>
<br /></div>
<pre><code class="python">import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "analytics", table_name = "events", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "analytics", table_name = "events", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("category", "string", "category", "string"), ("timestamp", "bigint", "timestamp", "bigint"), ("environment", "string", "environment", "string"), ("source", "string", "source", "string"), ("owner", "string", "owner", "string"), ("id", "string", "id", "string"), ("sessionid", "string", "sessionid", "string"), ("payload", "string", "payload", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("category", "string", "category", "string"), ("timestamp", "string", "timestamp", "bigint"), ("environment", "string", "environment", "string"), ("source", "string", "source", "string"), ("owner", "string", "owner", "string"), ("id", "string", "id", "string"), ("sessionid", "string", "sessionid", "string"), ("payload", "string", "payload", "string")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_cols", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [catalog_connection = "Analytics DB Testprod", connection_options = {"dbtable": "events", "database": "reporting"}, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "Analytics data warehouse", connection_options = {"dbtable": "event", "database": "analytics"}, transformation_ctx = "datasink4")
job.commit()</code>
</pre>
<div>
<br /></div>
<div>
<br />
The most interesting lines start from <code>datasource0 =</code>. The default code works like this:<br />
<br />
<ul>
<li><a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog">glueContext.create_dynamic_frame.from_catalog</a> extracts data from a data catalog table</li>
<li><a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-ApplyMapping.html">ApplyMapping</a> maps the source columns to output columns. More info on <a href="https://www.ilkkapeltola.fi/2019/04/aws-glue-python-applymapping.html" target="_blank">my post about ApplyMapping</a>.</li>
<li><a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-ResolveChoice.html">ResolveChoice</a> is used to instruct Glue what it should do in certain ambiguous situations</li>
<li><a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-DropNullFields.html">DropNullFields</a> drops records that only have null values</li>
<li><a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_jdbc_conf">glueContext.write_dynamic_frame.from_jdbc_conf</a> takes a JDBC connection I've specified along with some other parameters and writes the data frame to its destination.</li>
</ul>
</div>
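<div>
One thing the generated script won't write for you is custom transformation logic. When the built-in transforms aren't enough, you can hop between a DynamicFrame and a plain Spark DataFrame in the middle of the job. A minimal sketch, picking up the <code>datasource0</code> frame from the script above (the filter is just an example):</div>
<pre><code class="python">from awsglue.dynamicframe import DynamicFrame

# Convert the Glue DynamicFrame to a regular Spark DataFrame...
df = datasource0.toDF()

# ...apply any Spark transformation you like...
df_filtered = df.filter(df["environment"] == "production")

# ...and convert back so the rest of the generated script keeps working.
filtered_dynamic = DynamicFrame.fromDF(df_filtered, glueContext, "filtered_dynamic")</code></pre>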
<div>
So fairly simple stuff. Except when you run into problems.<br />
<div>
<br /></div>
<h3>
Developing the ETL scripts</h3>
<br /></div>
<div>
What you'll quickly find is that you need to further develop the ETL script. After modifying the ETL script you launch it, only to realize it takes a good few minutes for it to start, and a few more to run. Iterating the script gets tedious, but luckily you can use Glue's Dev Endpoints.<br />
<br />
<a href="https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint.html">Dev Endpoints</a> are reserved instances that you can use to iterate ETL scripts faster. Setting these up is a whole other story though and deserves a blog post of its own.</div>
Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-10982692378819292822018-08-27T21:01:00.002+03:002018-09-12T20:56:20.801+03:00An insights strategy for winning companies<h3>
An executive summary</h3>
<br />
Companies struggle to gain maximum benefit from analytics and insights since<br />
<br />
<ol>
<li>Analytics is seen as a support function, not a business partner and therefore is not prioritized high enough</li>
<li>Analytics is separated from business processes and insights are produced away from execution</li>
<li>The analytics stack is under-resourced and inflexible, which doesn't enable the speed to react to changing needs</li>
</ol>
<a name='more'></a>Conversely, in high-performing organizations, <mark>analytics and insights is as much a business function as the rest</mark>. This ensures analytics and insights gets prioritized with the same focus as the rest of the functions, rather than being secondary. Analytics and insights is a fully-fledged business partner.<br />
<br />
Business processes are imbued with analytics and insights right in the core. This is achieved by forming systems of insight: <mark>teams that have full insights-to-execution ownership</mark> and are manned accordingly. These teams are independent of a centralized analytics and insights department, which means there is little friction when they turn insights into optimal experiences.<br />
<br />
Winning organizations invest in a flexible analytics stack for the long term: one that isn't built to solve just one problem well, but rather to adapt quickly to unforeseen business needs. Purposefully building <mark>speed and flexibility into the stack</mark> ensures long-term competitiveness, as needs change all the time.<br />
<br />
<div>
<h3>
Imagine this</h3>
<div>
<br /></div>
<div>
A new user registers to your product and based on the information they provide you are <mark>immediately able to describe them in a meaningful manner</mark>. Based on where they are from, how they behave and their public social media profiles you are able to build heuristic multi-profiles that are extremely relevant to your business. These multi-profiles fuel your on-boarding and recommendation engine, individually tailoring messages to your new signups that optimize their experience as well as the likelihood of them converting to paying customers.</div>
<div>
<br /></div>
<div>
You are prioritizing your roadmap and deciding the order of projects A, B and C. As you understand thoroughly how these projects affect your customer experience, you know the gaps in your offering and the customer preferences. As you have analyzed the <mark>causal effect</mark> the improvements in different experiences have on different conversion metrics, you are able to make highly accurate predictions of the impact the different projects will have on your overall business. As the <mark>whole company is aligned</mark> behind the same KPIs, there is very little discussion about what the priorities should be.</div>
<div>
<br /></div>
<div>
A question pops into an employee’s mind. Unfortunately the Data Science team is very booked for the next couple of days, so there would be a few days’ lead time to get this information through them. Luckily, since your <mark>self-help tools</mark> are so powerful and everyone has access to them, the employee is fully capable of answering their own question quickly and efficiently. </div>
<div>
<br /></div>
<h3>
Analytics is no longer a support function</h3>
<div>
<br /></div>
<div>
It might have been that in the past analytics belonged to the IT department and was irregularly summoned into existence to answer specific business questions. This kind of setup puts data and business in a high-lag dialogue, not unlike talking to a space station orbiting Mars. Nowadays data analysis is a key differentiator for companies. Compared to computers, humans are undeniably better at innovating and coming up with strategies, and perhaps fundamentally for that reason data analysis will not (ever?) be the leading function in a company. Among the winning companies though, data analysis is an equal business partner at the leadership table, integrally embedded in everything through people, processes and technology. </div>
</div>
<div>
<br /></div>
<blockquote class="tr_bq">
Winning Insights & Analytics groups are evolving from the traditional support function role to one of a pro-active team player, providing scenario planning and recommendations, and ultimately being regarded by the rest of the business as a fully integrated business partner that has a seat at the leadership table, driving strategy and real-time execution together with Marketing, IT and Finance colleagues.
</blockquote>
<div>
<br /></div>
<div>
<div>
This is a quote from a <a href="https://www.ama.org/publications/MarketingNews/Pages/how-companies-win-using-insights-analytics.aspx">comprehensive survey study</a> conducted by Millward Brown Vermeer (2016, p.5), in which they describe three categories that separate the over-performers from the under-performers: <mark>customer obsession</mark>, designing <mark>total experiences</mark> and something they call the <mark>‘Insight Engine’</mark>.</div>
<div>
<br /></div>
<div>
Customer obsession could be characterized by insisting on taking the voice of the customer into account at every business decision or having performance KPIs that measure the customer’s success rather than the company’s. Designing total experiences essentially means designing consistent purpose-led solutions and pursuing that relentlessly, backed up with data.</div>
<div>
<br /></div>
<div>
The last point, an insight engine, is the equal business partner component. Compared to under-performers, over-performers are four times more likely to state that insight and analytics are leading the business.</div>
</div>
<div>
<br /></div>
<div>
<h3>
The insights transformation happens in all functions</h3>
<div>
<br /></div>
<div>
In a successful insights transformation, all individuals in a company get empowered through insights. In the old setup, data and business were in an infrequent dialogue, the business submitting requirements to data and data trying its best to reply with answers after a delay. Similar to the findings in the Millward Brown Vermeer study, Forrester has concluded that to win, insights needs to be tightly embedded into everything (Figure 1). Forrester’s study found that the over-performers had multiple systems of insight embedded into the processes. These systems (teams) are able to use the right data from all possible data, analyze it and draw effective conclusions.<br />
<br />
And this needs to happen everywhere.</div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiW4yJI0Rs6shxSVKoohUwWAkYtl3C9TiVfqa_ZVqSEB69Afq1zfibqbqjx1mQnm5ho1VppX2wFxiRDPu6zmDd5U11hgfnQPQ8jgx9q_e4-gHPpA1y0WyP7XLloO40YoLs1oY7xVy4xYdc/s1600/insights-to-analytics-process+%25281%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="503" data-original-width="1100" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiW4yJI0Rs6shxSVKoohUwWAkYtl3C9TiVfqa_ZVqSEB69Afq1zfibqbqjx1mQnm5ho1VppX2wFxiRDPu6zmDd5U11hgfnQPQ8jgx9q_e4-gHPpA1y0WyP7XLloO40YoLs1oY7xVy4xYdc/s1600/insights-to-analytics-process+%25281%2529.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
</div>
<div>
<div style="text-align: center;">
<i>(Figure 1. Systems of insight, adapted from Forrester 2015)</i></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
One crucial implication of this is that being able to draw basic day-to-day insights from data is <mark>less dependent on a centralized analytics team</mark>. The systems need to work without a tin can telephone to head office. More accessible self-help analysis tools will support that, but above all it is important to <mark>identify the critical insights-hungry processes</mark>, ensure they have <mark>insights people embedded</mark> and purposefully work on improving data intuition.</div>
<div>
<br /></div>
<div>
The end result is an organization of multi-disciplinary teams, formed around critical processes, with <mark>ownership of the whole insights-to-execution flow</mark>. Those who manage to do this will reduce friction of insights dramatically and enable the insights to flow into actions at the operative level. Suddenly, almost magically, all the right decisions will start to be made in the teams and the role left to a central team will be to provide tactical support and improve the platform.</div>
<div>
<br /></div>
<div>
Systems of insight have full-stack ownership of data, from generation to visualization - and all the steps between. In order to achieve this while also executing on the findings, the multi-disciplinary teams will need a wide range of skills: business, technology and analytics.</div>
<div>
<br /></div>
<div>
The insights team should not be dependent on your development department, but rather fused with it. Rather than having separate teams for insights and dev, <mark>there is only one around a single topic</mark>. Having all the necessary skills in a single team enables them to move significantly faster. Deloitte calls these <mark>‘purple teams’</mark>: teams which have combined business and technical talent to deliver analytics edge (<a href="https://www2.deloitte.com/content/dam/Deloitte/ca/Documents/deloitte-analytics/ca-EN-Building-Analytics-Capabilities-AODA.pdf">"The key to delivering analytics advantage: Your people"</a> ,Deloitte, 2017, p. 21).</div>
<div>
<br /></div>
<div>
<br /></div>
<blockquote class="tr_bq">
“If you say you’re data-driven, but everything has to go through an analyst, you’re not actually data-driven”</blockquote>
<div style="text-align: right;">
<span style="font-size: x-small;">-Fareed Mosavat, Group Product Manager at Slack</span></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Speed is key, and a centralized analytics team creates bottlenecks. Not everyone needs to be a data scientist, but the organizations who win will have made data and analysis accessible for everyone. This holds true especially in smaller teams.<br />
<br />
In the <a href="http://theleanstartup.com/principles">Lean Startup methodology</a>, one interesting idea is to have <mark>‘Validated’</mark> as a step in the ‘definition of done’ (The Lean Startup, p.138). The point is to ensure all stories that are done with the intention to improve the product also have a <mark>mandatory validation step</mark> before they can be signed off. This fits extremely well with the thinking of systems of insight and cross-functional teams that are responsible for the full insights-to-execution process.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoYMqoFO-uLkhI0apqzybL7_b6QkXVoZSGCEpNf1BGN2bPKyijTQ1R8DuK8K1Q_f9TN7llGHw4XgKtO2xflvycyl4GtH91qgyFm85pIt_BUtpYKxCRsNUA4hgk5r6KAS-Qji1nAy7VxZY/s1600/the+lean+startup.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="491" data-original-width="867" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoYMqoFO-uLkhI0apqzybL7_b6QkXVoZSGCEpNf1BGN2bPKyijTQ1R8DuK8K1Q_f9TN7llGHw4XgKtO2xflvycyl4GtH91qgyFm85pIt_BUtpYKxCRsNUA4hgk5r6KAS-Qji1nAy7VxZY/s1600/the+lean+startup.png" /></a></div>
<div style="text-align: center;">
<i>(Figure 2: The Lean Startup methodology, adapted from theleanstartup.com)</i></div>
<br />
It is all about continuous learning: you act with an assumption, test it, learn and form new assumptions. If you are not measuring the impact of your work, you are not learning, but hoping. On page 126 in the book, Eric Ries gives an example about a team that is not producing results. In the example, the management concludes that the team was not working hard enough, although the problem was that the process the development department was expected to follow did not enable learning to take place.</div>
</div>
<div>
<br /></div>
<div>
<h3>
An insights platform for long-term success</h3>
<div>
<br /></div>
<div>
Although data is a critical component, surprisingly the battle isn’t about who has the best data. In an interview, Lori Sherer (<a href="http://www.bain.com/publications/articles/lori-sherer-data-strategy-video.aspx">Bain & Company, 2017</a>) points out that more data isn’t always better and goes on to say that “a simpler solution that is designed with the end users in mind is going to win out 100% of the time”. Her point is that while data is powerful and access to it can give an edge over a competitor, it ultimately comes down to your ways of working. <mark>Approaching customers’ needs in an analytical way</mark> is the winning component, not having the best data to support it. This doesn’t mean you shouldn’t go after good data. It means you are not necessarily doomed if the data you have access to is inferior to your competition, as long as your ways of working are more customer and goal oriented.</div>
<div>
<br /></div>
<div>
We engineers often think of the technical solution first, but the insights stack should be focused on only once the ways-of-working are in place. There's plenty of technical work when building an insights platform that has the <mark>speed and flexibility</mark> to support all business needs. Having those needs in mind will ensure that the right thing is built. A very brief generalization is depicted in Figure 3.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMfr-rz9dqsK5cDnJN49iKN2OAHvC8a_c-2UUuA8wWAD-_Faw9OwsPbOACVsQHYbzVqk0uftHZrj8PvRhznpCkn1YU-6YiU92Fc5KEJCw16UeMlVmOrRp0SqKtxBxKGtFtVCwoR0ldSS8/s1600/Insights+Stack.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="262" data-original-width="622" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMfr-rz9dqsK5cDnJN49iKN2OAHvC8a_c-2UUuA8wWAD-_Faw9OwsPbOACVsQHYbzVqk0uftHZrj8PvRhznpCkn1YU-6YiU92Fc5KEJCw16UeMlVmOrRp0SqKtxBxKGtFtVCwoR0ldSS8/s1600/Insights+Stack.png" /></a></div>
<br /></div>
<div style="text-align: center;">
<i>(Figure 3: A generic analytics stack with some example AWS technologies)</i></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
While developing machine-learning algorithms can seem like the most important task to produce value, the <mark>foundations and enablers to do so are actually laid out much earlier</mark>. Storing all data in its raw format, creating convenient pre-processing flows and ensuring data quality stays good likely have a higher impact on the end result than endlessly polishing a neural network.</div>
<div>
<br /></div>
<div>
First party data - the kind that you create and no-one else has - is what most likely makes you different from the rest. If you have access to second or third party data, so will others. Enriching your first party data can be important, as it might enable you to find insights otherwise invisible to you. Think of how different industries use the product differently and how it might be difficult to see that, unless you have access to that meta-information.</div>
<div>
In practice, an insights stack that will serve you for the long term:</div>
<ul>
<li>Has access to all (relatively) latest data through a robust ingestion and storing process</li>
<li>Enables joining all data in any format effortlessly, even if the data is imperfect</li>
<li>Allows changes in schema easily through e.g. virtualization</li>
<li>Exposes analytics with production-ready interfaces</li>
<li>Respects users’ privacy while not compromising quality</li>
</ul>
</div>
<div>
The most interesting question to start with is: "How will the insights stack support my business?" For example, if your end goal is to plug in insights components to your SaaS product for your customers to see, and you anticipate these components will need to do heavy computing on massive volumes of real-time data, then the stack you will need to build is very different from one where you only do back-office analysis and know you will do just fine with a data lag of one day.</div>
<div>
<br />
<h3>
Final words</h3>
</div>
<div>
<br /></div>
<div>
Companies can fail to reap the benefits of analytics, since</div>
<div>
<ol>
<li>Analytics is a support function, not a function that sits at the leadership table.</li>
<li>Analytics is separated from business processes, when it should be tightly embedded in them.</li>
<li>The analytics stack is under-resourced and inflexible, unable to meet changing needs.</li>
</ol>
<div>
To fix this, organizations need to acknowledge that <mark>data analytics needs to be involved in strategic decision making</mark>. If it isn't, your organization will always lag behind and the investments you do make in analytics will yield much lower return.</div>
</div>
<div>
<br /></div>
<div>
Consider embedding analytics responsibility into the business processes: from insight to execution. Rather than being forced into an inefficient dialogue between business and analytics, teams themselves will have everything they need to reach their purpose. Data scientists and analysts should be members of the business teams, <mark>enabling the team to be both accountable and empowered</mark> to use tight feedback loops to continuously improve.</div>
<div>
<br /></div>
<div>
Lastly, make sure you invest in your analytics stack adequately, from the perspective of your needs. Don't build something that will be unused, but rather, together with your CIO, <mark>create a purposeful insights strategy that is flexible and resourced</mark>. This infrastructure should be there to serve the ever-changing needs of your business processes, so make sure it is flexible enough.</div>
Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-50534904307340478952018-06-04T09:51:00.000+03:002018-06-04T09:51:30.262+03:00A more holistic purpose of analyticsWhy do people make bad decisions? Mostly because of insufficient or wrong information, but not always. So what should be done about it?<br />
<a name='more'></a><br />
People make decisions <i>all the time</i>. In fact, <a href="https://go.roberts.edu/leadingedge/the-great-choices-of-strategic-leaders">some studies suggest we make a whopping 35,000 decisions <i>every day</i></a>. Other studies suggest that it's much less, <a href="https://www.psychologytoday.com/us/blog/strategic-thinking/201208/what-is-your-momentum-factor">perhaps 70 conscious decisions per day</a>. Nevertheless, we make decisions all the time, even without noticing. And the same applies to work.<br />
<br />
At work though, the decisions aren't about which shirt to wear or which flavor of ice-cream to buy. At work we are expected to make the <i>right</i> decisions. And how does that happen? By having the right input and arriving at the right conclusion. A key ingredient of course is information.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlF99yuiRvOWqS-_VKyVTtS_Pyc_3EAf_nMCX29j1y3eQPngV2CM1X15k6hysGdNXgOmEAw5iH7JHs7jvowl5hRX5-aBDjvTcIf32fRPZ_JAnOlp6vg0NMZrLi9Mpp4aV8wQvtVj61z0o/s1600/Serious-and-hard-decisions.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="996" data-original-width="1199" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlF99yuiRvOWqS-_VKyVTtS_Pyc_3EAf_nMCX29j1y3eQPngV2CM1X15k6hysGdNXgOmEAw5iH7JHs7jvowl5hRX5-aBDjvTcIf32fRPZ_JAnOlp6vg0NMZrLi9Mpp4aV8wQvtVj61z0o/s1600/Serious-and-hard-decisions.png" /></a></div>
<br />
<br />
What do you think should be the purpose of an analytics function/team/department?<br />
<br />
It's a tough question. Many senior leaders will struggle with <mark>purpose</mark>. Being responsible for data engineering, warehousing or analytics projects isn't a purpose, nor is providing data for decision making. Or even information. These things are not a purpose. They could still be important activities or responsibilities.<br />
<blockquote class="tr_bq">
The purpose of analytics is to ensure the right decisions are made.
<br />
<div class="quoteauthor">
- me</div>
</blockquote>
If you ask me what I do, I gravitate very quickly to ensuring we can make right - or at least <i>better</i> - decisions. If you are in analytics and this isn't your purpose, I would very much like to hear what is.<br />
<br />
Yet, what often happens is that people confuse their job description with their purpose, leading them to spend time on tasks they agreed on with their manager six months ago, not on the things that improve decision making.<br />
<br />
If the ETL pipe you are building is not helping people to make better decisions, why are you building it? Or the neural network, or the dashboard. What is the purpose of spending your precious time on them?<br />
If you <i>can</i> do something that will lead to better/right decisions, isn't it your purpose to do it? What if it is sharing your skills with non-analytics people? What if it is facilitating culture change?<br />
<br />
<div>
Think about the following:</div>
<div>
<ul>
<li>Are people in the habit of making truth-seeking decisions?</li>
<li>What kind of decisions need to be made?</li>
<li>What would help make better decisions? (Tools, information, skills)</li>
<li>What prevents them? Where's the bottleneck?</li>
</ul>
</div>
So why am I writing about this? It is because there are probably some less obvious things that you can do to help make better decisions. Some of these things will shift control away from you to the colleagues around you, but it is completely aligned with your purpose. So if it is analytics you do, I urge you to think a little outside the box about what actions can help your org make better decisions.<br />
<br />
Some ideas:<br />
<ul>
<li>Create a vision that motivates people to start making decisions.</li>
<li>Have workshops for anyone interested about data driven decision making.</li>
<li>Facilitate culture change towards data-driven decision making.</li>
<li>Share success stories inside your company to encourage others to seek your services.</li>
</ul>
<div>
Sometimes the best way to fulfill your purpose is not to crunch more numbers.</div>
<div>
</div>
Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-4923644224596998412018-05-28T13:50:00.000+03:002018-09-20T14:40:50.814+03:00Simple Big Data setup on Amazon AWSEveryone wants to do some big data stuff, right? In all honesty, no-one cares if your data is big or small - size doesn't matter. What matters is your ability to take any size of data and generate understanding from it. At some point the data you are gathering might become inconvenient to process with more traditional tools. It <i>might</i> be that some big data tools might help you - or not. The bottom line is, it is a tool you want to have in your toolbox.<br />
<a name='more'></a><br />
So you want to do some big data stuff. You've got two options<br />
<ol>
<li>Run stuff locally, or</li>
<li>Run stuff in the cloud</li>
</ol>
<div>
If you don't have a cloud, then surely you can just go with option #1. For that the <b>Hortonworks Sandbox</b> in a Docker seems like a great place to start (<a href="https://hortonworks.com/tutorial/sandbox-deployment-and-install-guide/section/3/">https://hortonworks.com/tutorial/sandbox-deployment-and-install-guide/section/3/</a>). But I would not recommend that. It's a hassle and there are very few benefits, except that you only pay for your own electricity.<br />
<br />
If you don't have a cloud, just get an AWS account and start doing things the way they will eventually be done anyway. If you already have data in e.g. an <b>AWS EC2</b> instance, then I've got just the right thing for you.</div>
<div>
<br /></div>
<div>
Oftentimes one of the biggest obstacles keeping others in your company from doing simple data analysis themselves is that <b>there's a lot of overhead</b>. In its simplest form <b>big data is just SQL</b>, like in the case of <a href="https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-SELECTSandFILTERS">Apache Hive</a>. Now if big data is just SQL, why aren't more people doing it? It's because of the <b>complicated overhead</b>. Do you know:</div>
<div>
<ol>
<li>...which applications you need?</li>
<li>...how to launch and destroy clusters?</li>
<li>...how to write and debug a job?</li>
<li>...how to deploy jobs to the cluster?</li>
<li>...how to check the results?</li>
</ol>
<div>
If your job script is perfect, then you only need to figure out once how to harness an EMR cluster to run your script and terminate once it's done. Usually, however, you <b>need to iterate</b> your jobs quite a bit before they're just right. If developing your job includes a lot of overhead, chances are <b>you won't do it</b>. If instead you can iterate very efficiently and setting up tools is a breeze, chances are <b>other people will do it too</b>.</div>
</div>
<div>
<br /></div>
<div>
While running a local sandbox is fine, it doesn't solve the problem of <b>data availability</b>. If your big data is in your EC2 instance, how easy is it for you to access your data there, or make copies of it for your sandbox? Instead, it's a lot easier to launch your cluster in AWS, right next to your data. You want to <b>explore data</b>, too!</div>
<div>
<br /></div>
<div>
For example, I have some JSON data in an S3 bucket. Ideally, I would like to have an <b>interactive browser editor for writing SQL-like queries</b>, which would get run as "Big Data -thingies" against my S3. I would like to <b>see results in my browser</b>, like I do with my SQL desktop clients, or be able to <b>store results in S3</b>.</div>
<div>
<br /></div>
<div>
If you read to the end, you'll know how <b>you can do this</b>, easily.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRpB07SeBoQq-DQe-U7w3gNKeB3z6932qs3C7ctZVdozN6jLaOHTkhDrmycLBswAPeVIOel-u05WjzkQmDb7TtoKJyXbUNpedIHaJOUhR-fqFn38L9RTN69mZ5v3xIyl-apcC_xspdMWw/s1600/hue.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="883" data-original-width="1241" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRpB07SeBoQq-DQe-U7w3gNKeB3z6932qs3C7ctZVdozN6jLaOHTkhDrmycLBswAPeVIOel-u05WjzkQmDb7TtoKJyXbUNpedIHaJOUhR-fqFn38L9RTN69mZ5v3xIyl-apcC_xspdMWw/s1600/hue.PNG" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<h2>
Hue, an open source Analytics Workbench for self service BI.</h2>
</div>
<div>
<br />
<a href="https://console.aws.amazon.com/elasticmapreduce/home">Launching an EMR cluster on AWS is super easy with the UI</a>. When you launch an EMR cluster on AWS, you can include <a href="http://gethue.com/">Hue</a>, which is <b>excellent.</b> Hue is a browser-based editor that lets you run all sorts of scripts on your cluster and instantly see the results. Did I say it was <b>excellent</b>?</div>
<div>
<br /></div>
<div>
The <i>only</i> problem with Hue on EMR is that... there's no open internet access for EMR clusters. Your cluster's master node has a hostname something along the lines of <b>ec2-###-##-##-###.us-west-2.compute.amazonaws.com</b> but you will not reach that hostname without some tricks.</div>
<div>
<br /></div>
<div>
You need two tricks:</div>
<div>
<ol>
<li>An SSH tunnel to your EC2 instance, and</li>
<li>A SOCKS proxy for your browser.</li>
</ol>
</div>
<div>
The first creates the information tunnel and the second instructs your browser to use it for specific traffic. To create an SSH tunnel, you will need a private key file. This is a <b>.ppk</b> file on a Windows platform (PuTTY Private Key) or a <b>.pem</b> file on other platforms (used by <code>ssh</code>).<br />
<br />
For this to work, you also need to have set up <code>aws cli</code>. Additionally, I am assuming you are using profiles to assume IAM roles with <code>aws cli</code>. If not, then, well, you'll know not to set the profile when calling <code>boto3.Session</code>.<br />
<br />
A list of prerequisites:<br />
<ol>
<li><code>aws cli</code> configured with a profile called <code>sandbox</code></li>
<li><code>python 3</code> with <code>boto3</code> and <code>configparser</code></li>
<li><code>plink</code> on Windows (comes with Putty), <code>ssh</code> otherwise</li>
<li>An EC2 private key (create <a href="https://console.aws.amazon.com/ec2/">from the EC2 console</a>. On windows you'll need to <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html#putty-private-key">convert to .ppk</a>.)</li>
<li>Chrome browser (If you know how to <a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-proxy.html">configure the SOCKS proxy</a>, any browser will do)</li>
</ol>
</div>
<div>
You can <a href="https://gist.github.com/ilkkapeltola/d7f0f4e400134455921fefb7dffba90e">get the code from the git gist</a>, place it in the same directory with the private key file and run it. Here're some parts of the code. Let's define some variables first. If you are using the Git gist, it will do a lot of this through a config file.<br />
<br /></div>
<pre><code class="python">import tempfile
chromepath = 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe'
key_extension = '.ppk' if platform.system() == 'Windows' else '.pem'
keyname = 'your_key_name' # expects the .pem or .ppk file to be same name
keyfile = keyname + key_extension
profile = 'sandbox'
region = 'eu-central-1'
instance_type = 'm4.large'
subnet = "" # you can specify a subnet. I choose this randomly later.
release_label = 'emr-5.13.0'
instances = 1
apps = [{'Name': 'HIVE'}, {'Name': 'Hue'}]
profiledir = tempfile.gettempdir() + "\\" + "chrome_emr_socks_session"
cluster_name = 'your cluster name'</code></pre>
<div>
<br />
You might want to choose a subnet, but if you don't, a random one from your VPC will be selected.<br />
<br /></div>
<pre><code class="python">import boto3
session = boto3.Session(profile_name = profile)
ec2 = session.resource('ec2')
first_vpc = list(ec2.vpcs.all())[0]
subnets = list(first_vpc.subnets.all())
subnet = random.choice(subnets).id</code></pre>
<div>
<br />
The following launches the cluster. You'll need to have suitable roles for the EMR_EC2_DefaultRole and EMR_DefaultRole. Mine have the <code>AmazonElasticMapReduceforEC2Role</code> policy with trust relationship to EC2 and <code>AmazonElasticMapReduceRole</code> with trust relationship to elasticmapreduce, respectively.<br />
<br /></div>
<pre><code class="python">client = session.client('emr', region_name = region)
response = client.run_job_flow(
Name=cluster_name,
ReleaseLabel=release_label,
Instances={
'MasterInstanceType': instance_type,
'SlaveInstanceType': instance_type,
'InstanceCount': instances,
'KeepJobFlowAliveWhenNoSteps': True,
'TerminationProtected': False,
'Ec2SubnetId': subnet,
'Ec2KeyName': keyname,
},
Applications=apps,
VisibleToAllUsers=True,
JobFlowRole='EMR_EC2_DefaultRole',
ServiceRole='EMR_DefaultRole'
)
# better store the job flow id
job_flow_id = response['JobFlowId']
</code></pre>
<br />
If you want to check how it's doing, have a look at the <a href="https://console.aws.amazon.com/elasticmapreduce/home">EMR console</a>, or with a command<br />
<br />
<pre><code class="python">cluster_info = client.describe_cluster(ClusterId = job_flow_id)
cluster_info</code></pre>
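<div>
If you'd rather not keep calling <code>describe_cluster</code> yourself, boto3 also ships an EMR waiter that simply blocks until the cluster is up. A minimal sketch of that alternative, using the same <code>client</code> and <code>job_flow_id</code> as above:</div>
<pre><code class="python"># Alternative to polling manually: block until the cluster is running/waiting.
waiter = client.get_waiter('cluster_running')
waiter.wait(ClusterId=job_flow_id)</code></pre>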
<div>
<br />
Once the cluster is set up, you'll see it as waiting and it has a public master DNS name.<br />
<br /></div>
<pre><code class="python">master_public_dns_name = cluster_info['Cluster']['MasterPublicDnsName']</code></pre>
<div>
<br />
We'll use that to create an SSH tunnel to it.<br />
<br /></div>
<pre><code class="python"># This opens the SSH tunnel on Windows
if platform.system() == 'Windows':
process_id = sp.Popen(['cmd.exe', '/c', 'echo', 'y', '|'
, 'plink.exe', '-i', keyfile, '-N', '-D', '8157'
, 'hadoop@' + master_public_dns_name]
, shell=True)
# I haven't tested this, but on Linux / Mac this should work.
else:
process_id = sp.Popen(['ssh', '-o', "'StrictHostKeyChecking no'"
, '-i', keyfile, '-N', '-D', '8157'
, 'hadoop@' + master_public_dns_name]
, shell=True)
# Just ensure your keyfile points to the .pem or the .ppk file</code></pre>
<div>
<br />
One last thing before you can connect: the SOCKS proxy. With Chrome this is pretty easy.
<br />
<br /></div>
<pre><code class="python">command = [chromepath
, '--proxy-server=socks5://127.0.0.1:8157'
, '--user-data-dir=' + profiledir
, 'http://' + master_public_dns_name + ':8888'
]
sp.call(command, stdout=sp.PIPE,stderr=sp.PIPE)</code></pre>
<div>
<br />
Nice! You can now create credentials for your Hue and start scripting! When you're done, the Python script will destroy the cluster for you, like so:
<br />
<br /></div>
<pre><code class="python"># Stop the SSH tunnel
process_id.kill()
# Terminate the cluster
response = client.terminate_job_flows( JobFlowIds=[job_flow_id] )</code></pre>
<div>
All of the above is in a neat gist, <a href="https://gist.github.com/ilkkapeltola/d7f0f4e400134455921fefb7dffba90e">which you can get here</a>. You just need to have your private key file.<br />
</div>Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-15284823843863540752018-04-30T10:00:00.002+03:002020-07-06T22:47:46.692+03:00Simple way to query Amazon Athena in python with boto3<div class="separator" style="clear: both; text-align: center;">
</div>
ETL takes time and it's a lot to maintain. Sometimes it breaks when you didn't expect a string to contain emojis. You might decide the transformation needs to be changed, which means you need to refresh all your data. So what can you do to avoid this?<br />
<br />
<a name='more'></a><br />
Data virtualization helps tremendously with that. It's not the answer to everything of course, but it makes certain things loads easier. When you don't have to worry too much about the structure in which to store your data, you can choose to keep it in its richest, untouched format. All of it. So when you need it in a specific format, it's all there. With data virtualization, you impose a schema on read, not on write, and you can change that schema as you wish.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgi-wBvKQcr8XMbDNSYBxkeMdeNI9l9gsPcOGOftOI98yF1rzVqzAjntdD-qtOFSgK979tbTuQ2uD_OIw9aJsQeKqmLMRJeHdhtBQo729XaGlS4Mz-edrYrJJ42tAq6vsqGerJRI5jW8xE/s1600/athena-logo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="455" data-original-width="644" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgi-wBvKQcr8XMbDNSYBxkeMdeNI9l9gsPcOGOftOI98yF1rzVqzAjntdD-qtOFSgK979tbTuQ2uD_OIw9aJsQeKqmLMRJeHdhtBQo729XaGlS4Mz-edrYrJJ42tAq6vsqGerJRI5jW8xE/s320/athena-logo.png" width="320" /></a></div>
<br />
Towards the end of 2016, <a href="https://venturebeat.com/2016/11/30/aws-launches-amazon-athena-service-for-querying-data-stored-in-s3/">Amazon launched Athena</a> - and it's pretty awesome. We can e.g. store our raw JSON data in S3, define virtual databases with virtual tables on top of them and query these tables with SQL. And on top of everything, it is quite simple to take into use.<br />
<br />
To give it a go, just dump some raw data files (e.g. CSV, JSON or log files) into an S3 bucket, head over to Amazon Athena and run a wizard that takes you through a virtual table creation step-by-step. If you are familiar with Hive, you will find that the Data Definition Language is identical. Once a table is created, it's ready to be queried.<br />
<br />
If you wish to run queries against Athena from e.g. Python, you have a few options, for example<br />
<ol>
<li>PyAthenaJDBC</li>
<li>JayDeBeApi</li>
<li>boto3</li>
</ol>
<div>
From a user experience point of view, <b>PyAthenaJDBC</b> would have been my preferred option too, as the first two would have let me query easily into a pandas DataFrame, but I was too lazy to compile the PyAthenaJDBC on my Windows machine (would've required Visual C++ Build Tools which I didn't have). <b>JayDeBeApi</b> looked like a hassle to set up.<br />
<br />
<b>Boto3</b> was something I was already familiar with. With boto3, you specify the S3 path where you want to store the results, wait for the query execution to finish and fetch the file once it is there. And clean up afterwards. Once all of this is wrapped in a function, it gets really manageable.<br />
<br />
If you want to see the code, go ahead and copy-paste this gist: <a href="https://gist.github.com/ilkkapeltola/929579b9d689809ce81ad93a954d50d3" target="_blank">query Athena using boto3</a>. I'll explain the code below.<br />
<br />
First let's start with our configurations. Fill these with your own details of course.<br />
<br /></div>
<pre><code class="python">import boto3
import pandas as pd
import io
import re
import time
params = {
'region': 'eu-central-1',
'database': 'databasename',
'bucket': 'your-bucket-name',
'path': 'temp/athena/output',
'query': 'SELECT * FROM tablename LIMIT 100'
}
session = boto3.Session()
</code></pre>
The following function will dispatch the query to Athena with our details and return an execution object. Our query will be handled in the background by Athena asynchronously.<br />
<pre><code class="python">def athena_query(client, params):
response = client.start_query_execution(
QueryString=params["query"],
QueryExecutionContext={
'Database': params['database']
},
ResultConfiguration={
'OutputLocation': 's3://' + params['bucket'] + '/' + params['path']
}
)
return response</code></pre>
<br />
To make things a little easier, the next function will do a few things for us. It will:<br />
<ol>
<li>Dispatch the query to Athena</li>
<li>Poll the execution status until the query has finished</li>
<li>Return the filename in S3 where the query results are stored</li>
</ol>
<pre><code class="python">def athena_to_s3(session, params, max_execution = 5):
client = session.client('athena', region_name=params["region"])
execution = athena_query(client, params)
execution_id = execution['QueryExecutionId']
state = 'RUNNING'
while (max_execution > 0 and state in ['RUNNING', 'QUEUED']):
max_execution = max_execution - 1
response = client.get_query_execution(QueryExecutionId = execution_id)
if 'QueryExecution' in response and \
'Status' in response['QueryExecution'] and \
'State' in response['QueryExecution']['Status']:
state = response['QueryExecution']['Status']['State']
if state == 'FAILED':
return False
elif state == 'SUCCEEDED':
s3_path = response['QueryExecution']['ResultConfiguration']['OutputLocation']
filename = re.findall('.*\/(.*)', s3_path)[0]
return filename
time.sleep(1)
return False</code></pre>
If you want to clean things up after, this piece of code removes all objects in the path:
<br />
<br />
<pre><code class="python"># Deletes all files in your path so use carefully!
def cleanup(session, params):
s3 = session.resource('s3')
my_bucket = s3.Bucket(params['bucket'])
for item in my_bucket.objects.filter(Prefix=params['path']):
item.delete()</code></pre>
Best be careful with that one ;)<br />
<br />
Once you have all these defined, you can call
<br />
<pre><code class="python"># Query Athena and get the s3 filename as a result
s3_filename = athena_to_s3(session, params)
# Removes all files from the s3 folder you specified, so be careful
cleanup(session, params)</code></pre>
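<div>
The <code>pandas</code> and <code>io</code> imports at the top are there so you can pull the result file straight into a DataFrame before you run the cleanup. A minimal sketch of that step (the helper name <code>s3_to_pandas</code> is mine, and it assumes Athena wrote its CSV under <code>params['path']</code>):</div>
<pre><code class="python"># Reads the Athena result CSV from S3 into a pandas DataFrame.
# The function name is my own; it uses the filename returned by athena_to_s3.
def s3_to_pandas(session, params, s3_filename):
    s3 = session.client('s3')
    obj = s3.get_object(Bucket=params['bucket'],
                        Key=params['path'] + '/' + s3_filename)
    return pd.read_csv(io.BytesIO(obj['Body'].read()))

df = s3_to_pandas(session, params, s3_filename)</code></pre>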
If you enjoyed these instructions, would you mind commenting below 👇🏻?<br />
Also, if there's something this did <i>not</i> help you with regarding querying Athena with boto3, I'd be happy to improve the article.<br />
<i><br /></i>
<i>Last update 25.3.2019</i>Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com4tag:blogger.com,1999:blog-134700576505555088.post-81179748908014905632018-02-13T12:56:00.000+02:002018-05-28T23:09:27.721+03:00Know exactly how much you pay to acquire any user: Python with Google API<div class="separator" style="clear: both; text-align: center;">
</div>
So you've read about how to <a href="http://www.ilkkapeltola.fi/2017/11/optimize-your-marketing-efforts-with.html">optimize your marketing efforts through data</a>. With that, you should know the kind of users different marketing campaigns are bringing in. Some campaigns might be bringing more high-quality users than others. Can you do that now?<br />
<a name='more'></a><br />
And that's only one side of the coin. What about how much the campaigns actually cost you? How much are you paying for an active user for example? <b>How valuable would it be to you to know exactly, for each individual user, how much you've paid to acquire them?</b> And combining with the churn rate of your different campaigns, what is the <code>Customer Lifetime Value</code> of an active user per different campaign, considering the <code>Customer Acquisition Cost</code>?<br />
<br />
I've been discussing this with digital marketing professionals and it seems that many companies do not get to this granular level. Many companies do not have the means to know exactly how much they've paid to acquire a specific user. Although they could. And imagine the possibilities! In Zervant's case for example:<br />
<br />
<ul>
<li>We were able to identify people coming through our Google display marketing campaigns. We paid very little for those users (in terms of sign-up), but they never became active users. Thus, the cost per active user was still very high. So we cancelled these campaigns.</li>
<li>We are able to say if a campaign where we e.g. advertise our <b>paper-as-a-service</b> delivery results in users who send paper invoices. We can calculate how much we pay for each acquired user, how many of them activate, how much they invoice in paper and at what rate they churn. Overall, we will have a detailed idea of how well the marketing message reaches the users who appreciate the paper invoicing and what the <b>revenue is per user</b>. Even if the cost per active user might be higher, if it attracts the right kind of users who benefit from our premium services, their churn might be lower, resulting in a highly favorable CLV/CAC ratio.</li>
</ul>
<br />
Now, to do it the easy way, I would fully recommend having a look at <a href="https://supermetrics.com/">Supermetrics</a>. With Supermetrics you can pull in your marketing spend data into a Google Sheet very easily. No hassle. They will even let you <a href="https://supermetrics.com/blog/scheduled-refreshing-emailing-supermetrics-google-docs/">schedule the refresh daily</a>, if you prefer. If you then need to read this data into your analytics database, it's a matter of exporting the <a href="https://gist.github.com/cspickert/1650271">Google Sheet</a>. I'm pretty sure it will work ;)<br />
<br />
But that's not the path I took. I wanted to query the data myself, directly from the marketing APIs. Why? I'm not sure. But it was worth it I'm sure.<br />
<br />
To accomplish this, you need to do the following things:<br />
<ul>
<li>Create a Google API service account (<a href="https://console.developers.google.com/iam-admin/serviceaccounts">here</a>)</li>
<li>Find your <code>VIEW_ID</code></li>
<li>Grant the service account rights to read your Google Analytics data</li>
</ul>
<div>
The original instruction for this is found <a href="https://developers.google.com/analytics/devguides/reporting/core/v4/quickstart/service-py">here</a>.</div>
<h3>
Create Service Account</h3>
<div>
You can create a Service Account on the IAM & Admin portal: https://console.developers.google.com/iam-admin/serviceaccounts</div>
<div>
You might need to create a project first. Read more about Google APIs OAuth2 ServiceAccounts <a href="https://developers.google.com/identity/protocols/OAuth2ServiceAccount">here</a>.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzr0PBbEfQ5PLO2Y9DCIOoeOfjdcKODGlIstCJZ9ivGyKVEebpuzchxDA7602Qjrer3HhfMz-nC43WQGTba8U9buxTcXKn3CfadF7FzHoT_3wKPM2hRWBV-8UugHISV8i9zhefLm1_yD4/s1600/ga-service-account.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRD1sTMR1v3lpicKgGhDnWMCm5v2yqzkcQgKCtbYrXZQ9UoJ5nmnU1_cHIUzF3PRgZTZAlwZ6MRjKp54L_x40n5WeYSv_2CXi-dXkhzpZUcn5KCi-Pebuja-MbPnPpp3GJdbWGQixpJY8/s1600/ga-service-account_1.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRD1sTMR1v3lpicKgGhDnWMCm5v2yqzkcQgKCtbYrXZQ9UoJ5nmnU1_cHIUzF3PRgZTZAlwZ6MRjKp54L_x40n5WeYSv_2CXi-dXkhzpZUcn5KCi-Pebuja-MbPnPpp3GJdbWGQixpJY8/s1600/ga-service-account_1.png" /></a></div>
<div>
<br /></div>
<h3>
VIEW_ID</h3>
<div>
When on Google Analytics, go to the account selector (top left). When you expand that, you see Analytics Accounts, Properties & Apps and Views. The number below the view you are interested in is your <code>VIEW_ID</code>.</div>
<h3>
Grant rights</h3>
<div>
When you try to run your script for the first time, you will likely get an error such as the one below:</div>
<br />
<pre><code>HttpError 403 when requesting
https://analyticsreporting.googleapis.com/v4/reports:batchGet?alt=json
returned "Google Analytics Reporting API has not been used in project [...] before or it
is disabled. Enable it by visiting
https://console.developers.google.com/apis/api/analyticsreporting.googleapis.com/overview?project=...
then retry. If you enabled this API recently, wait a few minutes for the action to propagate
to our systems and retry."</code></pre>
<div>
<br />
As you follow this link, just enable the API. Alternatively, you can use <a href="https://console.developers.google.com/start/api?id=analyticsreporting.googleapis.com&credential=client_key">Google's Setup tool</a>.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuQgpt6vY_4AG1PvCJODuNTmpcKaREpSYjzodU3ZhLEOAJZiz25ZKnseHTUrT5YoPYXYgsXXWUUOSA1TU4jXYBckSFR7chZ8TB0QpFsP9Pr9g2VXXiz-1G9nxNK0xqfQK2_4Wkys8BQRQ/s1600/enable-ga-api.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuQgpt6vY_4AG1PvCJODuNTmpcKaREpSYjzodU3ZhLEOAJZiz25ZKnseHTUrT5YoPYXYgsXXWUUOSA1TU4jXYBckSFR7chZ8TB0QpFsP9Pr9g2VXXiz-1G9nxNK0xqfQK2_4Wkys8BQRQ/s1600/enable-ga-api.PNG" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
</div>
You need to add your service account email address to your Google Analytics account. The service account email address is visible on the <a href="https://console.developers.google.com/iam-admin/serviceaccounts/">IAM & Admin page</a> and ends with iam.gserviceaccount.com. Just copy the email address, head over to your Google Analytics admin settings and add the email as a reader to your GA account.<br />
<div>
<h3>
Code</h3>
The following sections can be viewed as a <a href="https://gist.github.com/ilkkapeltola/0e3990265b92b54954fc618edfad5a7d">full GitHub Gist here</a>, but are explained in more detail below.<br />
<br />
Setting up, we import a couple of Google libraries, pandas and datetime.<br />
<pre><code>from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd
import datetime
# Read-only access to the Analytics Reporting API is enough
SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = './path-to-your-private-key-file.json'
VIEW_ID = '123456789'
from_date = datetime.datetime(2017,12,1,0,0,0).strftime("%Y-%m-%d")</code></pre>
</div>
<br />
<code>initialize_analyticsreporting</code> creates the authenticated object that can call the API.<br />
<code>get_report</code> calls the API and combines the result pages into a list of responses.<br />
<code>strip_ga_prefix</code> cleans up the column names in the responses.<br />
<code>get_dataframe</code> combines the list of responses from <code>get_report</code> into one big pandas dataframe.<br />
<br />
<pre><code>def initialize_analyticsreporting():
    """Initializes an Analytics Reporting API V4 service object.
    Returns:
        An authorized Analytics Reporting API V4 service object.
    """
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        KEY_FILE_LOCATION, SCOPES)
    # Build the service object.
    analytics = build('analyticsreporting', 'v4', credentials=credentials)
    return analytics


def get_report(analytics, query):
    """Queries the Analytics Reporting API V4 and combines the result pages.
    Args:
        analytics: An authorized Analytics Reporting API V4 service object.
        query: the json that describes the query
    Returns: a list of API responses, one per result page.
    """
    nextPageToken = '0'
    responses = []
    while nextPageToken is not None:
        query['reportRequests'][0]['pageToken'] = nextPageToken
        result = analytics.reports().batchGet(
            body=query
        ).execute()
        responses.append(result)
        if 'nextPageToken' in result['reports'][0]:
            nextPageToken = result['reports'][0]['nextPageToken']
        else:
            return responses
    return responses


# The columns from GA come with a ga: prefix. This simply removes it where it exists.
def strip_ga_prefix(string):
    if string[:3] == 'ga:':
        return string[3:]
    else:
        return string


def get_dataframe(responses):
    """Parses a list of Analytics Reporting API V4 responses into a pandas dataframe.
    Args:
        responses: a list of Analytics Reporting API V4 responses.
    """
    rowlist = []
    for response in responses:
        for report in response.get('reports', []):
            columnHeader = report.get('columnHeader', {})
            dimensionHeaders = columnHeader.get('dimensions', [])
            metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])
            for row in report.get('data', {}).get('rows', []):
                new_row = {}
                dimensions = row.get('dimensions', [])
                dateRangeValues = row.get('metrics', [])
                for header, dimension in zip(dimensionHeaders, dimensions):
                    new_row[strip_ga_prefix(header)] = dimension
                for i, values in enumerate(dateRangeValues):
                    for metricHeader, value in zip(metricHeaders, values.get('values')):
                        new_row[strip_ga_prefix(metricHeader.get('name'))] = value
                rowlist.append(new_row)
    df = pd.DataFrame(rowlist)
    return df</code></pre>
The query describes what we want to fetch from the API:<br />
<pre><code class="python">query = {
    'reportRequests': [{
        'viewId': VIEW_ID,
        'pageSize': 10000,
        'dateRanges': [{
            'startDate': from_date,
            'endDate': 'today'}],
        'dimensions': [
            {'name': 'ga:date'},
            {'name': 'ga:source'},
            {'name': 'ga:medium'},
            {'name': 'ga:campaign'},
            {'name': 'ga:keyword'},
            {'name': 'ga:adGroup'}
        ],
        'orderBys': [
            {'fieldName': 'ga:adCost', 'sortOrder': 'DESCENDING'},
        ],
        'metrics': [
            {'expression': 'ga:adCost'},
        ],
    }]
}</code></pre>
And putting these together: authenticating, fetching the response list and building a DataFrame from it.<br />
<pre><code class="python">analytics = initialize_analyticsreporting()
response = get_report(analytics, query)
df = get_dataframe(response)</code></pre>
That's it. Next, you will want to store the dataframe contents in your database, where you can blend the data with other data such as new user traffic.<br />
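For the storage step, a minimal sketch with SQLAlchemy could look like the following (the connection string and table name are placeholders, not the exact setup I use):<br />
<pre><code class="python">from sqlalchemy import create_engine

# Placeholder connection string - point this at your own analytics database
engine = create_engine('postgresql://user:password@localhost:5432/analytics')

# Append the spend rows built above (df) into a staging table;
# deduplication / upsert logic is up to you
df.to_sql('ga_ad_cost', engine, if_exists='append', index=False)</code></pre>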
<br />
Google is easy and they have good instructions themselves. I have learned that Facebook and Bing can be trickier. In the following posts I will give step-by-step instructions for getting spend data from Bing and Facebook.Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-5131514590901333692018-01-16T09:13:00.003+02:002018-05-28T23:10:20.160+03:00What is the easiest way to train a neural network?<div class="separator" style="clear: both; text-align: center;">
</div>
We've been surveying our customers since forever, but recently we've become more and more hungry for data. As an invoicing service provider for micro entrepreneurs, <a href="https://www.zervant.com/">Zervant's</a> customers come in all varieties. One interesting data snippet we've found out is that roughly half of our customers are part-time entrepreneurs.<br />
<br />
But <b>which half</b> is it?<br />
<br />
<a name='more'></a>As you can perhaps imagine a part-time entrepreneur's needs can differ greatly from those of a full-time one. To be able to cater for those needs, we should know which half they each belong to.<br />
<br />
This will not be a post about how to train a neural network in R, all the steps and nuances of the process, or some big revelation that turned a seemingly doomed project into a success. This will be a lot shorter. In fact, once I had my data in the format I wanted, it took me 5 minutes to train a neural network and a boosted decision tree for my classifier. On <a href="https://studio.azureml.net/">Azure ML Studio</a>. <b>5 minutes</b>.<br />
<br />
If it's a simple algorithm you need to try out and just don't feel like coding it up in R or Python, then have a look at the Azure ML Studio. From what I've used, it's the quickest way to have something up and running and see results. The workflow is implemented in a visual programming 'language' of sorts, like so:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGu-SRez1g1TcdWfpnd7GOX5zdPm5bPcwLb7ZsgGl4Vs7WMGxrpB6rJ4fTQFzvf2x8qWm3C13otIOYY295y-8HtmUMFX40SSwomMBYktsiR1zSM7ZlBrHIzPqmqkUUXsxHQkxZDRbjmEc/s1600/Screenshot+2018-01-13+at+19.02.51.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="477" data-original-width="768" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGu-SRez1g1TcdWfpnd7GOX5zdPm5bPcwLb7ZsgGl4Vs7WMGxrpB6rJ4fTQFzvf2x8qWm3C13otIOYY295y-8HtmUMFX40SSwomMBYktsiR1zSM7ZlBrHIzPqmqkUUXsxHQkxZDRbjmEc/s1600/Screenshot+2018-01-13+at+19.02.51.png" /></a></div>
<br />
If you don't have experience in Azure ML Studio, I would recommend having a look at <a href="https://studio.azureml.net/">this free course on edX</a>. The course isn't too advanced even if you don't have experience in machine learning algorithms.<br />
<br />
There are some downsides, too. Once you have trained an algorithm, it might not be very easy to <b>get that algorithm out of Azure ML</b>. Azure ML scales parameters automatically, so even if you manage to get e.g. a regression model out, you might still have trouble knowing how the data was scaled before the model was trained. With a regression model that is still doable; with something like a neural network or a boosted decision tree it becomes practically impossible.<br />
<br />
Azure ML Studio wants you to publish your trained algorithm <b>as a web service</b> rather than taking it out of the portal and implementing it yourself. And, getting any substantial volume through that API will mean you need to pay for the service.<br />
<br />
At least for me Azure ML Studio lowers the threshold to try a model out on a problem. By doing so, it enables you to easily <b>evaluate a range of models</b> and see how well they perform. This can help you decide more quickly which ones to go after, in case you wish to implement them in e.g. Python.Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-25340932544816073492018-01-02T12:53:00.001+02:002018-05-28T23:11:19.826+03:00Prioritize your features with the Kano methodPrioritizing new features requires intuition, but it can definitely be made better by using data. As there's always a limited amount of bandwidth to do stuff, you need to relentlessly prioritize what to do next. An easy way to fail is to be too busy doing the wrong things.<br />
<br />
<a name='more'></a><br />
<br />
<h2>
Beware letting your customers do prioritization for you</h2>
<div>
Open innovation was a big thing in the early 2000's. Companies such as <a href="https://www.istockphoto.com/">iStockPhoto</a>, <a href="https://www.threadless.com/">Threadless</a> and <a href="https://www.lego.com/">Lego</a> rode to success in collaboration with their customers. We read The Wisdom of Crowds by James Surowiecki, published in 2004, on how to count how many jelly beans there are in a glass jar. In 2006, the <a href="https://www.wired.com/2006/06/crowds/">Wired magazine coined the term Crowdsourcing</a> to describe how users create valuable content for companies. Companies such as <a href="https://getsatisfaction.com/">GetSatisfaction</a> and <a href="https://www.uservoice.com/">UserVoice</a> were created on the premise that it's the most efficient way of hearing the Voice of the Customer.</div>
<div>
<br /></div>
<div>
Everyone jumped on the bandwagon to democratize feedback. But do we read about those anymore? No. Why? Because it doesn't really work that well.</div>
<div>
<br /></div>
<div>
Have a look at some of the <a href="https://www.google.fi/search?q=uservoice+feature+request">User Voice sites</a>, for example Microsoft's <a href="https://xbox.uservoice.com/forums/363186--new-ideas/filters/top">Xbox</a> and <a href="https://feedback.azure.com/forums/34192--general-feedback/filters/top">Azure</a>. You would think that with such massive user bases the top requested items should be pure gold, but they're not. To make things worse, most of them have received no reply as to why they're not being considered.</div>
<div>
<br /></div>
<div>
After being instructed to do so by their customer support, I recently added my 2 cents to a <a href="https://suggest.hotjar.com/app/#/case/44516">request at HotJar</a>, about something that made it impossible to continue using it at Zervant. After that, it's the usual treatment: silence. While this was critical for me, it's clear it's either complicated to implement, not that common a problem or doesn't fit HotJar's product vision.</div>
<div>
<br /></div>
<div>
That's the whole problem: your customers don't know what's valuable to your other customers, how much effort something takes to do and what you have planned for the future roadmap. You might still be focusing on core features, while your success has already aroused the interest of the Chinese (1.2 bn), Spanish (400m), Hindi (370m), Arabic (250m) and Portuguese (215m) speaking populations eagerly asking for you to add language support. Imagine what listening to that would do to your product roadmap.<br />
<br />
<br />
<div class="separator" style="-webkit-text-stroke-width: 0px; clear: both; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; margin: 0px; orphans: 2; text-align: center; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="-webkit-text-stroke-width: 0px; font-family: "Times New Roman"; letter-spacing: normal; margin-bottom: 0.5em; margin-left: auto; margin-right: auto; orphans: 2; padding: 6px; text-align: center; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px;"><tbody>
<tr><td style="text-align: center;"><div style="margin: 0px;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5Pzk8ZfcajXXJQYZqmQfZS0yJA8pCgxHv0tO5zoyjdM7Ay3pDJ03bFHp-c1VQQMczgRdDeNk42qUNSZWEJjrjIvRT1TC3Ax3vkENmxIGHjLh4Mu73HziYcmEANRpscObWwWCSFBm7xDI/s1600/Kano_Model.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="661" data-original-width="812" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5Pzk8ZfcajXXJQYZqmQfZS0yJA8pCgxHv0tO5zoyjdM7Ay3pDJ03bFHp-c1VQQMczgRdDeNk42qUNSZWEJjrjIvRT1TC3Ax3vkENmxIGHjLh4Mu73HziYcmEANRpscObWwWCSFBm7xDI/s1600/Kano_Model.png" /></a></div>
</td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px; padding-top: 4px; text-align: center;"><div style="margin: 0px;">
Feature priorities in Kano model</div>
</td></tr>
</tbody></table>
</div>
<h2>
Use a proven method: Kano</h2>
<div>
Like I said, prioritizing feature requests is quite difficult. If you ask a customer if they would like features X, Y and Z, they will want them all. You also don't necessarily want to keep on adding more features, unless they are really critical for staying competitive. There are plenty of products and services that have succeeded in avoiding feature fatigue (a term coined as late as 2005 in a <a href="https://pdfs.semanticscholar.org/6e48/87d037aa4ddee462cc1215b19f4912da0fc9.pdf">publication</a> by Thompson, Hamilton and Rust); think of Dropbox. When stakeholders propose new features, you need a way to understand how they actually fit your customer base.</div>
<br />
I recommend reading the very thorough piece on the <a href="https://foldingburritos.com/kano-model/">Kano model on foldingburritos.com</a>. The reason I like the Kano model is that it puts <b>you</b> back in the driver's seat:<br />
<br />
<ol>
<li>You choose <b>what features</b> you want to prioritize. That way you can be sure they will fit your roadmap.</li>
<li>You choose <b>who you ask</b>, so you know they are a sample of your target group.</li>
<li>You <b>control the results</b>: it's not out there for anybody to hold against you.</li>
<li>You get<b> rich data</b>.</li>
</ol>
<div>
So how do you run a Kano study? I still suggest you have a look at the foldingburritos article, but in short:<br />
<br /></div>
<div>
<ol>
<li>Gather a list of the value propositions you are considering. This can include existing value propositions you are considering to drop.</li>
<li>Formulate two questions from each, a functional and dysfunctional one, <b>functional</b> meaning the case where the value can be provided and a <b>dysfunctional</b> meaning it is missing. The question needs to be such that it asks how the respondent <i>feels about such a situation.</i> For example:
<blockquote class="tr_bq">
If data can be easily shared between multiple devices, how do you feel? and</blockquote>
<blockquote class="tr_bq">
If sharing data between multiple devices requires significant effort, how do you feel?
</blockquote>
</li>
<li>Give the same alternatives for both on an axis from <b>pleasure</b> to <b>avoidance of displeasure</b>, e.g.: I like it, I expect it, I'm neutral, I can live with it and I dislike it.</li>
<li>Optionally, add a question for how important the topic is for the respondent. This makes the survey cognitively heavier, but is sometimes worth it.</li>
<li>Run your survey. When you analyze your results, depending on how an individual has answered <b>both questions of the same value proposition</b>, you can say how many of your respondents consider the item a <b>must have</b>, a <b>"more is better"</b> type of thing, a <b>delightful</b> improvement, <b>negative</b> or <b>irrelevant</b>. Be sure to check the foldingburritos article on how to do the categorization and how to weigh different answers; a minimal sketch of the classic categorization table follows this list.</li>
</ol>
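<div>
To make the categorization step concrete, here is a minimal sketch of the classic Kano evaluation table in Python. This is my own illustration rather than something lifted from the foldingburritos article, and the exact mapping varies slightly between sources, so double-check it against whichever variant you decide to follow:</div>
<pre><code class="python"># Answer scale used for both the functional and the dysfunctional question
ANSWERS = ['like', 'expect', 'neutral', 'live with', 'dislike']

# Classic Kano evaluation table: rows = functional answer, columns = dysfunctional answer.
# A = attractive (delighter), O = one-dimensional ("more is better"), M = must-have,
# I = indifferent, R = reverse, Q = questionable/contradictory answer pair.
KANO_TABLE = {
    'like':      ['Q', 'A', 'A', 'A', 'O'],
    'expect':    ['R', 'I', 'I', 'I', 'M'],
    'neutral':   ['R', 'I', 'I', 'I', 'M'],
    'live with': ['R', 'I', 'I', 'I', 'M'],
    'dislike':   ['R', 'R', 'R', 'R', 'Q'],
}

def categorize(functional, dysfunctional):
    """Map one respondent's answer pair to a Kano category."""
    return KANO_TABLE[functional][ANSWERS.index(dysfunctional)]

# "If data can be easily shared between devices, how do you feel?" -> "I expect it"
# "If sharing data requires significant effort, how do you feel?"  -> "I dislike it"
print(categorize('expect', 'dislike'))  # M: a must-have</code></pre>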
<div>
That's it, nothing more to it really. Let me know if you found this helpful.</div>
</div>
Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-43029852038217452722017-12-18T11:00:00.000+02:002018-05-28T23:12:28.835+03:00Spot your churning users sooner with statistics<div class="separator" style="clear: both; text-align: center;">
</div>
The girl (or guy) you're dating has called you on average five times a week lately. As a neurotic analytics enthusiast/stalker you happen to know this. Now it's been three days without a call. Should you be worried?<br />
<br />
<a name='more'></a><br />
<br />
<h2>
<span style="font-weight: normal;">Remember Gamma-distributions? </span></h2>
<div>
<span style="font-weight: normal;"><br /></span></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjl_8_nzZE7dd3I1BqpYZ9kjg5MT5DjE40xKozWRm9jdrVf97W-VOgg28wU8J8m97Ozdhji0eQL0X8iOuvFrMtzgDF65YfIlE5PSHG6rVvwSM3KcNh5_oY_OgPiHtBiuT_Lh-3huFA_XUE/s1600/Gamma_distribution_pdf.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="975" data-original-width="1300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjl_8_nzZE7dd3I1BqpYZ9kjg5MT5DjE40xKozWRm9jdrVf97W-VOgg28wU8J8m97Ozdhji0eQL0X8iOuvFrMtzgDF65YfIlE5PSHG6rVvwSM3KcNh5_oY_OgPiHtBiuT_Lh-3huFA_XUE/s1600/Gamma_distribution_pdf.png" /></a></div>
<span style="font-weight: normal;"><br /></span></div>
<div>
<span style="font-weight: normal;"><br /></span>
<span style="font-weight: normal;">If you would model the delay between calls (by a number of different girls you're simultaneously courting, you stud) and you have a lot of girls calling you, then the distribution of those delays is likely <a href="https://en.wikipedia.org/wiki/Exponential_distribution">exponential</a>. However, if you are modeling an individual girl's call delays, the events no longer are independent, and therefore, a Gamma distribution is more likely. (<a href="http://sherrytowers.com/2016/01/23/poisson-and-exponential-distributions/">read more about the different distributions</a>)</span><br />
<br />
In order to use the Gamma distribution you need two parameters: the shape <code>α = k</code> and either scale <code>θ</code>, rate <code>β = 1/θ</code> or mean <code>μ = k/β</code>. You need to <a href="https://stackoverflow.com/questions/17616702/difficulty-fitting-gamma-distribution-with-r">do some work</a> to figure out which parameterization fits your data best. Let's assume our shape <code>k = 2</code>, and our mean <code>μ = 1.75</code>, which means our rate <code>β = 2 / 1.75 = 1.14</code> and scale <code>θ = 1.75/2 = 0.875</code>.</div>
<div>
<br />
We check how likely this scenario is. There's a handy <a href="http://keisan.casio.com/exec/system/1180573217">online tool for that</a>. Or if you have R, just run this:</div>
<pre><code>> pgamma(3, 2, scale=0.875, lower.tail = FALSE)
[1] 0.1436329</code></pre>
<br />
... which is to say that with these parameters over 14% of the delays are longer. Sounds reasonable. But what does this have to do with your business? And is it anybody else's business than yours whether she calls you or not?<br />
<br />
<h2>
How your users are dating you</h2>
<div>
<br /></div>
<div>
You can apply the above logic to e.g. how often your user signs in or does something of value from your perspective. And you can easily apply this on a user-specific level, getting a quick feeling of whether their feelings for you are mellowing. If a user tends to sign into your service every three days (<code>μ = 3</code> and if again we assume <code>k = 2</code>, then <code>θ = 1.5</code> ) and you haven't seen them now for a week (waiting time = 7), you can say statistically that this is getting rare. Only 5 out of 100 delays in your user's distribution will be this long. But what does that mean?</div>
<div>
<br /></div>
<div>
That most likely they're on vacation and they'll still come back to you.</div>
<div>
<br /></div>
<div>
Until they don't. The longer you wait, the more certain you can be that they're not coming back.</div>
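<div>
If you would rather check these tail probabilities in Python than in R, scipy gives the same numbers. A quick sketch, using the same assumed parameters as above:</div>
<pre><code class="python">from scipy.stats import gamma

# The dating example: shape k = 2, scale = 0.875, three days of silence
print(gamma.sf(3, a=2, scale=0.875))   # ~0.144, i.e. over 14% of delays are longer

# The sign-in example: mean gap 3 days (k = 2, scale = 1.5), silent for a week
print(gamma.sf(7, a=2, scale=1.5))     # ~0.053, i.e. roughly 5 delays out of 100 are this long</code></pre>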
<div>
<br /></div>
<h2>
How quickly can you spot a churn?</h2>
<div>
<br /></div>
<div>
Let's assume that you define churn as a user who doesn't come back in a while, e.g. a week. The ratio of the current delay to the user's average delay determines the gamma tail probability. If you pick a few dates at random from the past, compute that probability for each of your users and use it as a feature in a logistic regression model predicting whether they churn, you will find that there is plenty of predictive power in it. And why shouldn't there be? After all, it's just a more accurate measure of how long a particular user has been away, relative to their own rhythm.</div>
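<div>
As a rough illustration of that idea, here is a sketch with made-up numbers: the shape parameter, column names and data are all assumptions, not anything fitted to real users. It just shows how the gamma tail probability can be turned into a feature for a logistic regression:</div>
<pre><code class="python">import pandas as pd
from scipy.stats import gamma
from sklearn.linear_model import LogisticRegression

# Made-up snapshot: for each user, the average gap between logins (days),
# the days of silence at the reference date, and whether they later churned.
users = pd.DataFrame({
    'mean_gap': [3.0, 3.0, 10.0, 2.0, 5.0, 4.0],
    'silence':  [2.0, 15.0, 8.0, 20.0, 3.0, 30.0],
    'churned':  [0, 1, 0, 1, 0, 1],
})

# Probability of seeing a silence at least this long, given the user's own rhythm
# (assuming shape k = 2, so scale = mean_gap / 2, as in the example above).
users['p_longer'] = gamma.sf(users['silence'], a=2, scale=users['mean_gap'] / 2)

model = LogisticRegression().fit(users[['p_longer']], users['churned'])
print(model.predict_proba(users[['p_longer']])[:, 1])</code></pre>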
<div>
<br /></div>
<h2>
So what then?</h2>
<div>
<br /></div>
<div>
It helps to understand how your <b>users are different</b>. You might still choose to have the same definition of being active for all your users, but at least you will understand how that affects your numbers. Using statistics like this will most likely give you some early warning on churn. It is also quite powerful to point out a few users who had been active but quit cold turkey. This raises a lot of interesting questions: What happened? What could we have done to prevent it? Should we ask them? <i>Were there signs already before we could have noticed?</i></div>
<div>
<i><br /></i></div>
<div>
Indeed, were there?</div>
Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-6207870911956841552017-12-11T10:22:00.000+02:002017-12-11T10:22:01.339+02:00Becoming a data-driven company<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhX1hA70v9i6GEv654AIMjhyRb8OusJV_bdBL0dV3c1KK0uz1pS00i2vOGuv2d7fToFxNuLlWSrbSztl1jNT7M9JOjGccZewX7yqa6lgGL_GrbCHISLLnw3G9r9j9qnrDAZqJdnxJ9Mj1Y/s1600/7192618528_6383b07d46_b.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="685" data-original-width="1024" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhX1hA70v9i6GEv654AIMjhyRb8OusJV_bdBL0dV3c1KK0uz1pS00i2vOGuv2d7fToFxNuLlWSrbSztl1jNT7M9JOjGccZewX7yqa6lgGL_GrbCHISLLnw3G9r9j9qnrDAZqJdnxJ9Mj1Y/s1600/7192618528_6383b07d46_b.jpg" /></a></div>
<div>
<br /></div>
<div>
<br />
How does a company become data-driven? Here's what I've learned at Zervant.<br />
<br /></div>
<h2>
Radiate information</h2>
<div>
<br />
If you want to turn the culture more data-driven, radiate information. You never know what will stick. Radiating information also has the side-effect of evangelizing the work of analytics: people will remember you exist and they will take analytics into consideration more easily.</div>
<blockquote class="tr_bq">
Oh right, when we're building this we need to talk to analytics about what to measure! </blockquote>
Write an internal blog, give short presentations about an information nugget, etc. Frequency is more important than the perfect quality of content.<br />
<br />
<h2>
Teach</h2>
<br />
People love to learn new things. They will first tell you they're in no way analytical-minded and far from technical, but they will eagerly learn. De-mystifying analytics will help everyone in your organization to speak the same language and be on the same page about what analytics can do for them.<br />
<br />
As I've held workshops on analytics, I've seen people light up and apply their newly discovered analytical mind in a sophisticated manner on their own fields including HR, branding, marketing and customer support. It doesn't take much, but it gives back a lot.<br />
<br />
<h2>
Be exemplary in prioritizing projects by their business potential</h2>
<div>
<br />
Who doesn't want to develop a fancy neural network that accurately predicts the next user to convert? But if you ask yourself what you can really <i>do</i> with that information and you struggle to answer, it's a sign that you should perhaps reconsider your priorities.</div>
<div>
<br /></div>
<div>
Everyone in the organization will have opinions on what the most important thing is. Amazingly, the most important thing tends to depend on the person and astonishingly often is related to their own field. This applies to us in data science too. Break free from that, be objective and realistic, and show the right kind of example to the rest of the organization. If you have good, concrete arguments about a project's potential, it will be hard to oppose them with only opinions.<br />
<br />
Create a <b>data strategy</b>: start from what your business needs, identify the points that have the largest impact, but equally importantly brainstorm new opportunities for use of data. Make the business case clear and be honest about it.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGhuv-r0BWLRDVNwWiib6Xfb928gwLbfnunzTsyR7SqG1iif_8WbUAl6ILF7GYBAdTvWcnm0UnDS7qAyx5eAe43Uf4no_cELz-0ShhDgWqd-iov2EhTmkYKw2u210LVW8-MMLP11Is3xU/s1600/giphy.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="267" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGhuv-r0BWLRDVNwWiib6Xfb928gwLbfnunzTsyR7SqG1iif_8WbUAl6ILF7GYBAdTvWcnm0UnDS7qAyx5eAe43Uf4no_cELz-0ShhDgWqd-iov2EhTmkYKw2u210LVW8-MMLP11Is3xU/s1600/giphy.gif" /></a></div>
<br />
<br /></div>
<h2>
Experiment lightweight</h2>
<div>
<br />
It doesn't matter that you don't have a fancy A/B testing suite; you can still do lightweight experimentation. Are you planning to add a new feature? What if, before implementing any of it, you just create a mock-up with a button to check who clicks it, who's interested? If you do this, don't be a douche. Do it with respect for your users.</div>
<div>
<br /></div>
<div>
The point is not to get it right the first time, but rather to show how easy it is to test something out. This will help people understand the power of lightweight experimentation and they will have lots of new ideas to try out.<br />
<br /></div>
<h2>
Celebrate even the small wins</h2>
<div>
<br />
Everyone loves a good story, even if the story is about how the definition of your active user got improved - as long as you make it relevant to your audience. Celebrating even the small wins is not just good for morale, but it also highlights the impact of this work. In order to continue enjoying the trust of your top management, this is critical.<br />
<br /></div>
<div>
<h2>
You won't make it alone</h2>
<div>
<br />
Data-driven doesn't mean you shove data down everyone's throats. Rather, try to find approaches that involve others. Analytics is woven into everything a company does. When it is done as an afterthought, it becomes an isolated activity - something to sprinkle on top by the analytics team before things go live. Preventing this from happening unfortunately means having good personal relationships with the key influencers in your organization. So work on your social networking skills.</div>
</div>
<ul class="social">
</ul>
Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-51696912197604752032017-12-04T15:10:00.000+02:002017-12-04T15:10:35.063+02:00Dramatically improve your conversions with session recording<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikojq2j9xJ5tMPpl3lVXAvLKtAVwywvn-WNuNdkla5R0K6Xg9LbpR919WJLr8Y0Ka15r6OWA877N0dHKHmepc7ixiA0LFTrn7CjUMbEeYzrnAfOkMQWN38rL_PazOdEXaboDeX00Uh7CY/s1600/pexels-photo-373543.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1067" data-original-width="1600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikojq2j9xJ5tMPpl3lVXAvLKtAVwywvn-WNuNdkla5R0K6Xg9LbpR919WJLr8Y0Ka15r6OWA877N0dHKHmepc7ixiA0LFTrn7CjUMbEeYzrnAfOkMQWN38rL_PazOdEXaboDeX00Uh7CY/s1600/pexels-photo-373543.jpeg" /></a></div>
<br />
<br />
Are you satisfied with your sign-up rate? Or activation rate? Do you think it could be higher? What do you think you can do to improve them? How do you know that?<br />
<br />
One of the most powerful things that you can do to improve your activation rate is to watch what your users are doing: what confuses them, what path is logical to them and where they fail. When you do that, you will find out that your call-to-action is too mild, your layout is confusing, your responsive design isn't adapting well in all cases or that the navigation path you thought <i>everyone</i> takes isn't common at all. You cannot solve what you do not know. Luckily, most of your issues are likely low-hanging fruit with easy remedies.<br />
<br />
There are a gazillion services you can use, <a href="https://www.hotjar.com/">HotJar</a> being perhaps the biggest name. But it wasn't the right choice, at least for me. HotJar lets you record towards a quota and you can define pretty flexibly who you want to start recording and when. But HotJar doesn't work with scrollable DIVs and if you would like to see only sessions that include purchasing your product, HotJar will fail you.<br />
<br />
If you're not too concerned about GDPR or privacy matters in general, then <a href="https://metrica.yandex.com/">Yandex Metrica</a> might be a great choice: it's free and from what I can tell fairly powerful. But I care about privacy and you should too.<br />
<br />
Then there are decent looking services such as <a href="https://www.fullstory.com/">Fullstory</a>, <a href="https://www.inspectlet.com/">Inspectlet </a>and <a href="https://mouseflow.com/">Mouseflow</a>, but honestly when comparing the bang-for-the-buck they came second to <a href="https://www.smartlook.com/?_ak=ierdE0tpc">Smartlook</a>. I'm especially fond of the fact that they are <a href="https://www.smartlook.com/privacy?_ak=ierdE0tpc">taking GDPR seriously</a>.<br />
<br />
If you end up going with Smartlook, I recommend taking a look at their <a href="https://www.smartlook.com/docs/identify-visitors?_ak=ierdE0tpc">tagging</a>. With that, you can add identifier tags to the sessions, e.g.<br />
<pre><code><script>
// tag a session with a user's email
smartlook('tag', 'email', 'ilkkapel@gmail.com');
</script></code></pre>
This is pretty handy if you get feedback from a user about a bug and you just can't reproduce it. This lets you see what really happened.<br />
<br />
Or you can tag a session that includes a specific action:
<br />
<pre><code><script>
smartlook('tag', 'checkout', 'true');
smartlook('tag', 'contacted', 'true');
</script></code></pre>
Keep in mind that you cannot override a tag you've already used. The following <strong>will not work</strong><br />
<pre><code><script>
smartlook('tag', 'action', 'checkout');
smartlook('tag', 'action', 'contact');
</script></code></pre>
Also, I've found it quite handy to tag sessions with a session number. If you don't otherwise have access to a session number, you can easily track one yourself in the browser, for example using local storage.<br />
<br />
This only gets you so far, though. After the easy problems have been fixed, it becomes increasingly harder to separate the signal from the noise. By that time, you will have tremendously improved your conversions.Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-66305607752848226622017-11-27T20:10:00.000+02:002017-11-27T20:10:08.819+02:00Optimize your marketing efforts with the power of raw data<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikuEdLlLSFZ_lU7PTSRm71_J9UUQ6yr-7IFwqLqCNa9lYowlP4NlxVatZHDHd8uop9tYuQwSQw9G-RrpJW_AP9msq6_pP-4HPiBunDu9HOrwYLFcWYNKlY9PKOevywXxdv32xzb4XtA1A/s1600/IMG_20171118_184737.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikuEdLlLSFZ_lU7PTSRm71_J9UUQ6yr-7IFwqLqCNa9lYowlP4NlxVatZHDHd8uop9tYuQwSQw9G-RrpJW_AP9msq6_pP-4HPiBunDu9HOrwYLFcWYNKlY9PKOevywXxdv32xzb4XtA1A/s1600/IMG_20171118_184737.jpg" /></a></div>
<br />
If you're in marketing (I'm not), you'd want to know how much bang-for-buck your campaigns are producing. Now I'm no Google Analytics expert, but I'm going to claim that you might have an easy time measuring conversions across your campaigns, but not so much how your sign-ups differ between them. Is one bringing in higher-quality people than the others?<br />
<br />
If you are running a web shop or some other transaction-based business then sure, you can set up <a href="https://support.google.com/analytics/answer/1009612?hl=en">Google Analytics Ecommerce</a> to follow up on customer lifetime value. But what if you are running a SaaS and you bill monthly? Or if you have a freemium model and you hope to monetize the high-volume users through some additional features, how would you compare your campaigns then? With only Google Analytics, it is tough to analyze how the users coming through different marketing campaigns differ from each other.<br />
<br />
<div>
What you need is raw data and to know the path each of your users has taken as they sign up to your service, whether paid, direct, organic or other. But Google doesn't give out this information easily. You could invest in <a href="https://www.google.com/analytics/360-suite/#?modal_active=none">Google Analytics 360</a>, which means perhaps $150 000 a year. I love the quote they've selected to highlight by the way:<br />
<blockquote class="tr_bq">
“Companies spend a lot of money on marketing, it’s difficult to understand which dollars are working and which dollars aren’t. The Google Analytics 360 Suite allows you to actually understand which half is working.”</blockquote>
<i>(Shouldn't Google Analytics be able to do that already? No? Okay, my bad.)</i> If you're unwilling to spend the $150 000 on 360, there is also a <a href="http://daynebatten.com/2015/07/raw-data-google-analytics/">hack for querying raw data</a> from Google Analytics. I don't particularly like this approach as it is still dependent on Google to allow this.</div>
<div>
<br /></div>
<div>
You can, however, <a href="http://www.firstdigital.co.nz/blog/2015/07/22/retrieve-traffic-sources-data-without-google-analytics-cookies/">simulate Google Analytics behavior</a> with your own code and do whatever you wish with the data. Especially if your landing pages are in a different subdomain than your service, you might be more successful using a cookie instead of local storage.<br />
<br />
The workflow is simple:<br />
<ol>
<li>Store your data in the user's browser as they first arrive at your site. At that point you have access to data such as the referring URL. This allows you to calculate almost everything, except e.g. auto-tagged AdWords campaigns (as of November 2017).</li>
<li>Once your user successfully signs up to your service, you will need to fetch the marketing traffic data from the user's browser and</li>
<li>Send the data off to somewhere, along with the ID of the new user.</li>
</ol>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://www.websequencediagrams.com/cgi-bin/cdraw?lz=VXNlci0-V2Vic2l0ZTogYnJvd3NlcyB0bwoADQctPkxhbmRpbmcgcGFnZTogZm9sbG93cyBhIGxpbmsgdG8Kbm90ZSByaWdodCBvZiAAIA5TaW11bGF0ZSBHQSBjb29raWVcbkV4dHJhY3Qgc291cmNlLCBtZWRpdW0sXG5jYW1wYWlnbiwgZXRjLgoAbAwtPlVzZXI6IHN0b3JlcwBEBwoAdQZvdmVyIFVzZXIsQXBwbGljYXRpb246ClRpbWUgcGFzc2VzCmVuZCBub3RlCgoAgWoGAB0MIHNpZ25zIHVwIHRvCgA3Cy0-KwBuBndoYXQncyB0aGUAgTwHIHZhbHVlAEUGLT4tAEANABYMIGlzAEMORGF0YWJhc2UAgUEHAEcIZGF0YQo&s=rose" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="439" data-original-width="615" src="https://www.websequencediagrams.com/cgi-bin/cdraw?lz=VXNlci0-V2Vic2l0ZTogYnJvd3NlcyB0bwoADQctPkxhbmRpbmcgcGFnZTogZm9sbG93cyBhIGxpbmsgdG8Kbm90ZSByaWdodCBvZiAAIA5TaW11bGF0ZSBHQSBjb29raWVcbkV4dHJhY3Qgc291cmNlLCBtZWRpdW0sXG5jYW1wYWlnbiwgZXRjLgoAbAwtPlVzZXI6IHN0b3JlcwBEBwoAdQZvdmVyIFVzZXIsQXBwbGljYXRpb246ClRpbWUgcGFzc2VzCmVuZCBub3RlCgoAgWoGAB0MIHNpZ25zIHVwIHRvCgA3Cy0-KwBuBndoYXQncyB0aGUAgTwHIHZhbHVlAEUGLT4tAEANABYMIGlzAEMORGF0YWJhc2UAgUEHAEcIZGF0YQo&s=rose" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
With the simulated GA cookie script above, you can deduce quite a bit of information from the referrer URL, such as <code>source</code> and <code>medium</code>, and if the URL is UTM tagged, also <code>campaign</code>, <code>term</code> and <code>content</code>. For debugging purposes, it might be helpful to store the full <code>referrer url</code> as well.<br />
<br />
We ended up storing the cookie value as a base-64 JSON string. In the below example our <code>valueArray</code> would contain information such as <code>source</code>, <code>medium</code>, <code>campaign</code>, <code>term</code> and <code>content</code>.</div>
<pre><code>// when saving the array to cookie
cookieValue= btoa(JSON.stringify(valueArray));
// when reading the cookie
valueArray = JSON.parse(atob(cookieValue));
</code></pre>
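On the receiving end, when your signup handler gets the cookie value together with the new user's ID, decoding is the mirror image of the JavaScript above. A sketch, assuming a Python backend (the function name and sample values are made up):<br />
<pre><code class="python">import base64
import json

def parse_tracking_cookie(cookie_value):
    """Decode the base-64 JSON written by the landing page script."""
    return json.loads(base64.b64decode(cookie_value))

# Example value, encoded the same way the browser script would have written it
sample = base64.b64encode(json.dumps(
    {'source': 'google', 'medium': 'cpc', 'campaign': 'winter_sale'}).encode()).decode()
print(parse_tracking_cookie(sample))  # {'source': 'google', 'medium': 'cpc', 'campaign': 'winter_sale'}</code></pre>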
Remember to set the cookie domain with the dot in front for it to be accessible on all your subdomains.<br />
<pre><code>document.cookie =
"cookiename=cookieValue;expires=Sat, 18 Nov 2017 00:00:00 GMT;domain=.yourdomain.com";</code></pre>
If you prefer to use something more modern than cookies, <a href="https://github.com/zendesk/cross-storage">this library from Zendesk</a> should help you with local storage across subdomains (haven't tried it though).<br />
<br />
Once the data is in your analytics database, you will be able to analyze what kind of users you are getting from the different marketing sources and double down on those that bring in more quality sign ups.<br />
<br />
It's probably easiest to implement this logic in <a href="https://www.google.com/analytics/tag-manager/">Google Tag Manager</a>. Have fun! Let me know if you're planning to try this and how it went!Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-33312211304701917422017-11-27T10:55:00.001+02:002017-12-11T12:05:39.510+02:00Calculating active users and churn in a database<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqFV1grOsEM6OK0L5qHsPFUR3nFau2WWoZcCo_GQTMEtlre1BxOxgBrh1rnnd6LPK2cJIw4fbD3GtTrU2v0biohQrserXNxu3lx86AK9dExwloy3YeZvnmu0tCdpiwCQSPEB23b197AGo/s1600/SQL.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqFV1grOsEM6OK0L5qHsPFUR3nFau2WWoZcCo_GQTMEtlre1BxOxgBrh1rnnd6LPK2cJIw4fbD3GtTrU2v0biohQrserXNxu3lx86AK9dExwloy3YeZvnmu0tCdpiwCQSPEB23b197AGo/s1600/SQL.jpg" /></a></div>
<br />
Have you ever struggled with calculating your active users per day, or month? How about your new activations or churned users? It's a common problem, but often the first solution you find will not scale.<br />
<br />
You should avoid sub queries, and use a window function instead. But that's only a part of the solution.<br />
<br />
Let's assume your active user definition is "someone who has signed in within 7 days" and you have this data in a table with columns <code>user_id</code> and <code>timestamp</code>. The first thing you want to do is calculate when a user activates for the first time, when they churn and when they resurrect.<br />
<br />
<pre><code>CREATE OR REPLACE VIEW "vw_login" AS
SELECT *
, LEAST (LEAD("timestamp") OVER w, "timestamp" + 7) AS "activeExpiry"
, CASE WHEN LAG("timestamp") OVER w IS NULL THEN true ELSE false END AS "activated"
, CASE
WHEN LEAD("timestamp") OVER w IS NULL THEN true
WHEN LEAD("timestamp") OVER w - "timestamp" > 7 THEN true
ELSE false
END AS "churned"
, CASE
WHEN LAG("timestamp") OVER w IS NULL THEN false
WHEN "timestamp" - LAG("timestamp") OVER w <= 7 THEN false
WHEN row_number() OVER w > 1 THEN true
ELSE false
END AS "resurrected"
FROM "login"
WINDOW w AS (PARTITION BY "user_id" ORDER BY "timestamp")</code></pre>
With this, you've added three boolean columns to your login data: <code>activated</code>, <code>churned</code> and <code>resurrected</code>, which tell you when a user activates for the first time, after which login they churn and which login makes them resurrect. Now your data is in such format that you can calculate active users over any date.<br />
<br />
To calculate your daily active users you can apply the following query:
<br />
<pre><code>SELECT
SUM(SUM(activated::int - churned::int + resurrected::int)) OVER w
, "timestamp"::date FROM vw_login
GROUP BY "timestamp"::date
WINDOW w AS (ORDER BY "timestamp"::date)
ORDER BY "timestamp"::date;</code></pre>
Hold on. No you can't.<br />
<br />
For churn, you also need to take into account the time-shift: your row indicates that the user churns <b>after</b> that login. The day your user becomes inactive is 7 days after the login, not the day of the login. To accomplish this efficiently it's best to first create a view that aggregates the data to a daily level and then do a join.<br />
<br />
<pre><code>-- Aggregate daily
CREATE OR REPLACE VIEW "vw_activity" AS
SELECT
SUM("activated"::int) "activated"
, SUM("churned"::int) "churned"
, SUM("resurrected"::int) "resurrected"
, "timestamp"::date "date"
FROM "vw_login"
GROUP BY "timestamp"::date
;</code></pre>
For the running total you should generate a time series first, to ensure that all the dates you are joining the churn to actually exist. Also, note that you should wrap your values in a <code>COALESCE</code> function: if your <code>SUM</code> includes <code>NULL</code> values, it will return <code>NULL</code>.
<br />
<pre><code>-- Calculate a running total
SELECT
d."date"
, SUM(COALESCE(a.activated::int,0)
- COALESCE(a2.churned::int,0)
+ COALESCE(a.resurrected::int,0)) OVER w
, d."date", a."activated", a2."churned", a."resurrected" FROM
generate_series('2004-03-07'::date, CURRENT_DATE, '1 day'::interval) d("date")
LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '7 days')::date
WINDOW w AS (ORDER BY d."date") ORDER BY d."date";</code></pre>
That's it. That will give you the number of active users per day and the number of users who activated, churned or were resurrected on that day. It's now up to you to figure out what your definition for an active user is and if there are dimensions you need to add. But the basic logic is clear enough.<br />
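If you want to sanity-check the SQL against a small sample, the same flags and running total can be derived with pandas. A rough sketch, assuming a dataframe with the same <code>user_id</code> and <code>timestamp</code> columns and the same 7-day definition:<br />
<pre><code class="python">import pandas as pd

logins = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'timestamp': pd.to_datetime(['2017-01-01', '2017-01-03', '2017-01-20', '2017-01-05']),
}).sort_values(['user_id', 'timestamp'])

gap = pd.Timedelta(days=7)
prev_ts = logins.groupby('user_id')['timestamp'].shift(1)
next_ts = logins.groupby('user_id')['timestamp'].shift(-1)

# Same three flags as in vw_login
logins['activated'] = prev_ts.isna()
logins['churned'] = next_ts.isna() | (next_ts - logins['timestamp'] > gap)
logins['resurrected'] = prev_ts.notna() & (logins['timestamp'] - prev_ts > gap)

# Aggregate per day, shift churn forward by 7 days and take a running total
daily = logins.assign(date=logins['timestamp'].dt.normalize()) \
              .groupby('date')[['activated', 'churned', 'resurrected']].sum()
days = pd.date_range(daily.index.min(), daily.index.max() + gap, freq='D')
daily = daily.reindex(days, fill_value=0)
active = (daily['activated'] - daily['churned'].shift(7, fill_value=0) + daily['resurrected']).cumsum()
print(active)</code></pre>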
<br />
Let me know if you found this useful!Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-66561657743396770572017-11-22T10:45:00.001+02:002018-05-27T21:43:04.758+03:00Power BI, Tableau or QlikView<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRqe44nupiM_Dm0jTXCsyLm3anSEUSEyv8e_H7nAv9wbL2JziFxRqhhLA1ZdyKJBSTOxiPMOYkVV6yHJQE2LebgrKlt-mbDuk6wLSFe6vXQniGY2GSZng79MRqfm4sF4g9Ew1NsJGnY34/s1600/power-bi-logo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="512" data-original-width="512" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRqe44nupiM_Dm0jTXCsyLm3anSEUSEyv8e_H7nAv9wbL2JziFxRqhhLA1ZdyKJBSTOxiPMOYkVV6yHJQE2LebgrKlt-mbDuk6wLSFe6vXQniGY2GSZng79MRqfm4sF4g9Ew1NsJGnY34/s1600/power-bi-logo.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
Ok ok I went along with Power BI but hear me out.<br />
<br />
A year ago I joined <a href="http://www.zervant.com/">Zervant</a> to be responsible for the company's BI and analytics. Pretty soon I had to make a choice: which BI package we should adopt. Before this, Zervant didn't use anything apart from Excel and a Geckoboard. From my previous positions at Nokia and Microsoft, I have experience in Tableau, QlikView and Power BI, in that chronological order. It quickly came down to a choice between these three. Or to be honest - between Tableau and Power BI.<br />
<br />
<blockquote class="tr_bq">
Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime.</blockquote>
<br />
I'm a big believer in self-service BI. I would prefer to build an analytics team that enables others to solve their own decision needs <i>rather than</i> one that hands out the right conclusions. Therefore, building something that can be used by anyone in the company was one of the goals. Also, joining a start-up meant cost was a concern.<br />
<br />
QlikView is in my experience complex to set up and maintain. I might be wrong, but I've always found QlikView a bit unapproachable. Powerful, yes, but too much of a commitment and an investment. I wanted something that was easy to adopt, and possibly change if we didn't get along. QlikSense Cloud has fortunately changed this, but I'm still getting an allergic reaction from Qlik* (sorry). Maybe it's also the fact that I haven't seen great examples of QlikSense in actual use. Also, when searching for tips on QlikSense it becomes clear that the user community must be significantly smaller compared to the other two, which means less help when you run into issues. I believe I'll eventually end up giving QlikSense an honest try, but I will need some very convincing reasons to switch. And I wouldn't do it for <a href="https://www.qlik.com/us/pricing">$25 / user per month</a>.<br />
<br />
<div>
I like Tableau: the way it looks, integration with R, scripting and the rest of the features. I actually prefer it to Power BI. But what I don't like is its <a href="https://www.tableau.com/pricing">price</a>: $112 / month per user, when you need to connect to a PostgreSQL server and have someone else host the dashboards for you. I also often struggle with the way Tableau assigns my data to dimensions and measures. Back when I was making this decision Tableau would've meant shelling out 2500 € for a year's license, and I wasn't ready to make a commitment of that size just yet. $112 / month per user might be a different story.</div>
<div>
<br /></div>
<div>
I ended up choosing Power BI and while I sometimes get dissatisfied comments on this decision from Tableau fans, I am completely satisfied with it. Power BI is on par feature-wise with Tableau. You don't have as much power over design, the DAX language is often infuriating, and I have only recently learned to tolerate the way you are forced to use an on-premises data gateway despite the fact that your databases are also in the cloud. All these shortcomings are offset by what I am able to accomplish with it for the <a href="https://powerbi.microsoft.com/en-us/pricing/">price</a>: free now and $10 / month per user later. I also really appreciate the <a href="https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals">custom visuals</a>, despite them often being quite buggy. But keep in mind that Mac users will not be able to develop new reports.<br />
<br /></div>
<div>
So, if you are a price-conscious Windows user and need a BI tool that anyone can use, have a look at Microsoft's Power BI.</div>
Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0tag:blogger.com,1999:blog-134700576505555088.post-55802480394613911722017-11-16T20:51:00.001+02:002017-11-19T15:34:00.737+02:00Why WINDOW functions are changing the game for MySQL<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQRUpsV64qGtbz-hVTopRMIMP0USTiLDkA0AOE3H7s_9McH4geqIYITz70Zu0wPKYdcuuFT-WDRlVafHLVZeCY7ViFxZrREIiu6RtQQVw6wV5q0fBz9XTxiP8Pxb6F9XDD7QQHq8R3wXA/s1600/SQL.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQRUpsV64qGtbz-hVTopRMIMP0USTiLDkA0AOE3H7s_9McH4geqIYITz70Zu0wPKYdcuuFT-WDRlVafHLVZeCY7ViFxZrREIiu6RtQQVw6wV5q0fBz9XTxiP8Pxb6F9XDD7QQHq8R3wXA/s1600/SQL.jpg" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
Back when we were deciding on an SQL flavor at <a href="http://www.zervant.com/">Zervant</a>, the fact that MySQL didn't have <code>WINDOW</code> functions was a deal breaker. We had plenty of MySQL experience and not that much in PostgreSQL but I had to go with PostgreSQL anyway - just because of <code>WINDOW</code> functions. But that's about to change, as MySQL <a href="https://mysqlserverteam.com/mysql-8-0-2-introducing-window-functions/">is getting this support soon</a>, too.<br />
<br />
Imagine you need to calculate the number of active users you have on any given day. Now imagine that the way you have defined an active user is as someone who has done <i>something</i> on your service within e.g. 30 days. The <i>something</i> might be e.g. uploaded something, downloaded something, made a purchase, edited their profile - doesn't matter. And imagine you want to split by dimensions.<br />
<br />
Without <code>WINDOW</code> functions you don't have good options. You can accomplish this either with subqueries, which get very expensive:
<br />
<pre><code>SELECT t.id,
t.count,
(SELECT SUM(t2.count) FROM t as t2 WHERE t2.id <= t.id) AS cumulative_sum
FROM TABLE t
ORDER BY t.id</code></pre>
or MySQL variables, which cannot be used in a <code>VIEW</code>:
<br />
<pre><code>SELECT t.id,
t.count,
(@running_total := @running_total + t.count) AS cumulative_sum
FROM TABLE t
JOIN (SELECT @running_total := 0) r
ORDER BY t.id</code></pre>
There might be a way to use a stored procedure and a table, but this gets complicated to implement.<br />
<i>Code credits: <a href="https://gist.github.com/MagePsycho/431233c15e1e0d1532376dcc1b8102f0">MagePsycho</a>.</i><br />
<h2>
Window functions to the rescue</h2>
<div>
With <code>WINDOW</code> functions, MySQL takes a step towards being a viable alternative for an analytics database. With a <code>WINDOW</code> function you can easily calculate a running total over any dimensions. Placing this logic in a <code>VIEW</code> makes reporting even easier:<br />
<pre><code>SELECT date, SUM(COUNT(*)) OVER w
FROM active
GROUP BY date
WINDOW w AS (PARTITION BY 1 ORDER BY date)</code></pre>
Simple and effective - and enables the kind of calculations you have to do daily for analytics.
<br />
<br />
If I had to make the choice between MySQL and something like PostgreSQL for an analytics database, this change would make me reconsider. How about you?</div>
Ilkka Peltolahttp://www.blogger.com/profile/02548898470809137038noreply@blogger.com0