
[meetup] The freelance toolkit

Almost a year into travelling from one data challenge to the next, I was invited to speak at Berlin Digital Analytics, hosted in MHP's Berlin office. I presented what I perceive to be the most recurrent tasks every data team has to accomplish on the road to data-driven decisions.


The data structure of a project evolves with its growth. Starting with a business analyst and Google Analytics is reasonable until access to raw data is needed. Custom trackers are then defined, and a data engineer starts designing a common place where all data sources land, cleaned and enriched: the data warehouse. After covering visual reporting, alerting and structures for deep-dive analysis, the data can be used to enhance other internal services (CRM, user profiles, recommendations).




THE GOLDEN RULES

 

Tracking is key. Without event tracking, personalisation will be limited by the compatibility between your analytical tools and your application. Set it up as early as possible so that you have historical records to analyze. Snowplow Analytics offers a great managed pipeline.
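As a rough illustration of what custom-tracked events can look like once they land in the warehouse, here is a minimal sketch; the table name and columns are my own assumptions, not Snowplow's actual schema.

-- Hypothetical landing table for raw tracked events (names are illustrative).
CREATE TABLE atomic_events (
    event_id          VARCHAR(36)   NOT NULL,  -- unique id generated by the tracker
    collector_tstamp  TIMESTAMP     NOT NULL,  -- when the event reached the collector
    user_id           VARCHAR(64),             -- authenticated user, if any
    anonymous_id      VARCHAR(64),             -- device / cookie identifier
    event_name        VARCHAR(128)  NOT NULL,  -- e.g. 'page_view', 'sign_up'
    page_url          VARCHAR(2048),
    user_agent        VARCHAR(1024),
    ip_address        VARCHAR(45),
    properties        VARCHAR(65535)           -- raw JSON payload, enriched later
);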

Translate your challenge into KPIs. We are scientists; feelings make us feel weird. Articulating business questions as plain SQL can be challenging, but clear dimensions and metrics make the translation easier. A/B tests are cool too.
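For example, a fuzzy question like "is sign-up doing well?" becomes a metric over clear dimensions. A minimal sketch against the hypothetical atomic_events table above:

-- Weekly sign-up conversion: share of visitors who signed up within the week.
SELECT
    DATE_TRUNC('week', collector_tstamp)                    AS week,
    COUNT(DISTINCT anonymous_id)                            AS visitors,
    COUNT(DISTINCT CASE WHEN event_name = 'sign_up'
                        THEN anonymous_id END)              AS signed_up,
    ROUND(COUNT(DISTINCT CASE WHEN event_name = 'sign_up'
                              THEN anonymous_id END)::DECIMAL
          / NULLIF(COUNT(DISTINCT anonymous_id), 0), 4)     AS conversion_rate
FROM atomic_events
GROUP BY 1
ORDER BY 1;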

Nobody puts the data scientist in the corner. Data Engineers put the data together, Data Analysts look into the nice, clean data, Data Scientists build applications to feed the data back into a product. Some of us like to mix things up, but beware of project scopes.

Keep it simple. We are serving a business, not running for the Fields Medal. Sometimes a quick-and-dirty Integromat hook solves in an hour a problem that no random forest can. Despite all the mathematicians' voices calling for advancing the state of the art, keep the focus on the business value of your development time.




THE FREELANCER TOOLKIT

 

Having built four data warehouses on three continents this year, I might have re-used some code here and there, and I hope to spare those reading this post a bit of time.


Flatten metadata

An application (backend) schema is usually highly normalised: many small entities that can only be linked through a painfully long list of complex joins. Analysts are happier with flat dimension and fact tables. Keeping the logic of the table joins in the ETL makes reporting easier, as sketched below.
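A minimal sketch of the idea, assuming hypothetical backend tables orders, users and products; the ETL materialises the joins once so that every report does not have to rewrite them.

-- Hypothetical flat fact table built once in the ETL instead of in every report.
CREATE TABLE fact_orders AS
SELECT
    o.order_id,
    o.created_at,
    o.amount,
    u.user_id,
    u.country          AS user_country,
    u.signup_channel   AS user_signup_channel,
    p.product_id,
    p.category         AS product_category
FROM orders   o
JOIN users    u ON u.user_id    = o.user_id
JOIN products p ON p.product_id = o.product_id;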


Define page type

Moving further with centralising business definitions, it is recommended to enrich the source data with business attributes. Definitions can be stored in UDFs (user-defined functions, available in SQL or Python on Redshift for instance) and maintained in a single place.
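On Redshift, such a shared definition can live in a Python UDF; the URL patterns below are made up for illustration.

-- Redshift Python UDF: one central definition of "page type".
CREATE OR REPLACE FUNCTION f_page_type(url VARCHAR)
RETURNS VARCHAR
STABLE
AS $$
    # Patterns are illustrative placeholders, not a real routing scheme.
    if url is None:
        return 'unknown'
    if '/product/' in url:
        return 'product_detail'
    if '/checkout' in url:
        return 'checkout'
    if '/blog/' in url:
        return 'blog'
    return 'other'
$$ LANGUAGE plpythonu;

-- Then, anywhere in reporting or ETL:
-- SELECT f_page_type(page_url) AS page_type, COUNT(*) FROM atomic_events GROUP BY 1;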


Exclude internal traffic

Listing and excluding IP ranges or application IDs from all analyses is crucial, especially when automated tests are in place. Reliable user-agent patterns are available all over the internet for excluding bot traffic as well.
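A hedged sketch of such a filter, reusing the hypothetical atomic_events table from above; the IP ranges and patterns are placeholders to adapt to your own setup.

-- View that downstream reports query instead of the raw events.
CREATE VIEW events_external AS
SELECT *
FROM atomic_events
WHERE ip_address NOT LIKE '10.0.%'              -- placeholder: office / VPN range
  AND ip_address NOT IN ('203.0.113.7')         -- placeholder: CI servers running automated tests
  AND LOWER(user_agent) NOT LIKE '%bot%'
  AND LOWER(user_agent) NOT LIKE '%crawler%'
  AND LOWER(user_agent) NOT LIKE '%spider%';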


Sessionization

Be aware that Google Analytics has its own definition of a "session" and a complex attribution model. Unfortunately, this model is not easily replicable on custom-tracked events. In order to deep-dive into cohort analysis and LTV calculations, it is crucial to carefully design the marketing touchpoint classification and the channel attribution model.
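A common starting point on raw events is a gap-based session definition (30 minutes of inactivity, matching the GA default, but without its attribution logic). A sketch with window functions on the hypothetical atomic_events table:

-- Gap-based sessionization: a new session starts after 30 minutes of inactivity.
WITH ordered AS (
    SELECT
        anonymous_id,
        collector_tstamp,
        LAG(collector_tstamp) OVER (
            PARTITION BY anonymous_id ORDER BY collector_tstamp
        ) AS prev_tstamp
    FROM atomic_events
),
flagged AS (
    SELECT *,
           CASE WHEN prev_tstamp IS NULL
                  OR DATEDIFF(minute, prev_tstamp, collector_tstamp) > 30
                THEN 1 ELSE 0 END AS is_new_session
    FROM ordered
)
SELECT
    anonymous_id,
    collector_tstamp,
    SUM(is_new_session) OVER (
        PARTITION BY anonymous_id
        ORDER BY collector_tstamp
        ROWS UNBOUNDED PRECEDING
    ) AS session_number
FROM flagged;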


Reconcile data sources

Let's just skip this sad one for now and save it for a post about QA.




OTHER DIGITAL NOMAD THOUGHTS

 

Next step for this data journey: the emergence of data marketplaces?


How the European Union supports our start-up world with awesome projects


* Link to an EU-funded list of projects around data marketplaces





THE FIN

 


