Notes for a first data science project

  • Writer: Sixtine Vervial
  • Aug 20, 2019
  • 3 min read

"Let's hire a data scientist and do some machine learning!" is a wonderful aspiration, and a very common one among startup founders today. Unfortunately, without the backing of a serious product development methodology, a data science intern's work will often end up in cold storage. Here are a few tips to make sure your code makes it to production.


Objective

Innovation lies more in how data analysis and learning techniques are used than in developing those techniques. The product development methodology should therefore be your first focus; the choice of packages comes later. Because you will be using more than machine learning to build your tool, I would talk about "the construction of a model" or "the combination of several algorithms" rather than using the term "machine learning" outright.


Data input

The selection and cleaning of the source data will be key to the project. I recommend that you aim to generate a large dataset (at least thousands of rows) and make the columns (attributes) as discriminating as possible.


Define success

Keep in mind that model development should be driven by a specific business case. In that regard, start by defining a first KPI for assessing the performance of the feature (and the model). Additional success measurements might come into play later on.


Data mining / trends searching / statistical preliminary research

(All of these mean the same to me.) This is the step where you take a first look at your data, examine the possible combinations of attributes against your performance KPI, and figure out whether the training set is well suited (in a mathematical sense) to producing a robust model. The distribution of attribute values, dataset balance and variable importance are among the things to investigate before choosing an algorithm, because each algorithm has its own requirements. I also recommend using this analysis to shape a second version of the input data, once you have identified the attributes that produce interesting patterns. In practice, this is done with basic Excel exploration, plotting diagrams and the like.
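As an illustration of the checks described above, here is a minimal sketch in plain Python rather than Excel. The toy rows, the column meanings (plan, sessions per week, converted) and the helper names are all hypothetical, chosen only to show what "dataset balance" and a first look at an attribute against the KPI can look like.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy rows: (plan, sessions_per_week, converted). Not real data.
rows = [
    ("free", 1, 0), ("free", 2, 0), ("free", 5, 1), ("free", 1, 0),
    ("paid", 4, 1), ("paid", 6, 1), ("paid", 2, 0), ("free", 3, 0),
]


def class_balance(rows, label_index=-1):
    """Return the share of each label value, to spot an unbalanced dataset."""
    counts = Counter(row[label_index] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def kpi_by_attribute(rows, attr_index, label_index=-1):
    """Return the mean KPI per attribute value, a rough hint of its usefulness."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[label_index])
    return {value: mean(labels) for value, labels in groups.items()}


balance = class_balance(rows)
by_plan = kpi_by_attribute(rows, attr_index=0)
print("label shares:", balance)
print("conversion rate by plan:", by_plan)
```

If one label dominates the shares, or an attribute shows almost the same KPI for every value, that is worth knowing before picking an algorithm.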


Choosing your algos

Even though research is booming in every direction of artificial intelligence and machine learning, make sure to stick to basic models for your first production run. Even if you plan on using plug-and-play tools to develop your model, it is important to have a rough idea of the underlying mathematics. Identify whether your challenge is supervised (e.g. class prediction) or unsupervised (e.g. clustering), and look up useful notions such as entropy and the cold start problem.
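To make one of these notions concrete, here is a small sketch of Shannon entropy computed over a list of class labels. The function name and the example labels are illustrative assumptions; the formula itself is the standard one, H = -sum(p * log2(p)) over the label frequencies.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A perfectly mixed set of two classes is maximally uncertain.
balanced = shannon_entropy(["yes", "no", "yes", "no"])
# A set with a single class carries no uncertainty.
pure = shannon_entropy(["yes", "yes", "yes", "yes"])
print(balanced, pure)
```

Decision trees, for instance, pick splits that reduce this quantity, which is one reason a rough grasp of it helps even when using off-the-shelf tools.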


Feedback loop

Your KPI is defined. Now, working out how you will collect it from the user in the app (visuals, question wording, timing) is a crucial step in introducing as little bias as possible into this metric. The "machine learning" part itself (improving the model with new data every day) is "just" a technical implementation, nothing crazy to worry about. The integration of your model's inputs and results into your product, however, is a game-changer.


Test design

Because the first model won't be the best one, and you will want to test many methods to improve user satisfaction, designing proper tests is another interesting step. For A/B testing, would you present different methods to different user groups? (Think carefully about your audience definitions, the length of the test and the statistical significance of the results.) Or would you combine recommendations from different methods for each user and make the methods compete with each other?
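For the statistical significance part, one common option (an assumption here, not the only valid test) is a two-proportion z-test comparing the success rate of two user groups. The sketch below uses only the standard library; the function name and the group sizes and counts are made up for illustration.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two success rates.

    Returns (z, p_value). Assumes independent groups and samples large
    enough for the normal approximation to hold.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both methods perform equally.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value of the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: method A satisfied 120 of 1000 users, method B 150 of 1000.
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A result this close to the usual 0.05 threshold is a good reminder to fix the test length and sample size before looking at the numbers.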



>> As a freelance data scientist, I offer to mentor data science projects through weekly progress assessment and next steps design. Get in touch today!





 
 
 
