A Data Science Project Quick-Start Guide

Posted on Mon 07 March 2022 in Data Science, Project Planning, Productivity, Business Value, Business Goals, Data Products

Why care about how to start a Data Science project?

When new to Data Science, starting a project often means wanting to be as innovative as possible and apply the newest and fanciest tech and Data Science solutions. By focusing only on the trendy technological, statistical, and machine learning parts of Data Science, one might overlook business value and business goals - both of which are essential to the holistic approach Data Scientists in industry should take.

Ignoring the purpose, the context, and success metrics beyond the model validation metrics themselves - which may differ a lot from business success metrics - can easily lead to project outcomes that are ineffective and of little benefit to the business.

In the following we'll quickly go over some hard-won lessons to make the transition from a nebulous problem to a usable prototype easier, without repeating the usual starting mistakes.

Understanding the purpose and context

One should start by understanding the purpose - the "why" behind each Data Science project. Asking one of the following questions can be very helpful:

  • What is the expected customer or business benefit?

  • What's wrong with the way things are now, and is it worth improving them by using data?

  • Why is it important to solve exactly this issue now and not something else?

Imagine working in e-commerce: a product manager has heard of fancy AI and approaches you to build a product recommendation engine.

You might start by asking: "What's the expected benefit?"

"Customers will find products easier and we increase engagement", she replies.

One can then follow up with: "How do we define engagement? Clicks? Purchases?"

How we define engagement will determine the features used for training (e.g. user interactions in the form of clicks or purchases). If purchases are the goal, we should distinguish between conversion and revenue. Optimizing for conversion is relatively straightforward - we can train on sales and predict a purchase probability. If the goal is revenue, we might want to additionally take the price into account and weigh purchase features by item price, similar to how YouTube weighs videos by watch time.
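
As a minimal sketch of this idea - assuming a hypothetical flat interaction log with event and price columns - the revenue objective can be expressed as per-sample weights that most training libraries accept directly:

```python
import pandas as pd

# Hypothetical interaction log: one row per user-item event.
interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [10, 11, 10, 12],
    "event":   ["click", "purchase", "purchase", "click"],
    "price":   [19.99, 49.99, 19.99, 9.99],
})

# Conversion objective: a simple binary label per interaction.
interactions["label"] = (interactions["event"] == "purchase").astype(int)

# Revenue objective: weigh purchase events by item price so that
# expensive items contribute more to the training loss.
interactions["sample_weight"] = 1.0
is_purchase = interactions["event"] == "purchase"
interactions.loc[is_purchase, "sample_weight"] = interactions.loc[is_purchase, "price"]

# Most libraries accept these weights directly, e.g.
# model.fit(X, interactions["label"], sample_weight=interactions["sample_weight"])
```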

Additionally, one should try to get as much context as possible. Imagine building a feature store database. One could build one that will be used by the entire organization, or one that is used by a single application only. The former will need to be robust, scalable, and well documented, while the latter can be a hacky solution built with a fraction of the effort and time. Understanding the context enables us to scope the solution appropriately. From the beginning one should roughly know what is needed and what should be included in the solution.

Definition of requirements, constraints, and metrics

The question here is: what should be achieved for the project to be a success? One should describe it from the customer's or business's point of view. What's their benefit from using Data Science?

Business Requirements & Constraints

If one is applying time series forecasting, there might be requirements to detect outliers and anomalies, e.g. with regard to loss minimization. If one is building a recommendation engine, requirements could include increasing the number of clicks and/or attributed purchases on the recommendation widget. If one is automating a manual categorization process, one will define targets for the proportion of products automatically categorized with high confidence and for the manpower saved.

Requirements can also easily be framed as constraints. What must our Data Science/Machine Learning system not do? If e.g. an insurance fraud detector requires manual investigation for each flagged claim, we might constrain the number of false positives, a proxy for wasted effort, to be less than 25%. If, on the other hand, one wants to introduce new, cold-start products into our recommendations, one might set a constraint that overall conversions must not drop by more than a predefined percentage.
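
For the fraud example above, such a constraint translates directly into a decision threshold. A minimal sketch - assuming hypothetical validation labels and model scores - could pick the lowest threshold that keeps false positives among flagged claims under 25%:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, y_scores, max_false_positive_share=0.25):
    """Lowest score threshold whose precision among flagged claims keeps
    the share of false positives below the given limit (here: < 25%)."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # thresholds has one entry less than precision, so align them.
    ok = precision[:-1] >= 1.0 - max_false_positive_share
    if not ok.any():
        raise ValueError("No threshold satisfies the false-positive constraint")
    # The lowest qualifying threshold keeps recall as high as possible.
    return thresholds[ok][0]

# Hypothetical validation labels and model scores:
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.7])
print(pick_threshold(y_true, y_scores))
```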

Generally, one can keep in mind that Data Science can help to generate business value in areas such as the following:

  • Market share / customer growth and retention

  • Revenue growth / increase sales, higher margins

  • Cost efficiency / reduce cost-income ratio

  • Loss minimization / less waste, avoid fraud, avoid fines

  • Capital optimization / increase ROI

Technical Requirements and Constraints

It is also very helpful to consider production requirements, even though we're just starting the project. If engineering has a requirement on latency/speed, one might not consider techniques that are extremely costly to deploy at scale. There could also be resource constraints. Not having a real-time feature store in place (yet) would preclude live session-based recommendations, while not having the budget for a GPU cluster may mean starting with a simple model is the way to go. I discussed a very effective and computationally inexpensive recommendation engine model in another blog post.

While these constraints may be limiting and sometimes frustrating when the newest and fanciest tech is not asked for, they clearly help by narrowing our investigation phase, saving us the unnecessary effort of considering overly fancy solutions that couldn't be used in the end anyway. Remember: in the end it is about bringing Data Science or Machine Learning into production and generating business value. Clearly defined constraints free us to do anything except breach those constraints, empowering us to innovate. You'll be surprised how much the team can do with a frugal mindset.

Success Metrics

To measure how one is doing on the requirements, one needs a set of metrics. This may require us to internally test our own data solution to understand how the customer experiences it. Setting up a proper success metric is crucial - here one should keep in mind that the model validation metrics themselves often differ a lot from the business success metrics.

Digging into the data early enough to see what's immediately possible without further data wrangling and building a baseline model

Ideally, one is able to apply this step before the requirements are finalized. I would even say it is a must. Exploring the data might reveal the proposed requirements to be too much of a stretch.

Data Quality and Accuracy

Assume one is asked to build a product classifier that categorizes products based on title and some metadata. While inspecting the data, we find a portion of existing products that have the same metadata and title but are assigned to different categories. In this case, instead of directly starting the modeling work and achieving poor accuracy because of inconsistent labeling, one should start the project with an initial phase of label cleaning and refactoring the product category tree.
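
Such label conflicts are cheap to detect before any modeling. A minimal sketch, assuming a hypothetical catalog extract with title, brand, and category columns:

```python
import pandas as pd

# Hypothetical product catalog extract.
products = pd.DataFrame({
    "title":    ["usb cable 1m", "usb cable 1m", "desk lamp", "desk lamp"],
    "brand":    ["acme", "acme", "lumo", "lumo"],
    "category": ["electronics", "accessories", "home", "home"],
})

# Group identical (title, metadata) combinations and count distinct labels.
label_counts = products.groupby(["title", "brand"])["category"].nunique()

# Products whose duplicates were assigned to different categories:
conflicts = label_counts[label_counts > 1]
print(conflicts)  # flags "usb cable 1m" / "acme" with 2 conflicting categories
```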

Baseline Model

Another huge game changer is to implement a quick and maximally simple ML baseline model as part of data exploration. How quick? A day or two. The baseline may provide information on potential challenges in achieving target metrics.

For example, stakeholders may have an initial requirement for a fraud detection model, e.g. with boosted decision trees (CatBoost), to achieve >95% recall and precision. However, one's baseline model is only able to achieve ~70% recall and precision. While closing the gap between 70% and 95% may not be impossible, it could be a very challenging and time-consuming effort. Thus, we can manage expectations earlier and adjust target metrics, as well as make trade-offs between recall and precision (a common issue when working with real-world data).
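
A baseline of this kind does not need to be elaborate. A minimal sketch, using synthetic stand-in data instead of the real claims table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in data; in practice this would be the real claims table.
X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A deliberately plain model with default parameters - the point is a
# fast reality check on the target metrics, not a tuned solution.
baseline = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print(f"precision: {precision_score(y_test, y_pred):.2f}")
print(f"recall:    {recall_score(y_test, y_pred):.2f}")
```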

Consulting domain experts, open-source code, and papers

One shortcut to getting up to speed on unfamiliar problems is to observe how others already solved it. Practical Data Science is mostly about applying existing solutions - not reinventing the wheel.

Automation of Heuristics

Trying to automate a manual process (e.g., insurance claims fraud detection)? One could sit with the investigators, learn the heuristics that have guided their process so far, and turn those heuristics into features for machine learning (a sketch follows below). Need to solve an unfamiliar machine learning problem? Read papers and tech blogs on how others have done it. Want to try a new algorithm or model? Search on GitHub.
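
As a toy illustration of turning heuristics into features - the claim fields and thresholds here are purely hypothetical:

```python
import pandas as pd

# Hypothetical claims table with fields the investigators actually look at.
claims = pd.DataFrame({
    "claim_amount":    [1200.0, 15000.0, 300.0],
    "policy_age_days": [400, 12, 900],
    "previous_claims": [0, 3, 1],
})

# Encode the investigators' rules of thumb as explicit model features,
# e.g. "a large claim shortly after signing a policy is suspicious".
claims["high_amount"] = (claims["claim_amount"] > 10_000).astype(int)
claims["new_policy"] = (claims["policy_age_days"] < 30).astype(int)
claims["repeat_claimant"] = (claims["previous_claims"] >= 2).astype(int)
```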

Standardization and Automation of the Experiment Pipeline

Most data scientists do their early experiments and prototyping in Jupyter notebooks. To prevent them from getting too messy, it can be very helpful to refactor these notebooks weekly. Code snippets commonly used across notebooks are refactored into functions or modules instead of copy-pasting code cells. Manual steps and commented-out cells - which will be forgotten after a few weeks anyway - should also be pruned. This results in a notebook that can run from start to finish, without any manual intervention, with metrics and plots at the end.
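
For example, a data-fetching snippet that is copy-pasted across notebooks could be moved into a small shared module (the path, table, and query here are hypothetical):

```python
# utils/data.py - a shared module refactored out of several notebooks.
import pandas as pd
from sqlalchemy import create_engine, text

def fetch_orders(connection_uri: str, since: str) -> pd.DataFrame:
    """Load recent orders once, instead of copy-pasting the query
    into every notebook."""
    engine = create_engine(connection_uri)
    query = text("SELECT * FROM orders WHERE order_date >= :since")
    return pd.read_sql(query, engine, params={"since": since})
```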

With time one also learns to automate as much as possible - building proper functions and modules for similar logic, e.g. data fetching, model training, and model inference, makes things a lot easier. For tracking metrics and versioning models, MLflow is free and easy to use. I wrote about setting up a local ML-workbench with MLflow for experimentation here.
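
A minimal tracking sketch with MLflow, using a public toy dataset in place of real project data, could look like this:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="baseline-logreg"):
    params = {"C": 1.0, "max_iter": 1000}
    model = LogisticRegression(**params).fit(X_train, y_train)

    # Track parameters, metrics, and the model artifact for later comparison.
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```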

Final Conclusions

If one considers all the above before heading blindly into the next Data Science project, one will have a good understanding of the business purpose, requirements, and constraints. One will be able to better assess time and labor, potential bottlenecks, and the business value of one's Data Science projects and products. One will also immediately get a feeling for the data, a set of initial papers and code to explore, and an experiment pipeline to iterate quickly. We can then dive deeper into the data and try more sophisticated techniques, confident that we are solving the right problem and able to deploy a usable solution.