Principled Data Science Workflow: Clarify the business needs

As a follow-up to my article about the most important step, I want to talk more about why a good understanding of the business is the most crucial part of developing a solution.

Clarifying the business needs has some immediate advantages:

It guarantees that we answer the right question
It makes it easier to present the solutions in the same language that business people speak and understand well
It ultimately makes it easier to get stakeholder buy-in

The success of Data Science depends on how much stakeholder buy-in exists for data-related solutions. Stakeholders often cannot tell the difference between a good and a bad technical implementation of a model, but they can tell whether the presenter understands the business well or poorly. It’s easier for them to trust your results if they feel like you understand the business.

"If I only had an hour to chop down a tree, I would spend the first 45 minutes sharpening my axe."
- Lincoln, who would've been a very slow lumberjack or developer

Even from a technical perspective, it is important that all goals are aligned (algorithm <-> business question <-> KPI of the department, i.e. how the success of the department is measured).

Understanding the business has a big impact on the implementation:

Stating success criteria: When is the solution good enough, how does it translate into business value, and is a comparison to a benchmark possible?
Defining guardrail metrics: metrics that should not be affected negatively while testing or using the model
Loss function: What is the business impact of false positives vs. false negatives, how much deviation is bad, and which general metrics fit the business intuition best?
Whatever metric we focus on, will the focus on this metric lead to unwanted incentives? (Goodhart’s Law)
Human-in-the-loop: How can a human override the model if necessary? Is it even necessary?
Look at available data: Which data do I need for it? (Business knowledge is useful here, and a business person can help.)
Which data is available and are there any special things to consider? (Data analysts know a lot about the data, and data engineers know about potential quirks in the ETL pipeline.)
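To make the loss function and success criteria points more concrete, here is a minimal sketch of how the business cost of false positives and false negatives could drive the choice of a classification threshold, and how the result can be compared to a benchmark. The cost values, the toy data, and the function names (business_cost, best_threshold) are made up for illustration; in practice the costs would come from the business side.

```python
from typing import Sequence

# Hypothetical business costs per case (assumptions, to be agreed with stakeholders).
COST_FALSE_POSITIVE = 5.0    # e.g. an unnecessary manual review
COST_FALSE_NEGATIVE = 100.0  # e.g. a missed fraud case


def business_cost(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """Total cost when every case with score >= threshold is flagged as positive."""
    cost = 0.0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            cost += COST_FALSE_POSITIVE
        elif predicted == 0 and label == 1:
            cost += COST_FALSE_NEGATIVE
    return cost


def best_threshold(y_true: Sequence[int], scores: Sequence[float],
                   candidates: Sequence[float]) -> float:
    """Return the candidate threshold with the lowest total business cost."""
    return min(candidates, key=lambda t: business_cost(y_true, scores, t))


# Toy validation data, invented for this example.
y_true = [0, 0, 1, 0, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.6, 0.8, 0.2, 0.55, 0.05]
candidates = [0.3, 0.5, 0.7]

chosen = best_threshold(y_true, scores, candidates)
model_cost = business_cost(y_true, scores, chosen)

# Benchmark: flag nothing at all, i.e. the process without a model.
baseline_cost = business_cost(y_true, scores, threshold=1.1)

print(f"threshold={chosen}, model cost={model_cost}, baseline cost={baseline_cost}")
```

Because a missed positive is assumed to be 20 times as expensive as a false alarm, the lowest threshold wins here even though it produces more false positives. With a different cost ratio the choice could flip, which is exactly why these numbers need to be clarified with the business first.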

If all of this is handled well, you can in theory create almost the entire final presentation before having trained any model or written any line of code! A general outline usually looks like this:

State business problem
How are we trying to solve it? (Mapping business problems to quantifiable statistics / KPI for model/analysis, how a model will be deployed to solve it)
Assumptions we made along the way (different kinds of loss in classification, data assumptions, missing important variables etc.)
Solution, showing metrics and KPIs in pretty graphs
