Data maturity: key to successful AI projects
Senior Data Scientist and member of the Scientific Community
Louis de Vandière
Posted on: 13 June 2019
If artificial intelligence (AI) is the newest buzzword, then data maturity is what keeps the buzz going. To succeed with any AI strategy, data scientists and senior management need to assess the company’s data maturity before embarking on an AI journey.
Too often, business partners start by choosing a model or technology. This is no different from building a house by starting with the roof. Most of the problems encountered in AI projects arise before the modeling phase starts and are directly related to the data. Difficulty extracting a dataset, or discovering along the way that some of the information is missing or of poor quality, is a recurrent issue that causes major delays and sometimes delivery failure.
Machine learning solutions depend heavily on data, making data accessibility, quality and pertinence key factors in determining both the feasibility and the success of any AI project.
To determine data maturity, break down what is known into three levels: data, information and knowledge. First, understand that there is a hierarchical connection between the three terms: knowledge is acquired from information, and information is acquired from data. Data is key. But does the data contain relevant information, and can that information be converted into knowledge?
The Oxford dictionary defines data as facts and statistics collected together for reference or analysis. The important word here is collected, meaning the data is actively gathered and stored. This is the very first step in the assessment. Is data collected? If yes, how is it collected? Where is it stored? Where does it come from? How do we access it? What type of data is it, structured or unstructured? The answers to questions like these help assess how accessible and processable the data is.
If any of these questions has no clear answer, then the data is not ready to be used in an AI project. However, it is the right time to plan a data strategy and pipeline, and many tools are available to help. At this stage, there is plenty of room to try different solutions before settling on an architecture. The objective is to be able to collect available data and transform it into valuable, exploitable information.
When designing and implementing a machine learning model, some forget that poor accuracy does not necessarily come from the type of model chosen or the parameters used. Sometimes the information is simply missing or incomplete, which makes it difficult for any type of machine learning model to learn from the data.
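The collection questions above can be turned into a simple readiness checklist. The following is a minimal sketch, not a prescribed tool: the `DataSourceAssessment` structure, its field names and the example answers are all hypothetical, chosen only to mirror the questions in the text. It applies the rule stated above, that a single unanswered question means the data is not ready.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DataSourceAssessment:
    """Answers to the collection questions for one data source (hypothetical structure)."""
    collection_method: str = ""  # how is the data collected?
    storage_location: str = ""   # where is it stored?
    origin: str = ""             # where does it come from?
    access_method: str = ""      # how do we access it?
    data_type: str = ""          # "structured" or "unstructured"


def unanswered_questions(assessment: DataSourceAssessment) -> List[str]:
    """Return the names of the questions that have no clear answer yet."""
    return [f.name for f in fields(assessment)
            if not getattr(assessment, f.name).strip()]


def is_data_ready(assessment: DataSourceAssessment) -> bool:
    """The data counts as ready only if every question has an answer."""
    return not unanswered_questions(assessment)


# Made-up example: nobody has documented how to access the data yet.
sensor_logs = DataSourceAssessment(
    collection_method="nightly batch export",
    storage_location="data warehouse",
    origin="factory sensors",
    data_type="structured",
)
print(unanswered_questions(sensor_logs))  # ['access_method']
print(is_data_ready(sensor_logs))         # False
```

The list of open questions is arguably more useful than the yes/no verdict, since it shows where the data strategy needs work first.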
The second step in the data assessment process is to ensure that the information expected in the data is present and aligned with the business reality. To accomplish this, a data exploration analysis needs to be conducted, as well as interviews with the main stakeholders, to ensure both the quality and reliability of the data.
Take the example of a sensor measuring temperature. What does the temperature mean? Is the temperature inside or outside? Is it in Fahrenheit, Celsius or Kelvin? Answering these questions provides context that allows data scientists to set expectations and implement meaningful quality checks. This ensures that relevant information is present in the raw data and enables the third and last part of the data maturity assessment process: can we use this information to acquire knowledge?
Once we have identified the available information, we can start envisioning how it can be used to produce knowledge. How? The process of gaining knowledge from the data involves different people with different skill sets and mindsets. People from the technical and business sides need to work together to understand the data and discuss what is and is not possible. Usually, the discussion is driven by one or more business use-cases, and we debate whether we have the necessary knowledge to solve them.
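To make the temperature example concrete, here is a minimal sketch of a context-aware quality check. It assumes the unit and the location of the sensor have been documented by the stakeholders. The function name `check_reading` and the plausible ranges are illustrative assumptions, not values from any real system.

```python
from typing import Optional

# Conversion to Celsius for each unit named in the text.
TO_CELSIUS = {
    "celsius": lambda v: v,
    "fahrenheit": lambda v: (v - 32.0) * 5.0 / 9.0,
    "kelvin": lambda v: v - 273.15,
}

# Assumed plausible ranges in Celsius; real bounds would come from stakeholders.
PLAUSIBLE_RANGE_C = {
    "inside": (5.0, 40.0),
    "outside": (-50.0, 60.0),
}


def check_reading(value: float, unit: str, location: str) -> Optional[str]:
    """Return None if the reading passes, else a short description of the problem."""
    if unit not in TO_CELSIUS:
        return f"unknown unit: {unit!r}"
    if location not in PLAUSIBLE_RANGE_C:
        return f"unknown location: {location!r}"
    celsius = TO_CELSIUS[unit](value)
    low, high = PLAUSIBLE_RANGE_C[location]
    if not low <= celsius <= high:
        return (f"{celsius:.1f} C is outside the expected range "
                f"[{low}, {high}] for an {location} sensor")
    return None


# The same raw number passes or fails depending on the documented unit.
print(check_reading(72.0, "fahrenheit", "inside"))  # None (about 22.2 C)
print(check_reading(72.0, "celsius", "inside"))     # flagged as implausible
```

Without the context (unit and location), neither check could be written, which is the point of this second step.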
Let’s take the example of sales prediction. In general, it is quite easy to validate access to a sales history dataset and its trustworthiness, but a few more verifications are still required. For example, do we have enough history? Do we have recurring patterns (seasonality) in the data? Even so, sales history alone is often not enough to predict sales accurately. Unusual sales peaks can occur when marketing runs promotions, and low sales can be explained by a competitor launching a new product. Without this information, a machine learning solution would not be optimal.
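The first two verifications (enough history, presence of seasonality) can be sketched with the standard library alone. This is one possible approach under stated assumptions: monthly data, a 12-month season, at least two full cycles of history, and a sample autocorrelation threshold of 0.5. None of these values comes from the article, and the sales series below is synthetic. Promotions and competitor events need additional data sources and are not covered here.

```python
import math
from typing import List


def has_enough_history(series: List[float], season_length: int = 12,
                       min_cycles: int = 2) -> bool:
    """Assumed rule of thumb: require at least `min_cycles` full seasons."""
    return len(series) >= season_length * min_cycles


def autocorrelation(series: List[float], lag: int) -> float:
    """Sample autocorrelation at the given lag (0.0 for a constant series)."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0
    covariance = sum((series[i] - mean) * (series[i + lag] - mean)
                     for i in range(n - lag))
    return covariance / variance


def looks_seasonal(series: List[float], season_length: int = 12,
                   threshold: float = 0.5) -> bool:
    """Flag seasonality when values one season apart are strongly correlated."""
    if not has_enough_history(series, season_length):
        return False
    return autocorrelation(series, season_length) >= threshold


# Synthetic data: three years of monthly sales with a yearly sine-shaped cycle.
sales = [100 + 30 * math.sin(2 * math.pi * month / 12) for month in range(36)]
print(has_enough_history(sales))  # True
print(looks_seasonal(sales))      # True
```

On real data, a dedicated time series library would usually replace the hand-written autocorrelation, but the questions being asked remain the same.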
More generally, if there is not enough knowledge to solve the business use-case, we need to look back at the available information and data, possibly restarting the data maturity process. At this stage, the data is not fully AI-ready for the given use-case, but data limitations and risks are known, and expectations are set on the outcomes of a project.
On the other hand, if there is enough knowledge, the data is AI-ready. Based on this knowledge, it is now possible to define the use case more precisely, for example by:
- Identifying end-users and how they are going to interact with the solution. A very well-built model with good performance will become useless if the end-users do not trust it or do not know how to use it.
- Agreeing on the interpretability of the model (auditability).
- Agreeing on measurable success criteria to align all the stakeholders on the same objectives.
To conclude, we have walked through the three stages for measuring the maturity of the data for an AI project:
- Data: Ensure existence, collection and accessibility of the data.
- Information: Augment the data by processing it and giving it technical context.
- Knowledge: Validate that the information is pertinent to a business use-case and of high enough quality to be fed to one or more machine learning models.
A shortfall in any of these three stages will greatly impact what can be delivered, so it is important to gauge early on how a business’s data compares to these ideals. In the end, it is about ensuring we have solid foundations on which to build a successful AI journey.