Deep learning is the engine that powers the smarts behind artificial intelligence (AI). Through deep learning, AI mimics how the human brain learns by understanding data and uncovering hidden patterns at levels that transcend humans’ capabilities. To properly train AI systems, data that’s fed into deep learning models must be well documented and descriptive, and must cover a variety of patterns or possible representations.
Many managers, however, might not know whether they are working with quality data. As reported in the Harvard Business Review, one recent study found that only 3% of company data meets basic quality standards. This is a dire finding for organizations engaged in deep learning. When applied to AI systems, poorly structured or incomplete data will drive inaccurate results and might compound issues across the enterprise. If AI fails to deliver on its expectations, people will be less inclined to trust future results, potentially reducing further adoption of this powerful technology.
All enterprise applications — such as reporting, analytics and AI — that enable business decisions rely on quality data. Enforcing data rules and maintaining information that holds attributes of relevance and context is an ongoing challenge for organizations. Consider the example of a poorly defined data-entry process in which analysts enter single-word notes or, even worse, leave a “comments” field empty. This lack of complete data tagging could lead to erroneous results and loss of business-critical context.
To ensure optimal AI outcomes, we have identified five potential data bottlenecks, along with best practices to guard against each of them.
An insufficient amount of data will taint results.
Deep learning and pattern recognition are most practical when the model can access a significant amount of historical data. A limited dataset will result in insufficient patterns, which will skew and bias results and produce faulty outcomes.
When collecting data, instead of capturing only specific fields that are assumed critical for analysis, it is much better to collect and store all available features. The specific fields that will feed deep learning can be chosen later, when you’re building the model. As a caveat, ensure that historical data is stored for potential future use and is not scheduled to be purged at preset times. Retained historical data can be used later to support model training and retraining.
Data-entry shortcuts are meaningless.
The richer the data, the more reliable and usable the predictive outcomes. The deep learning process identifies patterns by using the context in which the data was created. To support pattern recognition in key data fields, avoid short one- to two-word descriptions or hints, such as months, project names or generic “issue” or “fix” references. Data serves a higher purpose than simply being a placeholder to fill fields. Rich data delivers high usability and immense predictive value.
To improve data usability, establish data-entry standards for important data fields, specifically for those that are free-text, such as “summary” or “description,” which might be essential for further interpretation. Treat data as though it holds the key to future predictions. Ensure standard procedures mandate a minimum number of words for key descriptive fields to ensure sufficient detail.
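As an illustration only, a minimum-word rule for key free-text fields could be checked at entry time along the following lines. The field names and thresholds here are assumptions chosen for the example, not values prescribed by any standard.

```python
# Hypothetical fields and minimum word counts; set these to match
# your own data-entry standards.
MIN_WORDS = {"summary": 5, "description": 10}


def find_short_fields(record, min_words=MIN_WORDS):
    """Return the key free-text fields that fall below their minimum word count."""
    short = []
    for field, minimum in min_words.items():
        # Treat a missing or empty field as zero words.
        text = record.get(field) or ""
        if len(text.split()) < minimum:
            short.append(field)
    return short


record = {
    "summary": "fix",
    "description": (
        "Restarted the billing service after the nightly batch job "
        "exhausted its database connections."
    ),
}
print(find_short_fields(record))  # the one-word summary is flagged
```

A check like this only measures length, not quality, so it is best treated as a first filter that sends thin entries back to the analyst for more detail.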
New trends and patterns hit the blind spot.
Deep learning models require continuous learning. By understanding incremental relationships and patterns, the models learn from their mistakes and course-correct. This is critical because data elements change over time. Therefore, it is important to view data gathering as an ongoing, repeatable and recurring process that enables continuous model training.
Laying the right groundwork can make all the difference. Establish a data-extraction process that captures incremental data, along with the true results. Use this data to retrain the model and ensure the model does not miss any new trends and patterns that may evolve over time.
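One possible shape for such an extraction step is sketched below. It assumes each record carries a hypothetical outcome_recorded_at timestamp that is set once the true result is known, and that the time of the previous extraction is stored as a watermark. Both names are illustrative.

```python
from datetime import datetime


def extract_incremental(rows, watermark):
    """Return rows whose true result was recorded after the watermark,
    together with the advanced watermark for the next run."""
    new_rows = [
        r for r in rows
        if r["outcome_recorded_at"] is not None
        and r["outcome_recorded_at"] > watermark
    ]
    # Advance the watermark only as far as the data actually extracted.
    new_watermark = max(
        (r["outcome_recorded_at"] for r in new_rows), default=watermark
    )
    return new_rows, new_watermark


rows = [
    {"id": 1, "outcome": "resolved", "outcome_recorded_at": datetime(2024, 1, 5)},
    {"id": 2, "outcome": "escalated", "outcome_recorded_at": datetime(2024, 2, 9)},
    {"id": 3, "outcome": None, "outcome_recorded_at": None},  # result not yet known
]
batch, watermark = extract_incremental(rows, datetime(2024, 1, 31))
print([r["id"] for r in batch], watermark)
```

Keying the watermark on when the outcome was recorded, rather than when the row was created, means a record whose result arrives late is still picked up by a later run and can feed retraining.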
True production data is missing in action.
Use true data from production systems from the initial AI exploratory phase all the way through to deployment. When building models, do not use sample or dummy rows. If you do, the model will likely fail when it encounters actual data from a production system. Also, set stakeholder agreements regarding data sharing to ensure production data is made available. Data compliance and security issues need to be addressed at the planning stage to avoid surprises down the line.
Disparate data runs rampant.
A well-defined data-gathering strategy is essential, especially if data is scattered across disparate systems. Data might exist in different formats and be duplicated across systems. This is especially true in cases where smaller projects create applications in silos.
To ensure data is fit for analysis and modeling, combine all relevant data that exists across systems, and remove redundant fields and entries through a de-duplication process. A data lake, which is a storage repository for vast amounts of raw data across sources, can be used as a starting point for analytical systems. Capture, consolidate and validate all data required for model building, and include internal experts and stakeholders in the data-gathering and planning processes.
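The consolidation and de-duplication step might look something like the minimal sketch below. It assumes records from different systems share a hypothetical ticket_id key that differs only in letter case and surrounding whitespace. Real systems usually need more careful matching rules.

```python
def consolidate(*sources, key="ticket_id"):
    """Merge records from several systems, keeping one entry per normalized key."""
    merged = {}
    for source in sources:
        for record in source:
            # Normalize the key so "inc-1" and "INC-1 " count as the same record.
            k = str(record[key]).strip().upper()
            existing = merged.setdefault(k, {})
            for field, value in record.items():
                # Keep the first non-empty value seen for each field.
                if existing.get(field) in (None, "") and value not in (None, ""):
                    existing[field] = value
            existing[key] = k
    return list(merged.values())


crm = [{"ticket_id": "inc-1", "summary": "Login page times out", "owner": ""}]
helpdesk = [
    {"ticket_id": "INC-1 ", "summary": "Login page times out", "owner": "ops"},
    {"ticket_id": "INC-2", "summary": "Report export fails", "owner": "bi"},
]
result = consolidate(crm, helpdesk)
print(result)
```

Here the duplicate ticket collapses into a single entry, and the owner left empty in one system is filled in from the other. Which source should win when two systems disagree is a decision for the internal experts and stakeholders involved in data gathering.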
To avoid these bottlenecks and streamline AI adoption, also consider creating new roles. A data curator can act as a custodian who defines strategies for data collection and upholds data capture standards while ensuring all data conformity mandates are met. Moreover, assign an AI specialist who identifies and certifies that AI can be applied to specific cases; evaluates AI readiness; validates data suitability; and designs AI road maps. These two roles will work collaboratively and with the entire team to accelerate data readiness.
All data stakeholders, application owners, and business and IT teams should understand that data quality is a priority. Identify data impediments, take remedial actions, ensure proper oversight and set data-entry ground rules. A comprehensive data hygiene program is required to help organizations avoid these bottlenecks. The program’s purpose is to analyze enterprise data, and to perform data quality audits, assessments, and verifications to detect and eliminate all data quality issues.
By taking these steps, organizations will avoid bottlenecks and be well prepared to make their AI initiatives realize their full potential.
This article was written by Gopinath Kesavan, Director; Krishnan Srinivasan, Senior Architect; and Jyoti Ranjan Panda, Architect, for Cognizant’s Digital Systems and Technology, Enterprise IT Automation Practice.