If data is the new oil, it’s a lot more like shale than fresh crude.
Here’s why: When someone compares data to oil — a conference-keynote favorite — it brings to mind the image of crude gushing out of a derrick. The reality of data science looks a lot more like the production of shale oil, which sits between layers of shale rock and impermeable mudstone and is obtainable only by fracking — fracturing the rock with pressurized liquid.
For all the transformative potential of industrial AI, most big data projects fail, much as early fracking attempts did.
Gartner estimates the failure rate of AI projects at 60 percent, and some sources, such as Pactera and Dimensional Research, put it as high as 80 to 85 percent. Those estimates amount to $22 billion to $30 billion pumped into failed AI projects in 2019.
It shouldn't come as a surprise that leaning into AI for digital transformation is neither easy nor inexpensive. What is striking is that the challenge often isn't getting the analytics right; it's the availability, quality, and management of the data itself. Like shale production, data science is challenged by extracting, refining, and controlling the input that makes it productive.
There are three main problems data science and shale production share.
1. Extraction Challenges
For decades, shale reserves in the United States proved so difficult to tap that many were left unrecovered. Heavy research funding from the Department of Energy (DoE) and Gas Technology Institute (formerly the Gas Research Institute) into fracking and horizontal drilling resulted in the development of commercial-grade technology, equipment, and machines. Only then, after decades of funding and testing, were operators able to tap into shale plays — locations known to have large shale reserves. These developments in drilling technology and infrastructure, applied first at known plays and then at others identified by geological survey, grew shale's share to nearly two-thirds of American oil production and three-fourths of its gas production.
Similarly, industrial information sits in hard-to-reach systems with wildly different conditions for extraction. Like geologic surveyors, data managers and product managers must identify powerful sources of data, whether they're in current fields (like spreadsheets), yet-undiscovered fields, or previously known fields where technology now enables access (like the Internet of Things). Additionally, these sources maintain differently labeled and often partially overlapping records. Within single sources, time-series data have inconsistent timestamps, and values are frequently missing or filled with errors. Variation in metadata and quality within and across sources makes recovery difficult to standardize. The result is that a typical data scientist spends 80 percent of her time data-wrangling: cleaning, structuring, and enriching raw data until it's AI-ready. At many large corporations, data collection and storage are so mismanaged that data science sometimes isn't even possible. Gartner pegs the productivity lost to poor-quality data at about 30 percent of revenue per business on average.
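To make that wrangling concrete, here is a minimal Python sketch of one routine chore: aligning irregular sensor readings to a fixed time grid and filling gaps by linear interpolation. The function name, the 15-minute interval, and the input format are illustrative assumptions, not a reference to any particular tool.

```python
from datetime import datetime, timedelta

def clean_series(readings, interval_minutes=15):
    """Align irregular sensor readings to a fixed grid and fill gaps.

    readings: list of (ISO-8601 timestamp string, value-or-None) pairs.
    Returns (datetime, value) pairs on a regular interval, with values
    linearly interpolated between the nearest valid readings.
    """
    # Parse and sort, dropping records whose value is missing.
    parsed = sorted(
        (datetime.fromisoformat(ts), v) for ts, v in readings if v is not None
    )
    step = timedelta(minutes=interval_minutes)
    start = parsed[0][0].replace(minute=0, second=0, microsecond=0)
    end = parsed[-1][0]

    grid = []
    t, i = start, 0
    while t <= end:
        # Advance i so that parsed[i] is the last reading at or before t.
        while i + 1 < len(parsed) and parsed[i + 1][0] <= t:
            i += 1
        # Interpolate only when t falls between two known readings.
        if i + 1 < len(parsed) and parsed[i][0] <= t:
            (t0, v0), (t1, v1) = parsed[i], parsed[i + 1]
            frac = (t - t0) / (t1 - t0) if t1 != t0 else 0.0
            grid.append((t, v0 + frac * (v1 - v0)))
        t += step
    return grid
```

Real pipelines layer on outlier detection, unit conversion, and source-specific parsers, but the shape of the work is the same: normalize the time axis first, then repair the values.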
2. Difficulty of Ingesting & Refining for Value
Extracted oil is valuable as a commodity, but it is worth far more once refined into fuel, plastic, or fertilizer. Because American shale oil is lighter and lower in sulphur than the heavy crude they were built for, domestic refineries have had to adapt infrastructure traditionally suited to handling heavy grades. The global oil industry, too, has had to change, adopting go-to-market practices that process heavy oil to fully capture the value of available crude outside the United States. Both the existing refinery processes in the US and changing go-to-market practices worldwide have made refining shale oil more challenging; on average, it is three to four times more expensive than refining conventional crude for commercial use.
Likewise, raw data is far less useful than the insights derived from it. An enterprise must refine raw data, creating valuable intellectual property through a proprietary process, subject matter expertise, analytics, software, and the ingestion of unlike datasets. Although the soaring volume, velocity, and variety of industrial data has been a boon for AI, industrial leaders admit they're struggling to implement predictive analytics solutions that generate sustainable value. Senior executives often lack the analytical expertise to manage the strategic capabilities that data insights offer. AI initiatives lag when different areas of the company have varying access to the data needed to make strategic decisions. Once developed, AI presents a similar refinement problem: turning a commodity into a saleable product in a market that is not yet mature. In addition, this refinement for value must be completed with an understanding of the legal, regulatory, and contractual restrictions that bind the use of data but are still in flux.
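As a small illustration of ingesting unlike datasets, the sketch below harmonizes differently labeled, partially overlapping records from two sources into one schema before merging them on a shared key. All source names and fields here are invented for the example.

```python
def harmonize(records, field_map, key):
    """Rename each source's fields to a shared schema and merge on a key.

    records: dict of source name -> list of row dicts.
    field_map: dict of source name -> {source field: canonical field}.
    key: canonical field that identifies the same entity across sources.
    Later sources fill in fields earlier ones lack; they never overwrite
    a value that is already present.
    """
    merged = {}
    for source, rows in records.items():
        rename = field_map[source]
        for row in rows:
            # Translate this source's labels to the canonical schema.
            canon = {rename.get(k, k): v for k, v in row.items()}
            entity = merged.setdefault(canon[key], {})
            for k, v in canon.items():
                entity.setdefault(k, v)  # keep the first value seen
    return list(merged.values())
```

For example, an asset register that calls a pump "asset_id" and a sensor historian that calls it "tag" can both be mapped to a canonical "asset" field, so one merged record carries both the site metadata and the latest sensor reading.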
3. Negative Externalities
Extracting shale has caused concern because of its impact on air pollution through the release of methane gas, leaching into underground reservoirs used for drinking water, and triggering of earthquakes.
Data science raises its own issues, such as protecting data privacy, maintaining ethical algorithms, and ensuring cybersecurity. (And those are just the concerns if data science builds models that work as intended.) A survey from the open source data science platform Anaconda found that three-quarters of data scientists use open-source platforms, and a third of those say they don't take deliberate measures to protect their work. When you're relying on AI to help make decisions around infrastructure, a single malfunction, data leak, or poor decision can do serious harm. And, at the bare minimum, mismanaged data quality leads to waste: about $3.1 trillion per year in the United States, according to IBM.
None of these challenges is insurmountable. The emergence of shale production as a viable source of energy, notwithstanding its environmental and human health effects, holds promise for a data-driven economy with true grounding in insights from AI. Too often, though, these challenges remain blind spots for industrial leaders who fail to recognize that data integrity is holding back their AI initiatives.
Ingestion proves the greatest challenge in translating data into insight: turning readily available raw material into functional inputs and, finally, into commercially viable products. Only upon a foundation of clean, standardized, high-quality data will information empower leaders to survey the complete range of facts, ask the right questions, make decisions with total clarity, and consider the consequences of those decisions. And only then can AI fulfill its promise to the industrial world and its customers.