Data engineers have long been the unsung heroes of modern business. Many of the most dazzling achievements of the digital age have relied on the work of people toiling behind the scenes to build and maintain the data pipelines, databases and infrastructures that store and analyze the ever-rising tides of information that define today’s competitive landscape.
But today is becoming tomorrow, and life is changing fast for the humble data engineer. The arrival of generative AI has already transformed the day-to-day work of wrangling data. With its ability to automate many tedious manual processes, generative AI frees engineers’ time and attention for higher-value tasks.
In fact, in our September 2023 survey of senior business and technology decision makers at large businesses in the US and UK, 61% of respondents cited software development productivity as the business area where gen AI could play the largest role in their workplace.
Not only that, but the unique importance of data engineering to AI itself is about to give these unassuming specialists a new and central role in the business ecosystem—unsung no longer; heroes more than ever.
Gen AI and the data engineer
Generative artificial intelligence (gen AI) refers to the new breed of AI models that can generate original content based on the patterns and structures learned from huge troves of existing data. The best-known example, for now, is OpenAI’s GPT-4, a natural language processing model that can generate fluent, coherent and contextually relevant text based on user input.
Other gen AI models work in the visual medium, and the most obvious, immediate value of these technologies to data engineers is that they will let engineers produce high-quality charts, graphs and reports from a data set without (necessarily) enlisting the help of human designers or even analysts.
The core purpose of data engineering has always been to lay bare the trends and meanings within a data set. Gen AI has the potential not only to help identify those trends and meanings, but also to present them with such clarity that non-technical minds can grasp them in an instant.
But the “creativity” of data engineering has always been about more than charts. The work requiring the most inspiration, abstraction and “what-if” thinking is the design of data infrastructures themselves.
Here too, gen AI can provide a huge boost. As models become more advanced, they will be able to tackle these more complex data engineering tasks, from schema generation to feature engineering. Already, though, simply by automating much of the technical drudgery of data work (coding, for instance, or system maintenance), gen AI is freeing up data engineering professionals to spend more of their time and creativity on high-value work and more abstract thinking.
The data side of gen AI
In addition to gen AI’s potential to help data engineers better manage the flow of existing data, this technology can also create new data. The appeal of this may not be obvious to a business already drowning in information—struggling with the challenge of converting an unmanageable “data swamp” into a less daunting “data lake,” say. However, there are several key areas where new data can directly drive growth and aid decision-making.
- Data augmentation. A pet peeve of every data engineer is the incomplete dataset. Just as GPT-4 can produce lifelike, human-seeming text, other generative models use machine learning techniques such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) to generate realistic, high-quality data samples that can fill those gaps.
A GAN, for example, trains two neural networks against each other: a generator that produces samples and a discriminator that tries to tell them apart from real records. The generated output is refined until it is hard to distinguish from real data. By itself, this innovation, which reduces the need for manual data imputation, can greatly streamline the data engineering process and cut the time spent on data cleaning and preprocessing.
- Data anonymization. In the age of stringent data privacy regulations, such as GDPR and CCPA, it’s essential for businesses to ensure the privacy of sensitive user information. Generative AI models can be used to create synthetic data that retains the statistical properties of the original data while removing any personally identifiable information. This synthetic data can then be used for data analysis and other purposes without violating privacy regulations.
- Predictive analytics. If insights drawn from past and current business data are invaluable to decision-makers, imagine what they could do with information from the future. While gen AI does not actually have the gift of prophecy, it can analyze historical and current data to make informed predictions about customer behavior, market dynamics, operational performance and other key business factors.
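To make the idea of synthetic data that "retains the statistical properties of the original" more concrete, here is a minimal Python sketch. It is deliberately not a GAN or a VAE: as a simple stand-in, it fits a multivariate Gaussian to a small numeric table and samples fresh rows from that fit. The table itself (an `age` and a `spend` column), the helper names `fit_gaussian` and `sample_synthetic`, and all the numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_gaussian(real: np.ndarray):
    """Estimate column means and the covariance matrix of a numeric table."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov


def sample_synthetic(mean: np.ndarray, cov: np.ndarray, n_rows: int, seed: int = 0) -> np.ndarray:
    """Draw new rows from the fitted Gaussian; no original row is copied."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)


# Hypothetical "real" customer table: column 0 is age, column 1 is annual spend.
rng = np.random.default_rng(42)
age = rng.normal(40, 10, size=5000)
spend = 50 * age + rng.normal(0, 300, size=5000)
real = np.column_stack([age, spend])

# Fit on the real table, then generate a synthetic table of the same size.
mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=5000, seed=7)

# Compare the age/spend correlation in the real and synthetic tables.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation:      {real_corr:.3f}")
print(f"synthetic correlation: {synth_corr:.3f}")
```

The synthetic rows are drawn from the fitted distribution rather than copied from real records, yet the means and the age/spend correlation stay close to the original. A Gaussian fit only captures means and linear correlations, which is why richer generative models are used for real-world data, and sampling from a fitted model is not by itself a privacy guarantee.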
Data engineering cautionary tales
Much has been written about the potential dangers of generative AI, and because gen AI is itself a product of data engineering (see below), any and all of its problems are ultimately problems for the data engineer. However, when considering the use of gen AI within data engineering, some of these hotly debated risks are likely less of an issue than they are for other fields, while others may be more worrisome.
Take the issues around bias and copyright. From the moment ChatGPT, built on GPT-3.5, brought gen AI to mainstream awareness in November 2022, anxious observers flagged some glaring ethical concerns. Because the model was trained on a vast quantity of human-generated text, much of it scraped from the internet, there was a risk of its output directly copying a single human writer’s work without attribution or compensation. This raised the more philosophical question of what, if anything, is owed to the whole class of human writers who, without consent, provided the raw material on which the model was trained.
More disturbing was the reality that bias and prejudice within the training set, and the unconscious bias of those developing the model, could help perpetuate or even amplify those injustices in the real world, and thus in future data sets.
Data engineers need to be mindful of these issues; a set of raw numerical data can be as tainted with bias as any collection of words. For the most part, though, in the abstracted world of big-data infrastructure, it is more difficult to give offense, and numbers will never equal words or pictures in their capacity to wound or shock or denigrate.
The questions around model transparency, however, may pose more of a challenge to data engineers. Generative AI models, particularly those based on deep learning techniques, can often be functional “black boxes.” They can take input in the form of a natural-language prompt and, from it, produce content that is also digestible by human minds. In many cases, though, the chain of “reasoning” between those inputs and outputs is utterly opaque, conducted in terms that only the model understands.
For, say, a graphic designer using an AI image generator, this may not be a problem—artistic inspiration has always been mysterious, after all. But for hard-nosed data engineers, whose work has always required them to understand, and be prepared to defend or duplicate, the logical chain between input and output, the impenetrability of generative AI may pose a particular challenge.
Developing techniques to improve the interpretability and explainability of generative AI models will be crucial to their widespread adoption and integration into data engineering workflows.
A unique relationship
All of which is only to say that gen AI is going to have the same kind of impact on data engineers as it’s going to have on so many of us: a profound one, changing not just how we work, but what our work even is.
What makes data engineering unique in this regard, however, is that it is literally where generative AI comes from, and what makes it tick. All the dazzling power of large language models, and their equivalents, comes from the awesome size of the datasets on which they are trained, and the systems that sift, analyze and weight that data into the billions, even trillions, of parameters that a model applies in order to produce fresh content.
Put another way, data engineers are to generative AI what coders are to software, or what mechanics are to cars, and their importance is set only to increase. By some predictions, we are less than a year away from fully 60% of the training data for gen AI models being synthetic data, itself the product of gen AI, midwifed into existence by data engineers.
The next few years, in short, are going to be a wild ride for specialists who today, in the public imagination, are still primarily tasked with turning last year’s Q4 sales data into a pie chart. As professionals in every field adjust to life as the flesh-and-blood member of a human-machine partnership, it is data engineers, increasingly, who will be the matchmakers, chaperones and couples-counselors of those relationships.
It’s no exaggeration to say that humanity’s immediate future will be shaped directly by data engineers. And the future of data engineering, conversely, will be shaped by those who are best prepared and most willing to harness the awesome power of this transformative technology.