Many enterprises pursue a multi-cloud strategy favoring platform neutrality. To what extent can this principle be upheld in the new AI era, especially with the rise of Generative AI, without overly restricting innovation?

The benefits of the cloud are compelling. Even in Europe, there is no longer any doubt that cloud transformation is crucial for business agility and competitiveness. Hence, almost all enterprises are moving to the cloud, in practice to Amazon Web Services (AWS), Microsoft Azure or Google Cloud Platform (GCP). We are heading towards a world in which the IT applications of all Western enterprises run on these three hyperscale cloud platforms. To reduce the risk of vendor or platform lock-in, enterprises typically pursue a multi-cloud strategy, which must now also take into account the specific challenges of AI/GenAI.

Real barriers to cloud neutrality

As outlined in the related article “Multi-cloud – Don’t get trapped by the illusion of cloud platform neutrality”, it is a common fallacy that you can move at will between the cloud platforms as long as you only use platform-independent services. Acting on this belief, enterprises tend to overreact by banning all platform-specific, serverless services, which often represent the cutting edge of innovation. This limits the benefits of cloud transformation and AI/GenAI adoption while overlooking the real challenges of multi-cloud portability.

The services of AWS, Azure and GCP differ, but in most cases the other platforms offer similar, albeit not identical, services. Thus, cloud-native applications and their source code can usually be migrated to another cloud provider, even if some platform-specific services are used. For example, it takes some effort to port AWS Lambda to Azure Functions or Google Cloud Functions, but this is the least of the problems of a cloud platform switch.
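One common way to keep that porting effort small is to separate the business logic from the platform entry point. The following is a minimal sketch, assuming an HTTP-triggered function behind an API Gateway proxy integration. The function `quote_shipping` and its tariff are hypothetical; only the AWS Lambda handler shape `(event, context)` is shown, since adapters for Azure Functions or Google Cloud Functions would wrap the same core function using the request types of their own SDKs.

```python
import json
from typing import Any


def quote_shipping(payload: dict[str, Any]) -> dict[str, Any]:
    """Platform-neutral business logic (hypothetical example)."""
    weight_kg = float(payload["weight_kg"])
    # Assumed flat tariff, purely for illustration.
    return {"weight_kg": weight_kg, "price_eur": round(4.0 + 1.5 * weight_kg, 2)}


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Thin AWS Lambda adapter for an API Gateway proxy event.

    Only this function knows about the platform; porting means
    rewriting this adapter, not quote_shipping.
    """
    try:
        payload = json.loads(event.get("body") or "{}")
        result = quote_shipping(payload)
        return {"statusCode": 200, "body": json.dumps(result)}
    except (KeyError, TypeError, ValueError) as exc:
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
```

With this split, the platform-specific part shrinks to a few lines per provider, which is why the function code itself is rarely the hard part of a migration.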

The greater challenges and effort arise from the fundamental differences between the three platforms in the areas of cloud governance, cloud operations, security and networking. Moreover, big data represents a major barrier to multi-cloud, as all three players charge high egress costs. Moving a data lake is costly. With AI/GenAI driven by data, strategic enterprise data management is becoming even more imperative.
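To get a feeling for the order of magnitude, the egress cost of moving a data lake can be estimated with simple arithmetic. The sketch below is only a rough calculation: the price per gigabyte is an assumption passed in by the caller, not a quoted tariff, and real pricing is tiered and varies by provider, region and contract.

```python
def estimate_egress_cost_usd(data_tb: float, price_per_gb_usd: float) -> float:
    """Estimate the one-off transfer cost of moving data out of a cloud.

    price_per_gb_usd is an assumption supplied by the caller; actual
    pricing is tiered and varies by provider, region and contract.
    """
    if data_tb < 0 or price_per_gb_usd < 0:
        raise ValueError("inputs must be non-negative")
    gigabytes = data_tb * 1000  # decimal units: 1 TB = 1000 GB
    return round(gigabytes * price_per_gb_usd, 2)


# Hypothetical example: a 500 TB data lake at an assumed 0.05 USD per GB.
cost = estimate_egress_cost_usd(500, 0.05)
print(f"Estimated egress cost: {cost:,.0f} USD")
```

The transfer fee is only one part of the bill; the time needed for the transfer and the parallel operation of both environments come on top.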

Impact of AI on multi-cloud strategies

In contrast to classical applications, which are explicitly programmed in functional or object-oriented programming languages, AI models are not coded but trained with large amounts of data. In a sense, enterprise data is baked into the underlying neural network, or more precisely, into the parameters (weights and biases) of its artificial neurons. As Generative AI based on multimodal Large Language Models (LLMs) moves past the Proof-of-Concept (PoC) phase and increasingly into large-scale adoption, the associated risks of vendor lock-in must be assessed and mitigated.

Will AI, and particularly Generative AI with LLMs, greatly increase the gravitational pull of the three hyperscale cloud platforms? Could monopoly or oligopoly structures even emerge?

First, the good news is that there is no area more dynamic than AI/GenAI, with lots of start-ups and a steady stream of innovative solutions. So, there seems to be plenty of choice and no serious lock-in risk. However, in the end, all AI/GenAI-powered applications run on the three hyperscalers. Without the limitless scalability of the hyperscale cloud platforms, there would be no GenAI revolution. Against this background, is it possible to migrate AI/GenAI models from one cloud platform to another? In principle, yes, but it depends on the details. The entire technology stack and the model development and deployment process, including all frameworks and tools, must be looked at. So, let's do that briefly.

AI model development

First, a distinction must be made between training a model from scratch and using an existing, pre-trained model. The trend is towards pre-trained models, as this approach is less complex and there are more and more models for all kinds of use cases. Only tech companies will develop new LLMs themselves, but there are customer-specific predictive analytics use cases for which it still makes sense to train an AI model from scratch. As with in-house developments, the risk of vendor lock-in then lies not in the model itself, but in the frameworks, tools and technologies applied.

In the other case, model development means that the most suitable pre-trained model is selected and adapted to the specific purpose. With more than 1 million models on Hugging Face alone, covering all kinds of use cases, you are spoiled for choice. Model evaluation is therefore a challenge. In any case, there is no such thing as the best model – definitely not for all use cases and not even for a single one. In the face of a highly innovative and dynamic market, always keep the flexibility to switch to a better or more cost-effective model at a later stage.
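Model evaluation becomes more manageable when every candidate is wrapped behind the same small interface and scored on the same test set. The sketch below is an illustration under stated assumptions: the candidate names, the cost figures and the toy classifiers are hypothetical stand-ins for real models or API calls, and exact-match accuracy is just one possible metric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    name: str
    predict: Callable[[str], str]  # wraps any model or API call
    cost_per_1k_calls: float       # assumed figure supplied by the evaluator


def accuracy(candidate: Candidate, examples: list[tuple[str, str]]) -> float:
    """Share of examples where the prediction equals the expected label."""
    hits = sum(candidate.predict(text) == label for text, label in examples)
    return hits / len(examples)


def rank(candidates: list[Candidate], examples: list[tuple[str, str]],
         min_accuracy: float) -> list[Candidate]:
    """Keep candidates that are good enough, cheapest first."""
    good_enough = [c for c in candidates if accuracy(c, examples) >= min_accuracy]
    return sorted(good_enough, key=lambda c: c.cost_per_1k_calls)


def by_keyword(text: str) -> str:
    """Toy classifier standing in for a real model (hypothetical)."""
    return "billing" if ("refund" in text or "invoice" in text) else "technical"


examples = [("refund please", "billing"), ("app crashes", "technical"),
            ("invoice wrong", "billing"), ("cannot log in", "technical")]
keyword_model = Candidate("small-keyword-model", by_keyword, 0.10)
constant_model = Candidate("constant-model", lambda text: "billing", 0.01)
premium_model = Candidate("large-premium-model", by_keyword, 2.00)

shortlist = rank([premium_model, constant_model, keyword_model], examples, 0.9)
print([c.name for c in shortlist])
```

Because the application only depends on the `predict` callable, replacing one model with another later on is a configuration change rather than a rewrite.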

Fundamentally, open-source and proprietary models must be differentiated. With open-source models, the model details are publicly disclosed, and deployment is usually feasible on any platform. It must be checked whether commercial use is permitted. Examples of open-source LLMs are BLOOM, Llama (Meta), BERT (Google), Falcon or Dolly (Databricks).

With proprietary models, not only the pricing but also the customization options must be considered. Some models are offered as LLMaaS (LLM as a service), which does not allow for custom deployment or fine-tuning. Well-known proprietary LLMs are OpenAI GPT, which runs exclusively on Azure, and Gemini, which runs only on GCP. Those two are currently rated as the most cutting-edge LLMs. However, for reasons of price, performance, sustainability (energy consumption) and vendor lock-in mitigation, it is best practice to select not necessarily the top-ranked model, but a sufficiently good model that can often be better adapted to the specific requirements of the use case. For this purpose, the following three approaches are pursued, also in combination:

  • Prompt engineering is about crafting and iteratively refining effective prompts, which provide clarity and contextual information and illustrate the expected outcome with examples. Consistency is ensured by creating prompt templates.

  • Retrieval-Augmented Generation (RAG) enhances AI models by integrating an authoritative internal knowledge base. Typically, relevant documents are collected, converted, split into chunks and indexed. For each request, the most relevant passages are retrieved and passed to the model, which takes this knowledge and context into account in its answer.

  • Fine-tuning involves training a model on a new, task-specific dataset. This allows the model to learn new patterns and relationships from the new data while preserving the knowledge acquired during pre-training. This process leads to a new model variant that is more accurate and specialized for the respective use case.
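The first two approaches can be illustrated with a few lines of model-independent code. This is a minimal sketch, not a production design: the knowledge base entries are invented, and a simple word-overlap score stands in for the embedding-based similarity search that a real RAG setup would typically use. The resulting prompt string could be sent to any LLM.

```python
import re
from collections import Counter
from string import Template

# Prompt template: fixed instructions, variable context and question.
PROMPT = Template(
    "Answer using only the context below. If the answer is not in the "
    "context, say so.\n\nContext:\n$context\n\nQuestion: $question\nAnswer:"
)

# Hypothetical internal knowledge base.
DOCUMENTS = [
    "Travel expenses must be submitted within 30 days of the trip.",
    "Production deployments require approval from the change board.",
    "Laptops are replaced every four years.",
]


def tokens(text: str) -> Counter:
    """Lower-cased word counts; a crude stand-in for embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k documents sharing the most words with the question."""
    q = tokens(question)
    ranked = sorted(documents,
                    key=lambda doc: sum((tokens(doc) & q).values()),
                    reverse=True)
    return ranked[:top_k]


def build_prompt(question: str) -> str:
    """Retrieve context and fill the template (the augmentation step)."""
    context = "\n".join(retrieve(question, DOCUMENTS))
    return PROMPT.substitute(context=context, question=question)


prompt = build_prompt("When do travel expenses have to be submitted?")
print(prompt)  # this string would be sent to whichever LLM is in use
```

Neither the template nor the retrieval step is tied to a particular model or cloud platform, which is one reason why these two approaches carry less lock-in risk than fine-tuning a proprietary model.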

AI model deployment and MLOps

In analogy to DevOps, Machine Learning Operations (MLOps) streamlines the entire machine learning lifecycle from model development to deployment and monitoring. All steps are performed in a highly iterative process that also includes a deployment workflow for different staging environments.
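The deployment workflow across staging environments can be pictured as a sequence of quality gates. The following sketch is purely illustrative: the stage names, the `ModelVersion` record and the accuracy threshold are assumptions, and real MLOps platforms provide their own model registries and approval mechanisms for this purpose.

```python
from dataclasses import dataclass, field

STAGES = ["dev", "staging", "production"]


@dataclass
class ModelVersion:
    name: str
    version: int
    metrics: dict[str, float]
    stage: str = "dev"
    history: list[str] = field(default_factory=list)


def promote(model: ModelVersion, min_accuracy: float) -> ModelVersion:
    """Move a model to the next stage if it passes the quality gate."""
    if model.stage == STAGES[-1]:
        raise ValueError("already in production")
    if model.metrics.get("accuracy", 0.0) < min_accuracy:
        # Gate failed: the model stays where it is and the event is logged.
        model.history.append(f"rejected at {model.stage}")
        return model
    next_stage = STAGES[STAGES.index(model.stage) + 1]
    model.history.append(f"{model.stage} -> {next_stage}")
    model.stage = next_stage
    return model


# Hypothetical model passing two gates with different thresholds.
candidate = ModelVersion("churn-model", 3, {"accuracy": 0.91})
promote(candidate, min_accuracy=0.85)
promote(candidate, min_accuracy=0.90)
print(candidate.stage, candidate.history)
```

Keeping such promotion rules in plain code or configuration, rather than only in the console of one platform, makes it easier to reproduce the workflow elsewhere.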

Figure 2: MLOps

With respect to lock-in risks, the tools and frameworks used for these MLOps steps are of critical importance. All three hyperscalers provide advanced end-to-end MLOps services: Google Vertex AI, Azure Machine Learning, and Amazon SageMaker. These three tools not only integrate model development smoothly with model deployment, but also fit seamlessly into the service ecosystem of the respective cloud platform, particularly regarding application integration, security, data storage and data analytics. Staying exclusively within the service ecosystem of one cloud platform simplifies the creation of enterprise-scale AI-powered applications, but at the same time the lock-in chains tighten.

But the hyperscalers cannot be criticized for only supporting proprietary technologies. All three players are open to integrating third-party solutions and endorse popular open-source frameworks such as TensorFlow and PyTorch. Thus, it is even feasible to fine-tune an open-source model (e.g. from Hugging Face) in Google Vertex AI, export it as a TensorFlow SavedModel, and subsequently deploy it to Amazon SageMaker or Azure ML, or vice versa.

Besides the three mentioned, there are many independent, cloud-agnostic MLOps solutions and platforms – both commercial and open-source. This article is not intended to provide a comprehensive market overview or solution advice, but to show the range of options by picking a few examples.

In the open-source space there are, of course, Hugging Face, TensorFlow and PyTorch. TrueFoundry is an interesting combination of leading-edge open source and commercial services. Enterprises already using Red Hat OpenShift for cloud-neutral Kubernetes container services will certainly consider OpenShift AI. Since AI is data-driven, Databricks and Dataiku have a legitimate claim, too. Both support data engineering and MLOps end-to-end. Enterprise application vendors such as SAP or Salesforce also provide advanced AI offerings that should be considered.

Overall, there are many state-of-the-art, cloud-neutral MLOps solution alternatives on a par with the offerings of the three hyperscalers. It is also not an either/or situation. Instead, all these solutions provide some kind of overarching multi-cloud orchestration layer that integrates seamlessly with Amazon SageMaker, Azure ML and Google Vertex AI.

Recommendation

Based on the sharpened understanding of model development and deployment, the lock-in risks of AI can be rationally assessed.

The following recommendations are made:

  • Avoid dogmatic and overly restrictive rules
    Allowing only open-source services and prohibiting all proprietary AI services, tools, frameworks and models is far too restrictive. This limits innovation and value creation. Flexibility and agility are crucial. In the rapidly changing field of AI, nothing is set in stone anyway.

  • Don't tie yourself up without good reason
    For example, using OpenAI GPT for some use cases doesn't mean you have to pick that LLM as the default for all use cases. Stay flexible, explore and implement alternatives. This mitigates lock-in risks not only in the specific case, but also in general, as a skill or capability is acquired.

  • Design alternative shadow architectures
    If proprietary services, tools or models are used for good reason, design at least one solution alternative. In some industries an exit strategy is mandatory for compliance reasons anyhow, but this advice is meant practically, not as a compliance exercise. Designing an alternative also challenges the alleged lack of alternatives.

  • Build broad, not one-sided AI capabilities
    Establish a continuous learning culture in the enterprise focusing on at least two different ecosystems and MLOps pipelines. Otherwise, the lock-in is caused not by technology, but by a lack of knowledge.

  • Establish a strong data foundation
    AI is mainly about data. Professional enterprise data management comprising data catalog and metadata management, data access management, data versioning, data pipelines, data lineage, data sharing, etc. is a critical success factor. If data is poorly managed, lock-in risks cannot be mitigated, and AI initiatives are doomed to fail in any case.
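The recommendation to design an alternative shadow architecture can be made concrete with a thin, provider-neutral interface. The sketch below rests on assumptions: `PrimaryModel` and `ShadowModel` are hypothetical stand-ins for a proprietary hosted model and a self-hosted open-source alternative, and the fallback on `ConnectionError` is just one possible switching rule.

```python
from typing import Optional, Protocol


class TextModel(Protocol):
    def complete(self, prompt: str) -> str: ...


class PrimaryModel:
    """Stand-in for a proprietary hosted model (hypothetical)."""

    def __init__(self, available: bool = True) -> None:
        self.available = available

    def complete(self, prompt: str) -> str:
        if not self.available:
            raise ConnectionError("primary model unavailable")
        return f"[primary] {prompt}"


class ShadowModel:
    """Stand-in for a self-hosted open-source alternative (hypothetical)."""

    def complete(self, prompt: str) -> str:
        return f"[shadow] {prompt}"


def complete_with_fallback(prompt: str, models: list[TextModel]) -> str:
    """Try each configured model in order behind one common interface."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model.complete(prompt)
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError("no model available") from last_error


# The application code does not change when the primary model drops out.
print(complete_with_fallback("Summarize Q3 results",
                             [PrimaryModel(available=False), ShadowModel()]))
```

Running the alternative regularly, even for a small share of requests, keeps both the architecture and the team's skills current.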

Summary

AI/GenAI increases the gravitational pull of the three hyperscale cloud platforms. However, the market remains highly dynamic, offering a steady stream of innovative solutions. Platform lock-in is not inevitable but, rather, often self-inflicted. While confining oneself to the boundaries of a single ecosystem may seem easy and convenient, it can also be highly limiting. Instead, prioritize flexibility and cultivate a broad set of AI/GenAI skills and expertise. Build a robust data foundation, and avoid dogmatic or overly restrictive rules. This is not the time for consolidation but for decisive action and the large-scale adoption of AI/GenAI.

Felix Theisinger

Chief Enterprise Architect, Cognizant

Felix Theisinger is Chief Enterprise Architect and Senior Director of Strategic Engagements at Cognizant. In this role, he orchestrates cross-practice, overarching target solution design and engineers the digital stack at scale across technologies from SAP to Cloud to GenAI.