Many Nordic companies now exploring AI and machine learning are discovering how essential a solid data foundation is. Learn how a cloud-based “lakehouse” can provide a scalable, cost-effective platform for both machine learning and data analytics workloads.
As businesses use machine learning across more areas of their operations to make informed decisions, they face four critical needs:
- Increasing agility to launch machine learning capabilities more quickly to meet a range of business needs, from inventory optimization to fraud detection.
- Minimizing the cost of delivering business intelligence and machine learning in a hybrid cloud or multi-cloud environment.
- Quickly and easily scaling data management and analytic capabilities as workloads change.
- Seamlessly switching from one hyperscaler to another to take advantage of the best balance between cost and performance, while meeting business requirements.
Meeting the data needs
So far, the extensive data needs that come with machine learning have been met by data warehouses and data lakes. To fully address these challenges, however, businesses need to combine the best of data warehouses and data lakes into a new data architecture. The “lakehouse architecture” is gaining support from leading hyperscalers and their technology and service provider partners.
Lakehouses evolved to overcome the limits of data delivery platforms such as data warehouses and data lakes, which are often too expensive to maintain and cannot support the types and amount of data required by today’s machine learning systems. A lakehouse aims to combine the best features of data warehouses and data lakes by providing:
- Support for both traditional SQL-based structured data and for other, more modern, data formats.
- A robust data storage and compute model that leverages any data format or scale of data to more quickly and easily build machine learning models.
- A unified view of data engineering and consumption that reduces costs and speeds up innovation for users, analysts and data scientists.
- Elastic scalability and lower costs than on-premises infrastructure.
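To make the first point in the list concrete, here is a minimal sketch of answering one question from a structured source and a semi-structured source together. It uses only the Python standard library as a stand-in for a lakehouse engine; the sample data, field names and the `customer_summary` helper are all hypothetical and chosen purely for illustration.

```python
import csv
import io
import json

# Hypothetical structured extract, the kind of data a warehouse table would hold.
ORDERS_CSV = """order_id,customer,amount
1,acme,120.50
2,globex,80.00
3,acme,42.25
"""

# Hypothetical semi-structured events (one JSON object per line).
CLICK_EVENTS_JSONL = """{"customer": "acme", "event": "view", "meta": {"page": "/pricing"}}
{"customer": "globex", "event": "view", "meta": {"page": "/home"}}
{"customer": "acme", "event": "purchase", "meta": {"page": "/checkout"}}
"""


def load_orders(text):
    """Parse the CSV extract into typed records."""
    return [
        {
            "order_id": int(row["order_id"]),
            "customer": row["customer"],
            "amount": float(row["amount"]),
        }
        for row in csv.DictReader(io.StringIO(text))
    ]


def load_events(text):
    """Parse newline-delimited JSON, keeping nested fields as they are."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def customer_summary(orders, events):
    """Combine both sources per customer: total spend and number of events."""
    summary = {}
    for order in orders:
        row = summary.setdefault(
            order["customer"], {"total_amount": 0.0, "event_count": 0}
        )
        row["total_amount"] += order["amount"]
    for event in events:
        row = summary.setdefault(
            event["customer"], {"total_amount": 0.0, "event_count": 0}
        )
        row["event_count"] += 1
    return summary


summary = customer_summary(load_orders(ORDERS_CSV), load_events(CLICK_EVENTS_JSONL))
for customer, stats in sorted(summary.items()):
    print(customer, stats)
```

In a real lakehouse the same idea would be expressed as a query over tables and files in cloud storage rather than in-memory strings; the sketch only shows why a single view over both kinds of data is convenient.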
Databricks on Google Cloud is the latest example of leading hyperscalers offering the lakehouse architecture. Delta Lake on Databricks enables data engineering, cloud data processing, data science and analytics workloads on a unified data platform. This makes it easier, faster and more cost-effective for any user – from business analyst to data scientist – to discover and deliver insights quickly to the enterprise and use machine learning in production.
Here is how a lakehouse architecture can help meet the four critical enterprise ML needs.
- Agility: With some lakehouse architectures, businesses can query the data lake directly to answer any business question (whether using traditional business analytics or new machine learning applications). This reduces the development time for both queries and reports, as well as machine learning models. Reusable data engineering processes and self-service data preparation and analytics reduce the time required to find and act on new insights.
- Cost control: Some users have discovered that low-cost cloud storage and data compression methods, such as the open-source Delta Lake format, can cut storage, compute and networking costs by as much as 80%. Pay-as-you-go billing for storage, compute and networks, and the use of open-source container and serverless technologies, also minimize infrastructure costs and improve portability.
- Scalability: By tapping into the hyperscaler’s infrastructure, enterprises can quickly increase their capacity as business needs change.
- Openness: Because many technologies that enable the lakehouse architecture are built on open-source components and/or supported by multiple hyperscalers (the Databricks Unified Data Platform, for example, is available on Google Cloud), businesses will find it relatively easy to move their machine learning platforms from one cloud provider to another, or among hybrid cloud platforms. The use of common pipelines to orchestrate, trigger and manage the various data and machine learning workflows, as well as a single unified runtime, can also ease workload shifting among hyperscalers.
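The cost-control point above can be illustrated with some simple arithmetic. The sketch below is a back-of-the-envelope model of a pay-as-you-go bill, not real provider pricing: the unit prices in `Rates`, the `monthly_cost` function and the sample workload are all assumptions made up for this example. It models compression as shrinking stored and transferred data only, and deliberately leaves compute hours unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical pay-as-you-go unit prices (illustrative, not a real price list)."""

    storage_per_gb_month: float = 0.02
    compute_per_hour: float = 0.50
    egress_per_gb: float = 0.10


def monthly_cost(raw_gb, compute_hours, egress_fraction,
                 compression_ratio=1.0, rates=Rates()):
    """Estimate a monthly bill for storage, compute and network egress.

    compression_ratio is raw size divided by stored size (5.0 means the
    stored data is one fifth of the raw size). egress_fraction is the share
    of stored data transferred out each month.
    """
    if compression_ratio < 1.0:
        raise ValueError("compression_ratio must be >= 1.0")
    stored_gb = raw_gb / compression_ratio
    egress_gb = stored_gb * egress_fraction
    return (
        stored_gb * rates.storage_per_gb_month
        + compute_hours * rates.compute_per_hour
        + egress_gb * rates.egress_per_gb
    )


# Hypothetical workload: 10 TB of raw data, 200 compute hours, 10% egress.
baseline = monthly_cost(raw_gb=10_000, compute_hours=200, egress_fraction=0.1)
compressed = monthly_cost(
    raw_gb=10_000, compute_hours=200, egress_fraction=0.1, compression_ratio=5.0
)
savings = 1 - compressed / baseline
print(f"baseline={baseline:.2f} compressed={compressed:.2f} savings={savings:.0%}")
```

With these made-up inputs the saving works out to about 60%. The actual figure depends entirely on the real prices, the compression achieved and how much of the bill is compute, which is why the 80% quoted above should be read as an upper bound reported by some users.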
The lakehouse effect
As the market evolves, look for hyperscalers and their partners to deliver improved self-service capabilities for data engineering and access; higher-performance lakehouse-based platforms that match the performance of data warehouses; improved ACID (atomicity, consistency, isolation, durability) capabilities; and simpler containerization and deployment of production-ready ML models.
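For readers less familiar with ACID, the sketch below illustrates the atomicity part: a batch of writes either lands completely or not at all. It uses SQLite from the Python standard library purely as a stand-in, since the guarantee is the same one a lakehouse table format aims to provide over cloud storage; the `features` table and the `write_batch` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id INTEGER PRIMARY KEY, value REAL NOT NULL)")


def write_batch(connection, rows):
    """Insert all rows in one transaction; return False if the batch was rejected."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises.
        with connection:
            connection.executemany(
                "INSERT INTO features (id, value) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


first_ok = write_batch(conn, [(1, 0.5), (2, 1.5)])
# The second batch reuses id 1, so the whole batch is rolled back,
# including the otherwise valid row with id 3.
second_ok = write_batch(conn, [(3, 2.5), (1, 9.9)])

row_count = conn.execute("SELECT COUNT(*) FROM features").fetchone()[0]
ids = [row[0] for row in conn.execute("SELECT id FROM features ORDER BY id")]
print(first_ok, second_ok, row_count, ids)
```

Without this guarantee, a failed load could leave a half-written table behind for downstream reports and training jobs to read.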
Remember to make proper data governance and data management the foundation of your lakehouse strategy, as the “garbage-in/garbage-out” rule is more important than ever when it comes to the data required to build machine learning models.
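As one small illustration of guarding against “garbage-in”, the sketch below shows a validation gate that separates clean records from rejected ones before they reach model training. The rules, field names and helper functions (`validate_record`, `split_clean_and_rejected`) are hypothetical; a real governance setup would cover far more, such as lineage, access control and schema management.

```python
def validate_record(record, required_fields, numeric_ranges):
    """Return a list of problems found in one record; an empty list means clean."""
    problems = []
    for field in required_fields:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    for field, (low, high) in numeric_ranges.items():
        value = record.get(field)
        if value is None:
            continue  # already reported as missing if the field is required
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number:
            problems.append(f"{field} is not numeric: {value!r}")
        elif not low <= value <= high:
            problems.append(f"{field} out of range: {value!r}")
    return problems


def split_clean_and_rejected(records, required_fields, numeric_ranges):
    """Partition records into clean ones and (record, problems) rejects."""
    clean, rejected = [], []
    for record in records:
        problems = validate_record(record, required_fields, numeric_ranges)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected


# Hypothetical input records with typical quality issues.
records = [
    {"customer_id": "c1", "amount": 19.99},
    {"customer_id": "", "amount": 5.0},
    {"customer_id": "c3", "amount": -4.0},
    {"customer_id": "c4", "amount": "n/a"},
    {"customer_id": "c5"},
]

clean, rejected = split_clean_and_rejected(
    records,
    required_fields=["customer_id", "amount"],
    numeric_ranges={"amount": (0.0, 10_000.0)},
)
for record, problems in rejected:
    print(record, "->", problems)
```

Keeping the rejected records together with the reasons, rather than silently dropping them, gives data owners something concrete to fix at the source.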