As part of River Deep Mountain AI, we are now releasing an early version of our open-source E. coli models on GitHub. Our models have the potential to transform how we monitor and forecast pollution in bathing waters using artificial intelligence.

England currently has more than 450 registered bathing waters that require periodic E. coli testing during the summer to monitor contamination and to calculate annual bathing water classifications. Sampling and analysis for E. coli can be resource intensive, commonly resulting in a delay of 48 hours between sampling and the release of the results to the public. The delay, along with the infrequent sampling and the lack of E. coli testing during the winter season, increases the microbial health risk to swimmers, bathers, and other recreational waterbody users.

To tackle this challenge, we have developed our Open E. coli Models using 24 years of historical water quality data from across England (EA Water Quality Archive). The Open E. coli Models can be used as a support tool for classifying the risk of E. coli contamination in bathing waters against a pre-defined threshold.

To ensure transferability between bathing waters, we have intentionally focused our model development on open-source datasets with national relevance, developing robust and general models for E. coli forecasting.

An AI model for short-term forecasting of E. coli in coastal bathing waters

Our Open E. coli Models draw on historical data to identify trends and correlations between E. coli levels and other physical and chemical parameters. This allows the models to learn complex patterns associated with increases in E. coli in the water. Meteorological (weather) data has also been integrated to improve model accuracy. The models serve as a decision-support tool for authorities, swimmers, water companies and more.

To build a robust general model, we have implemented extensive preprocessing, data extraction, correlation analysis, and feature engineering techniques to select the most useful input features. We have trained multiple machine learning regression and classification models, including XGBoost, LightGBM, SVM, K-Nearest Neighbours and neural networks, to assess and compare their performance on these tasks.
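As an illustration of how candidate models can be compared on equal terms, the sketch below scores each candidate on the same held-out split. It is not the project's code: it uses pure-Python stand-ins (a small K-Nearest Neighbours classifier and a majority-class baseline) in place of the libraries named above, and the feature names (rainfall, turbidity), the data values and all function names are hypothetical.

```python
import math
from collections import Counter


def make_knn(k):
    """Return a 'fit' function for a k-nearest-neighbours classifier."""
    def fit(train_X, train_y):
        def predict(x):
            # Rank training rows by Euclidean distance to x, then vote among the k closest.
            ranked = sorted((math.dist(row, x), label) for row, label in zip(train_X, train_y))
            votes = Counter(label for _, label in ranked[:k])
            return votes.most_common(1)[0][0]
        return predict
    return fit


def make_majority():
    """Return a 'fit' function for a baseline that always predicts the most common class."""
    def fit(train_X, train_y):
        majority = Counter(train_y).most_common(1)[0][0]
        return lambda x: majority
    return fit


def compare(candidates, train_X, train_y, test_X, test_y):
    """Fit every candidate on the same training data and score it on the same test data."""
    scores = {}
    for name, fit in candidates.items():
        predict = fit(train_X, train_y)
        hits = sum(predict(x) == y for x, y in zip(test_X, test_y))
        scores[name] = hits / len(test_y)
    return scores


# Synthetic rows: (rainfall_mm, turbidity_ntu). Label 1 = above threshold, 0 = below.
train_X = [(0.0, 2.0), (1.0, 3.0), (0.5, 1.5), (2.0, 4.0),
           (12.0, 20.0), (15.0, 25.0), (10.0, 18.0)]
train_y = [0, 0, 0, 0, 1, 1, 1]
test_X = [(0.8, 2.5), (14.0, 22.0), (1.5, 3.5), (11.0, 19.0)]
test_y = [0, 1, 0, 1]

candidates = {"knn_k3": make_knn(3), "majority_baseline": make_majority()}
scores = compare(candidates, train_X, train_y, test_X, test_y)
for name, score in scores.items():
    print(f"{name}: accuracy = {score:.2f}")
```

Including a trivial baseline alongside the real candidates gives a floor against which any reported accuracy can be judged. In practice, features would also be scaled before distance-based models such as K-Nearest Neighbours are applied, which this sketch omits.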

Image: For training and inputs, our E. coli models currently rely on two sets of data: water quality parameters and weather. Moving forward, we want to include an additional set of satellite-derived data, including land cover, and to create a light version that reduces the reliance on in-situ water quality data.

In a performance evaluation, the current version of the Open E. coli Model had an average accuracy of 72.78% when classifying the risk of E. coli as either above or below a threshold of 1000 CFU/100 ml in bathing waters.
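To make the evaluation concrete, the sketch below shows how measured concentrations can be turned into above/below labels at the 1000 CFU/100 ml threshold and how accuracy is then computed as the share of correct predictions. The sample values and predictions are synthetic, the function names are hypothetical, and treating a value of exactly 1000 as "below" is an assumption; the sketch does not reproduce the project's evaluation.

```python
THRESHOLD_CFU_PER_100ML = 1000  # threshold quoted in the article


def risk_label(concentration_cfu_per_100ml):
    """Return 1 if the concentration is above the threshold, else 0.

    Assumption: a value exactly equal to the threshold counts as 'below'.
    """
    return int(concentration_cfu_per_100ml > THRESHOLD_CFU_PER_100ML)


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Synthetic measured concentrations (CFU/100 ml) and synthetic model predictions.
measured = [120.0, 2400.0, 950.0, 1000.0, 15000.0, 40.0, 1100.0, 300.0]
predicted = [0, 1, 1, 0, 1, 0, 0, 0]

true_labels = [risk_label(c) for c in measured]
score = accuracy(true_labels, predicted)
print(f"accuracy = {score:.2%}")
```

Accuracy alone does not distinguish between missed exceedances and false alarms, so a fuller evaluation would typically also report those two error types separately.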

This first iteration of the Open E. coli Models still has a range of limitations, which we want to address moving forward.

Improvements for the second iteration of our models will focus on refining model applicability and increasing prediction reliability by integrating additional data sources and advanced modelling techniques. We also plan to explore other features and parameters that might contribute to E. coli concentration in inland and coastal bathing waters, giving the models an understanding of trends and behaviours beyond just water quality correlations.

Ultimately, we aim to develop models that can indicate water quality safety with regard to E. coli concentrations using low-cost, commonly available datasets. These models can enable a proactive approach to microbial water quality management and reduce human exposure to health risks.

A collaborative and open-sourced approach

The overarching objective of River Deep Mountain AI is to bring key stakeholders involved in waterbody health together and to collaboratively develop open-source AI/ML models that can inform effective actions to tackle waterbody pollution.  

All our models will be released open source to democratise artificial intelligence and benefit the entire water sector. The first iterations of our models are released now in May 2025, and the second iterations will be released in November 2025.

Access the first iteration of our Open E. coli Models via GitHub.

River Deep Mountain AI is funded by the Ofwat Innovation Fund and consists of six core partners: Northumbrian Water, Cognizant Ocean, Xylem Inc, Water Research Centre Limited, The Rivers Trust and ADAS. The project is further supported by six water companies across the United Kingdom and Ireland.

 

