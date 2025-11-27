
As part of River Deep Mountain AI, we are releasing our open-source Escherichia coli (E. coli) models on GitHub. Our models have the potential to transform how we monitor and forecast pollution in bathing waters using artificial intelligence.

England currently has more than 450 registered bathing waters that require periodic E. coli testing to monitor microbial contamination and to calculate annual bathing water classifications. Sampling and analysis for E. coli can be resource-intensive, and more than a working day typically passes between sampling and the release of the results to the public. Reactive and infrequent monitoring prevents effective communication of potential health risks to swimmers, bathers, and other recreational waterbody users.

To tackle this challenge, we have developed our Open E. coli Models using 24 years of historical water quality data from across England (EA Water Quality Archive). The Open E. coli Models can be used as a support tool for classifying the risk of E. coli contamination in bathing waters against a pre-defined E. coli concentration threshold.
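
As a rough illustration of what classifying against a concentration threshold means in practice, the minimal sketch below turns measured concentrations into binary risk labels and scores a set of predictions by accuracy. It is not taken from the released models: the function names are hypothetical, the 500 CFU/100 ml value is borrowed from the classification threshold quoted later in this article, and treating "strictly above the threshold" as elevated risk is an assumption.

```python
# Assumption: 500 CFU/100 ml, the threshold quoted for the classification
# model in this article, is used here purely for illustration.
THRESHOLD_CFU_PER_100ML = 500.0


def risk_label(concentration: float, threshold: float = THRESHOLD_CFU_PER_100ML) -> int:
    """Return 1 (elevated risk) if the concentration is above the threshold, else 0.

    Treating values strictly above the threshold as elevated risk is an
    assumption of this sketch.
    """
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    return int(concentration > threshold)


def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of samples whose predicted label matches the observed label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical laboratory results in CFU/100 ml.
measured = [120.0, 480.0, 510.0, 2300.0]
labels = [risk_label(c) for c in measured]
```

A classifier trained on such labels predicts the 0/1 class directly, and `accuracy` compares its predictions with the labels derived from laboratory results.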

Image (round maps of the UK): We are releasing two versions of the Open E. coli Models: an advanced version and a light version. The light version is trained on data from across England, Wales and Scotland, and has a reduced reliance on in-situ water quality data, making it easier to implement across Great Britain.


To ensure transferability between bathing waters, we have intentionally focused our model development on open datasets with national relevance, developing robust and general models for E. coli forecasting.

AI for short-term forecasting of E. coli in bathing waters

Our Open E. coli Models draw on historical data to identify trends and correlations between E. coli levels and other physical and chemical parameters affecting bathing waters. This allows the models to predict E. coli levels in the water from changes in those parameters. Meteorological, land cover and satellite remote-sensing datasets have also been integrated to improve model accuracy. The models serve as a decision-support tool for authorities, swimmers, water companies and others.

To build robust general models, we have implemented extensive preprocessing, data extraction, correlation analysis, and feature engineering to select the most informative input features. We have trained multiple machine learning regression and classification models, including XGBoost, LightGBM, SVM, K-Nearest Neighbours and neural networks, and compared their performance on these tasks.
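
To make the correlation analysis step more concrete, the sketch below ranks a few candidate input features by the absolute value of their Pearson correlation with log-transformed E. coli concentration. This is one simple way such a screening step could look, not the project's actual pipeline: the feature names, the toy values and the use of a log10 transform are all assumptions made for illustration.

```python
import math


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(features: dict, target: list) -> list:
    """Return (name, correlation) pairs sorted by absolute correlation, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: six samples from an imaginary bathing water.
ecoli_cfu_100ml = [50.0, 120.0, 900.0, 2500.0, 300.0, 80.0]
log_ecoli = [math.log10(c) for c in ecoli_cfu_100ml]
features = {
    "rainfall_mm_48h": [0.0, 2.0, 14.0, 30.0, 6.0, 1.0],
    "water_temp_c": [14.0, 16.0, 15.0, 17.0, 13.0, 18.0],
    "turbidity_ntu": [1.0, 2.5, 9.0, 20.0, 4.0, 1.5],
}
ranking = rank_features(features, log_ecoli)

for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

Pearson correlation only captures linear relationships, so a screening step like this would normally be complemented by model-based feature importance before features are dropped.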

Advanced and light versions of the Open E. coli Models

The second release includes two versions of the Open E. coli Models: an advanced version and a light version. The light version has been trained on a larger set of E. coli measurements from across England, Wales and Scotland, and has a reduced reliance on in-situ water quality data, making it easier to implement across Great Britain.

Image (city map): In our latest performance evaluation, the advanced version of the Open E. coli Model (left) had an accuracy of 80.8% and the light version of the Open E. coli Model (right) had an accuracy of 86.7%.


Since the first iteration of our Open E. coli Models (in June 2025), we have conducted a validation exercise to assess how the advanced and light models perform when faced with new data. When validating the classification model (with a threshold of 500 CFU/100 ml) trained with a random temporal split, we recorded drops in accuracy, from 80.9% to 80.4% (advanced) and from 86.7% to 75.2% (light). In contrast, with a random geographical split, the accuracy of the light version increased from 73.8% to 79.7%, while that of the advanced version decreased from 80.4% to 78.7%. The details of the validation can be found in the model output report shared on GitHub together with the Open E. coli Models.
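
The article does not spell out how the temporal and geographical splits were constructed. One common reading is a grouped hold-out, where whole years (temporal) or whole sampling sites (geographical) are kept out of training so that the test data is genuinely unseen. The sketch below illustrates that idea only; the `group_holdout` helper, the record fields and the toy data are hypothetical and are not taken from the released code.

```python
import random


def group_holdout(records: list, group_key: str, test_fraction: float = 0.2, seed: int = 0):
    """Split records so that each group lands entirely in train or in test.

    Using group_key="site" approximates a geographical split, and
    group_key="year" approximates a temporal split.
    """
    groups = sorted({record[group_key] for record in records})
    if len(groups) < 2:
        raise ValueError("need at least two groups to hold one out")
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Hold out at least one group, but always leave at least one for training.
    n_test = min(max(1, round(len(groups) * test_fraction)), len(groups) - 1)
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


# Hypothetical records: five sites sampled once a year for four years.
records = [
    {"site": site, "year": year, "ecoli_cfu_100ml": 100.0 * (index + 1)}
    for index, (site, year) in enumerate(
        (s, y) for s in ("A", "B", "C", "D", "E") for y in (2019, 2020, 2021, 2022)
    )
]

geo_train, geo_test = group_holdout(records, "site", test_fraction=0.2, seed=1)
time_train, time_test = group_holdout(records, "year", test_fraction=0.25, seed=1)
```

Comparing accuracy on `geo_test` and `time_test` indicates whether a model transfers better to unseen places or to unseen periods, which is the kind of question the validation figures above address.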

Ultimately, we have developed models that can support the monitoring of microbial water quality using low-cost, commonly available datasets. The models released today can enable a proactive approach to water quality management and reduce human exposure to excessive E. coli concentrations.

A collaborative and open-source approach

The overarching objective of River Deep Mountain AI is to bring key stakeholders involved in waterbody health together and to collaboratively develop open-source AI/ML models that can inform effective actions to tackle waterbody pollution. 

All our models will be released open source to democratise artificial intelligence and benefit the entire water sector.

Access our Open E. coli Models via GitHub.

River Deep Mountain AI is funded by the Ofwat Innovation Fund and consists of 6 core partners: Northumbrian Water, Cognizant Ocean, Xylem Inc, Water Research Centre Limited, The Rivers Trust and ADAS. The project is further supported by 6 water companies across the United Kingdom and Ireland.

 

