Releasing the second iteration of the Open Orthophosphate Model from River Deep Mountain AI

As part of River Deep Mountain AI, we are now releasing the second iteration of our orthophosphate AI-model on GitHub.

The UK’s water environment is under huge pressure from population growth, climate change and pollution, with only 15% of English rivers achieving good or above ecological health status. Confronted with challenges of this scale and complexity, the water sector needs innovative solutions to protect the health of our rivers and ensure clean water for future generations.

Phosphorus concentrations are critical indicators of water quality and ecological health, with high values signalling nutrient pollution that threatens aquatic life and water ecosystems. However, traditional methods for measuring phosphorus compounds, such as orthophosphate, are time-intensive and costly. Consequently, historical measurements of phosphorous compounds are sparse across time and space.

Image: Map of orthophosphate concentrations across England for 2023, averaged first to daily for each monitoring point, and then averaged for each hydrological area (EA Hydrological Boundaries). The classification shown broadly reflects the various boundaries for the UKTAG standards for phosphorus.

The current monitoring systems and limited data availability make it difficult to observe the fluctuations in orthophosphate over shorter periods of time, hindering effective identification of sources of nutrient pollution.

To address this gap, we have leveraged advanced artificial intelligence and machine learning (AI/ML) to develop an orthophosphate prediction model. While this model is not a conceptual representation of the underlying processes, it captures patterns and correlations in data that can supplement traditional scientific approaches. Its predictive capabilities enable both short-term forecasting and retrospective estimation, helping to fill in gaps resulting from limited monitoring. Furthermore, the relationships identified by the model may serve as a useful cross-check against conceptual water quality models, potentially revealing new hypotheses or validating existing assumptions about phosphorus dynamics.

Unlocking insights from decades of data

Our model harnesses 24 years of water quality data from the Environment Agency Water Quality Archive, encompassing over 70 million observations from more than 23,000 sampling points across 55 rivers. The dataset includes more than 3,000 physicochemical parameters – from pH levels to nutrient concentrations – and provides a holistic view of catchment dynamics. By training AI/ML models on these data, we can establish the relationships between these parameters and orthophosphate concentration.

To ensure accuracy, we have collaborated closely with domain experts from across the water sector to conduct extensive exploratory data analysis (EDA), uncovering patterns such as seasonal trends, rainfall impacts, and catchment-specific behaviours. Advanced techniques such as agglomerative clustering and K-Means++ segmentation identified groups of determinants influencing orthophosphate levels, while principal component analysis (PCA), SelectKBest, LassoCV, and f_Regression streamlined feature selection.

The modelling phase tested a suite of algorithms: traditional regression models (linear regression) for baseline insights, tree-based approaches (Random Forest, XGBoost, LightGBM) to capture non-linear relationships, artificial neural networks (ANN) for high-dimensional pattern recognition and Large Language Models (LLMs) such as ChemBERTa (built on RoBERTa), NIH Global chemical database, ChemDataExtractor and KMeans++ for segmentations.

We used advanced AI models, specifically, bidirectional transformers. These are AI models (such as BERT) that understand context by looking at words (or in this case, molecular features) in both directions. This allows the study of how molecules, such as certain environmental pollutants (determinands) and phosphates, are similar at a molecular level. Although this study has provided segmented results, it did not define clear boundaries between these segments. Therefore, we enhanced our analysis by using agglomerative clustering, which is more accurate in performing similarity analysis. For this analysis, we used NIH global chemical data base, RDKit and Cambridge ChemDataExtractor library. These models helped reveal which molecules are chemically attracted or compatible with each other. This understanding allowed us to improve our engineering methods and better connect large-scale environmental data with detailed molecular chemistry.

Since our first release (in June 2025), the Open Orthophosphate Model has undergone additional validation, to test how well it performs across a variety of catchment types. This includes testing the model in dry and wet catchments, catchments with a high base flow index, low base flow index, ‘near natural’ catchments (UK Benchmark) as well as urban and rural catchments. Across the different types of catchments, we have recorded a mean absolute error (MAE) ranging from 0.02 to 0.05 mg/L.

Furthermore, the model has been assessed using citizen science and continuous water quality data, to test how well it performs when using data outside of the EA water quality archive.

These validation exercises have helped pinpoint the strengths, limitations and opportunities for improvement of the current model. These outcomes and discussions about suggested next steps are detailed in the model output report released alongside the model in GitHub.

A collaborative and Open-Source approach

The overarching objective of River Deep Mountain AI is to bring key stakeholders involved in waterbody health together and collaboratively develop Open-Source AI/ML models that can inform effective actions to tackle waterbody pollution.

All our models will be released into the public domain to democratise artificial intelligence and benefit the entire water sector. The first iterations of our models were released in July 2025, and the second iterations are now being released

Access the second iteration of our Open Orthophosphate Model via GitHub.

River Deep Mountain AI is funded by the Ofwat Innovation Fund and consists of 6 core partners: Northumbrian Water, Cognizant Ocean, Xylem Inc, Water Research Centre Limited, The Rivers Trust and ADAS. The project is further supported by 6 water companies across the United Kingdom.

Cognizant Nordics

Our experts are contributing with exciting insights about what is going on within technology and innovation.

Featured

Featured

Featured

Themes

Cognizant Blog

As part of River Deep Mountain AI, we are now releasing the second iteration of our orthophosphate AI-model on GitHub.

Unlocking insights from decades of data

A collaborative and Open-Source approach

Cognizant Nordics

Latest posts

Related posts