The current monitoring systems and limited data availability make it difficult to observe the fluctuations in orthophosphate over shorter periods of time, hindering effective investigations into finding sources of nutrient pollution.
To address this gap, we have leveraged advanced artificial intelligence and machine learning (AI/ML) to develop a model capable of predicting orthophosphate concentrations reliably. While this model is not a conceptual representation of the underlying processes, it captures patterns and correlations in data that can supplement traditional scientific approaches. Its predictive capabilities enable both short-term forecasting and retrospective estimation, reducing dependence on time-intensive laboratory testing. Furthermore, the relationships identified by the model may serve as a useful cross-check against conceptual water quality models, potentially revealing new hypotheses or validating existing assumptions about phosphorus dynamics.
Unlocking insights from decades of data
Our model harnesses 24 years of water quality data from the Environment Agency Water Quality Archive, encompassing over 70 million observations from more than 23,000 sampling points across 55 rivers. The dataset includes more than 3,000 physicochemical parameters – from pH levels to nutrient concentrations – and provides a holistic view of catchment dynamics. By training AI/ML models on these data, we can establish the relationships between these parameters and orthophosphate concentration.
To ensure accuracy, we have collaborated closely with domain experts from across the water sector to conduct extensive exploratory data analysis (EDA), uncovering patterns such as seasonal trends, rainfall impacts, and catchment-specific behaviours. Advanced techniques such as agglomerative clustering and K-Means++ segmentation identified groups of determinants influencing orthophosphate levels, while principal component analysis (PCA), SelectKBest, LassoCV, and f_Regression streamlined feature selection.
The modelling phase tested a suite of algorithms: traditional regression models (linear regression) for baseline insights, tree-based approaches (Random Forest, XGBoost, LightGBM) to capture non-linear relationships, artificial neural networks (ANN) for high-dimensional pattern recognition and Large Language Models (LLMs) such as ChemBERTa (built on RoBERTa), NIH Global chemical database, ChemDataExtractor and KMeans++ for segmentations.
We used advanced AI models, specifically, bidirectional transformers. These are AI models (such as BERT) that understand context by looking at words (or in this case, molecular features) in both directions. This allows the study of how molecules, such as certain environmental pollutants (determinands) and phosphates, are similar at a molecular level. Although this study has provided segmented results, it did not define clear boundaries between these segments.Therefore, we enhanced our analysis by using agglomerative clustering, which is more accurate in performing similarity analysis. For this analysis, we used NIH global chemical data base, RDKit and Cambridge ChemDataExtractor library. These models helped reveal which molecules are chemically attracted or compatible with each other. This understanding allowed us to improve our engineering methods and better connect large-scale environmental data with detailed molecular chemistry.
The Open Orthophosphate Model demonstrates performance, achieving a Mean Absolute Error (MAE) of 0.01, a Mean Squared Error (MSE) of 0.00123, and a Normalised Root Mean Squared Error (NRMSE) of 8%.
A collaborative and Open-Source approach
The overarching objective of River Deep Mountain AI is to bring key stakeholders involved in waterbody health together and collaboratively develop Open-Source AI/ML models that can inform effective actions to tackle waterbody pollution.
All our models will be released into the public domain to democratise artificial intelligence and benefit the entire water sector. The first iterations of our models are being released in July 2025, and the second iterations will be released in November 2025.
Access the first iteration of our Open Orthophosphate Model via GitHub.
River Deep Mountain AI is funded by the Ofwat Innovation Fund and consists of 4 core partners: Northumbrian Water, Cognizant Ocean, Xylem Inc, Water Research Centre Limited, The Rivers Trust and ADAS. The project is further supported by 4 water companies across the United Kingdom.