Your AI is Succeeding. That’s Exactly the Problem.
Somewhere in your organisation right now, an AI system is hitting its metrics. The dashboard looks good. The team is proud.
And it is quietly optimising for the wrong thing.
You don’t know which system. You’re not sure who would tell you. And the longer it runs, the more its logic embeds itself in your processes, your decisions, your culture. These failures don’t announce themselves. They compound.
The organisations navigating this well share one trait: someone is asking the questions AI cannot ask itself. Interrogating objectives. Connecting silos. Owning accountability. Translating capability into value.
Call that person an AI architect. Most organisations don’t have one. Many think they do, and they’re wrong. What follows is the case for why the role matters, and a blueprint specific enough to act on.
PART I: THE CASE - WHY AI CREATES RISK
AI executes. It never questions.
In the summer of 2020, Covid cancelled A-level exams across the UK. The government instructed Ofqual, the examinations regulator, to issue grades anyway and to ensure the overall distribution matched previous years. The objective was explicit: prevent grade inflation. Ofqual built an algorithm to do exactly that.
It worked. The national grade distribution held steady. It also downgraded nearly 40% of A-level results. Large state schools were penalised hardest, while the proportion of top grades at independent schools rose by 4.7 percentage points, more than double the rise at state comprehensives. The algorithm didn’t malfunction. It optimised for exactly what it was told to optimise for. The objective was the failure. The government U-turned within days and reverted to teacher-assessed grades.[1]
The enterprise version of this failure is less public but equally corrosive. Amazon built a recruitment AI trained on a decade of CVs. The system learned to penalise résumés containing the word “women’s”, as in “women’s chess club captain”, because historically most successful hires had been male. The objective looked technically sound: find people who resemble past top performers. But that objective encoded the bias. Amazon abandoned the project around 2017 and Reuters revealed the story the following year, but recruiters had already consulted its outputs in real hiring decisions before it was shut down.[2]
The sceptical CEO says: “Those are failures of objective-setting, not evidence of a missing role.” Exactly. That’s the point. Who in your organisation is responsible for reviewing whether an AI system’s objective is right, not whether the model is accurate? If the answer is “the project team,” you’ve made the people with the strongest incentive to ship also responsible for challenging whether they should. That’s not governance. It’s a conflict of interest.
AI scales conviction. It doesn’t test assumptions.
Zillow’s iBuying programme is what happens when an algorithm optimises for the right metric in the wrong context and nobody with cross-functional authority pulls the brake.
Zillow is the largest online real estate marketplace in the US, offering a “housing super app” to buy, sell, rent, and finance properties. Zillow Offers used an AI pricing model to buy homes, renovate them, and resell at a profit. In the pandemic housing boom, it worked. Then the market turned. In Q3 2021, the algorithm kept buying aggressively while conditions shifted beneath it. It purchased 9,680 homes in a single quarter and sold only 3,032. The model was optimising for acquisition volume and price appreciation. The appreciation had stopped.
The pricing model couldn’t see the renovation pipeline backing up. The renovation team couldn’t see the pricing model’s assumptions. Nobody with authority across both functions intervened in time. Zillow recorded over $500 million in losses, shut down the entire division, and cut a quarter of its workforce. Its market capitalisation fell from $48 billion to $16 billion in nine months.[3]
This was not a government body. It was a well-resourced technology company with sophisticated data science teams, and the algorithm still outran the organisation’s ability to challenge it. The failure was structural: no one owned the question “What happens to this business when the model’s core assumption breaks?”
AI follows the org chart. It doesn’t understand the organisation.
Every company has two structures: the documented one, and the one that actually works. The informal networks. The trusted fixers. The workarounds nobody has written down.
AI sees only the first version.
The Post Office Horizon scandal is the most extreme case of what happens when an organisation trusts the documented system over lived reality. Between 1999 and 2015, the Horizon accounting software, developed by Fujitsu and used across the UK’s post office network, contained bugs that generated phantom shortfalls in branch accounts. More than 900 subpostmasters were wrongfully convicted of theft, fraud, and false accounting based on the system’s data. Some were imprisoned. Many lost their homes. At least thirteen suicides have been linked to the scandal.
For over fifteen years, the Post Office trusted the system’s outputs over the testimony of hundreds of operators who said the numbers were wrong. The organisation’s documented process said the accounting system was reliable. The reality on the ground said otherwise. The system won every time, until a group litigation in 2019 finally established that Horizon contained, in the High Court’s words, “bugs, errors, and defects” capable of causing the shortfalls subpostmasters had been blamed for. In 2024 the then Prime Minister, Rishi Sunak, described it as one of the greatest miscarriages of justice in British history.[4]
This is not just a cautionary tale about software testing. It is a structural lesson: when an organisation’s documented processes diverge from operational reality, any system that automates the documented version will enforce the gap, and the people closest to the reality will be the first casualties.
AI makes silos excellent and no one accountable. That’s not the same as making the organisation work.
In one financial services firm, two teams launched two models within weeks of each other.
The first sat in Sales. Its job was simple: find prospects who would say yes quickly. It was tuned for acquisition volume and short-cycle conversion. Every week the model got sharper, learning which segments clicked, applied, and onboarded with the least friction. The dashboard glowed green: cost per lead down, applications up, funnel velocity improving.
The second sat in Risk. It was tuned for the opposite pressure: minimise default exposure. It learned, with equal efficiency, which applicants would wobble under stress, which profiles correlated with early arrears, which cohorts were expensive to carry through downturns. Its dashboard glowed green too: expected loss down, approvals cleaner, portfolio risk trending in the right direction.
No one ever made them agree on what “a good customer” meant. Sales defined good as convertible. Risk defined good as durable. So the system as a whole became incoherent: the sales model discovered fast-converting segments, and the risk model, doing its job, rejected those same segments at scale. Marketing spend chased people the firm would not lend to. Acquisition costs climbed. Conversion rates collapsed. Each function stayed green against its own metrics while the business bled margin in the space between them.
On a £500 million loan book, it doesn’t take much drift for this to become material. A 5% misalignment between what Sales is trained to chase and what Risk is calibrated to approve can bleed roughly £25 million a year in wasted acquisition spend and mispriced exposure. That number appears on no one’s dashboard because it belongs to no one’s system. It lives in the gap, and the gap has no owner.
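The arithmetic behind that figure is simple enough to put on a dashboard. What follows is a minimal sketch, not a description of any real firm’s tooling: the `Prospect` record, its two flags, and both function names are hypothetical, and the cost formula assumes, as the paragraph above implicitly does, that the loss scales linearly with the size of the loan book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prospect:
    """Hypothetical record joining the outputs of the two models."""
    prospect_id: str
    sales_targeted: bool  # assumed flag: the Sales model chose to chase this prospect
    risk_approved: bool   # assumed flag: the Risk model would approve this prospect


def misalignment_rate(prospects: list[Prospect]) -> float:
    """Share of Sales-targeted prospects that the Risk model would reject."""
    targeted = [p for p in prospects if p.sales_targeted]
    if not targeted:
        return 0.0
    rejected = sum(1 for p in targeted if not p.risk_approved)
    return rejected / len(targeted)


def annual_cost_of_gap(loan_book_gbp: float, misalignment: float) -> float:
    """Back-of-envelope estimate: assumes cost is proportional to the loan book."""
    return loan_book_gbp * misalignment


# Illustrative data only: 20 targeted prospects, one of which Risk would reject.
sample = [Prospect(f"p{i}", sales_targeted=True, risk_approved=(i != 0)) for i in range(20)]

rate = misalignment_rate(sample)
cost = annual_cost_of_gap(500_000_000, rate)
print(f"Misalignment: {rate:.0%}, estimated annual cost: £{cost:,.0f}")
```

The calculation is trivial. What matters is that it needs a join across two models’ outputs, and in the scenario described neither Sales nor Risk owns that join.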
The Dutch childcare benefits scandal, the Toeslagenaffaire, shows the accountability vacuum at its most destructive. Between 2005 and 2019, the Dutch tax authority used a self-learning algorithm to flag benefit claims as fraudulent. Nationality was built into the risk model as a fraud indicator. Non-Dutch nationals received systematically higher risk scores. The algorithm adapted over time based on its own outputs, creating a discriminatory feedback loop with no meaningful human oversight.
When a claim was flagged, a civil servant was supposed to review it, but was given no information about why the system had generated the risk score. The system was a black box to its own operators. At least 35,000 parents were wrongfully accused of fraud. Families were ordered to repay amounts averaging €20,000 to €60,000, often with late fees and no option for payment arrangements. Families were driven into severe debt and over 2,000 children were taken into care. The scandal brought down Prime Minister Rutte’s government in 2021, with the state subsequently admitting institutional racism was at the root of the system’s design.[5]
Someone has to be able to say: I understand why this system makes the decisions it makes, and I am responsible for them. In the Toeslagenaffaire, no one could. In most organisations deploying AI at scale, the honest answer is the same.
AI optimises metrics. Strategy lives in what metrics don’t capture.
Uber’s surge pricing was technically coherent. During Hurricane Sandy in 2012, prices doubled. During the Sydney hostage crisis in December 2014, fares quadrupled as people fled the city centre. The algorithm was performing exactly as designed: match supply to demand. But these were moments when the brand needed to demonstrate something else entirely: solidarity, restraint, basic decency. The metric was working and the judgement was absent. The reputational damage was severe. Under pressure from the New York Attorney General, Uber had agreed in July 2014 to cap pricing during emergencies, and after Sydney it refunded affected riders.[6]
The UK’s Department for Work and Pensions offers a quieter, ongoing version of the same problem. The DWP uses AI to detect welfare fraud, aiming to reduce an estimated £8 billion in annual fraud and error. The system optimises for detection efficiency. But a freedom of information request in late 2024 revealed that the DWP’s own internal fairness analysis had found statistically significant disparities in who the system flagged for fraud investigation, related to age, disability, marital status, and nationality, and that no fairness analysis had been carried out at all for race, sex, sexual orientation, religion, pregnancy or gender reassignment. The people most exposed to wrongful flagging (disabled claimants, ethnic minorities, those with limited digital access) are the people the benefits system exists to protect. The metric is working. The strategic cost, in wrongful sanctions, eroded public trust, and harm to vulnerable people, goes unmeasured.[7]
Architects connect systems to values and consequences that don’t appear in dashboards until the damage is done.
AI generates capability. Leaders generate value.
Target, the US retailer, predicted customer pregnancies from purchasing data before customers had told their families. A statistician identified twenty-five products whose purchase patterns indicated pregnancy reliably enough to estimate due dates. The capability was impressive. The implementation, sending baby-product coupons directly to a teenager whose father did not yet know she was pregnant, became a cautionary tale about the gap between analytical power and human judgement.[8]
Target’s fix is instructive. They didn’t stop using the algorithm. They started mixing pregnancy-related offers with unrelated products, lawn mowers alongside nappies, to camouflage the targeting. The insight was preserved. The creepiness was masked. Whether that constitutes “value” or merely better-concealed surveillance depends on your definition. But the point stands: the capability arrived long before the organisation had the decision-making infrastructure to use it responsibly.
The pattern is constant across industries. Organisations deploy AI, accumulate genuinely useful insight, and then lack the governance structures to act on it responsibly. The capability grows. The value doesn’t or, worse, it turns into a liability.
Continued in Part II: The Solution - The Operating Model’s Missing Role
Part II makes the case for a specific organisational role designed to close this gap: what it does, where it sits, how it’s held accountable, and who you hire. The argument is specific enough to act on Monday morning.
[1] Ofqual, “Awarding GCSE, AS & A level, advanced extension awards and extended project qualifications in summer 2020: interim report” (13 August 2020). Approximately 39–40% of A-level results were graded down by one or more grades from the teacher-assessed grade. Independent schools’ A*/A rate rose 4.7 percentage points; secondary comprehensives rose 2.0 points. See also FFT Education Datalab analysis, 17 August 2020.
[2] Jeffrey Dastin, “Amazon scraps secret AI recruiting tool that showed bias against women,” Reuters, 10 October 2018. Reuters reported that the team was disbanded by the start of 2017 and that recruiters had reviewed the tool’s recommendations but had not relied solely on its rankings.
[3] Zillow Group, Inc., Form 8-K and shareholder letter, 2 November 2021. Q3 2021: 9,680 homes purchased, 3,032 sold; the company recorded a Homes-segment loss of $381m in Q3 with further losses guided for Q4. Zillow announced the wind-down of Zillow Offers and a workforce reduction of approximately 25%. Market-cap movement reported by Bloomberg, 3–4 November 2021.
[4] Bates & Others v Post Office Ltd [2019] EWHC 3408 (QB) (the “Horizon Issues” judgment, Fraser J., 16 December 2019); Post Office Horizon IT Inquiry, Volume 1 of the final report (Sir Wyn Williams, July 2025), which links at least 13 suicides to the scandal. Prime Minister Rishi Sunak’s “greatest miscarriages of justice” comment: House of Commons, 10 January 2024.
[5] Parliamentary Interrogation Committee on Childcare Allowances, “Ongekend onrecht” (“Unprecedented Injustice”), Dutch House of Representatives, 17 December 2020; resignation of the third Rutte cabinet, 15 January 2021. The 35,000 figure is the Dienst Toeslagen estimate (2024). The Dutch state formally acknowledged institutional racism in the system’s operation; see also Amnesty International, “Xenophobic Machines” (October 2021).
[6] NY Attorney General Eric Schneiderman, agreement with Uber on emergency surge-pricing caps, July 2014. Hurricane Sandy: prices doubled, October 2012. Sydney Lindt Cafe siege: fares rose to a minimum of A$100 (roughly 4× normal) on 15 December 2014; Uber subsequently refunded affected riders.
[7] DWP internal “fairness analysis” of the Universal Credit advances machine-learning model (February 2024), released to the Public Law Project under the Freedom of Information Act and reported by The Guardian, 6 December 2024. The £8bn fraud-and-error figure is from DWP’s “Fraud and Error in the Benefit System” statistics.
[8] Charles Duhigg, “How Companies Learn Your Secrets,” The New York Times Magazine, 16 February 2012, profiling Target statistician Andrew Pole’s pregnancy-prediction model and the subsequent shift to mixing baby-product offers with unrelated products to mask the targeting.