For awarding organisations that rely on large pools of examiners (often professionals marking scripts outside their day jobs), marking is costly, inconsistent and increasingly unsustainable. Every year, millions are spent just keeping the marking wheels turning. For training providers, where marking isn't outsourced, every hour spent marking is an hour lost to teaching, mentoring and growing enrolments. We now see providers capping student intake simply because they can't keep up with assessment workflows. Across them all (colleges included), the pain is the same: marking is slow, expensive and drains human capacity. Yet it remains central to learning outcomes and institutional sustainability.

Technology providers do offer AI-powered marking and assessment solutions. But questions arise about the need to invest in these when ChatGPT and other free LLMs are readily available. From a training provider's perspective, it's tempting to believe a free, general-purpose tool like ChatGPT can do everything. It can't. Dependence on generic, opaque AI systems is institutional short-termism, not innovation, because assessment – especially high-stakes certification for job-ready skills – is not a generic task. Organisations responsible for the integrity of high-stakes exams would agree that relying on such tools for marking is premature and potentially harmful. Even so, according to our clients, it is becoming a widespread practice.

LLMs predict text and generate plausible sentences at speed, but they don't understand knowledge and are therefore unable to evaluate mastery or reasoning. While adept at tasks like summarisation, they are neither trained nor validated for assessment marking, unlike purpose-built AI models or experienced human assessors. ChatGPT, for one, lacks structured exposure to assessment criteria and subject-matter standards.
It cannot and does not know enough to underpin credible assessment, undermining its ability to provide reliable and contextually accurate evaluations. In discipline-specific or skills-based contexts (such as accounting or apprenticeships), it may produce generic rubric-aligned feedback but often fails to identify gaps in professional knowledge, skills and behaviours that expert assessors would readily spot. Shaped by biased and largely US-centric training data, it can also perpetuate inconsistency. It is no substitute for domain-specific, evidence-based AI or human expertise built over decades.

Within organisations, assessors using ChatGPT in silos – often with good intentions – create new risks. There is little shared benchmarking, moderation or consistency, and different assessors will receive different outputs. Erroneous feedback can easily be accepted and passed on to learners. Using free LLMs for assessment may also breach internal AI policies and GDPR requirements, particularly where identifiable student data is shared. ChatGPT is not built for GDPR compliance, and many users don't know how to disable model training or use its opt-out settings. I've seen personally identifiable information (PII) and confidential learner data fed into LLMs with little awareness of the risks – something that is unacceptable for regulated, high-stakes assessments.

This isn't just about tools, but purpose. Even qualified human examiners undergo extensive training to mark against specific standards. Specialist technology providers now develop domain-specific assessment models in close collaboration with clients, capturing expert knowledge that goes beyond rubrics and improves reliability for high-stakes marking. Deriving deep insights from student submissions is equally important. Purpose-built AI assessment tools automate granular gap analysis across skills, techniques and knowledge – not just topic-level progress.
Our AI solutions are developed in close collaboration with subject-matter experts, keeping humans firmly in the loop, both during training and post-deployment. Expert examiners always retain final oversight of scores and feedback, ensuring accuracy, continuous improvement and Ofqual compliance. This is not off-the-shelf ChatGPT; it is AI specifically trained to mark all assessment types, including high-stakes ones.

Some may ask: "Well, can't we just train ChatGPT ourselves?" In reality, this doesn't scale, relies on limited internal expertise, and leaves the same unresolved issues of GDPR, inconsistency, oversight and cost.

ChatGPT has its place. But if you are serious about AI in marking, feedback and assessment, you need purpose-built technologies that embed human expertise throughout their development and deployment lifecycle, so that they truly understand your domain, your standards and your learners – delivering higher standards at lower cost than the current examiner-dependent model. Only through collaboration between domain-informed AI and expert human examiners can we harness AI's potential to enhance assessment workflows without compromising standards.