All rise for JudgeGPT

Bridget McCormack is used to correcting judges’ work. As the former chief justice of the Michigan Supreme Court, it was her job to review complaints that lower-court judges had failed to consider key evidence or rule on certain aspects of a case.

In her current job, McCormack is working on a new kind of legal decision-maker. Like a judge, it would make mistakes. But unlike many judges, it wouldn’t be burdened by more casework than it had hours in the day. It could make sure to always show its work, check that each side agreed it understood all the facts, and ensure it ruled on each issue at play. And it wouldn’t be human — it’s made of neural networks.

McCormack leads the American Arbitration Association, which has developed an AI Arbitrator to help parties settle document-based disputes in a low-cost way. The system is built on OpenAI’s models to walk parties in arbitration through their dispute and draft a decision on who should win the case and why. The system deals only with cases that rely solely on documents, and there’s a human in the loop at every stage, including in the final step of issuing an award. But McCormack believes even with these caveats, the process can make dispute resolution faster and more accessible, greasing the wheels of an overburdened legal system.

Generative AI frequently makes headlines for its failures in the courtroom. Last year, at least two federal judges had to issue mea culpas and come up with new policies after signing court orders with made-up facts, thanks to the use of generative AI. Academics warn that AI’s legal interpretations are not as straightforward as they can seem, and can either introduce false information or rely on sources that would never otherwise be legally admissible. AI tools have been shown to import or exacerbate human biases without careful consideration, and the public’s skepticism of the tools could further threaten trust in the justice system.

Optimists like McCormack, meanwhile, see huge potential upsides for bringing speedier justice to the American legal system, even as they see an enduring role for human decision-makers. “Most small and medium businesses in the United States can’t afford legal help at all, and one dispute can put them under,” she says. “So imagine giving all of those businesses a way to resolve disputes and move forward with their business in a way that they could navigate, afford, and manage on their own.” She and others are balancing a difficult question: Can a new technology improve a flawed and limited justice system when it has flaws and limitations of its own?

While high-profile failures have garnered the most attention, courts are using AI in ways that mostly fly under the radar. In a review of AI use in the courts, Daniel Ho, faculty director at Stanford’s RegLab, and former research fellow Helena Lyng-Olsen found AI was already being used in the judicial system for both administrative and judicial tasks. Administrative court staff, for example, use AI for things like processing and classifying court filings, handling basic employee or customer support, or monitoring social media keywords for threats to judicial staff. Judges or their staff might use generative AI tools for lower-risk use cases like asking a large language model (LLM) to organize a timeline of key events in a case or to search across both text and video exhibits. But they also use them for higher-risk tasks, according to Ho and Lyng-Olsen, like relying on AI for translations or transcriptions, anticipating the potential outcome of a case, and asking an LLM for legal analysis or interpretation.

Some of the technology used in courts predates the modern generative AI era. For example, judges have been using algorithmic risk assessments for years to help evaluate whether to release a defendant before trial. These tools already raised questions about whether algorithms could encode human bias. A 2016 ProPublica investigation revealed that not only were these algorithms not very good at predicting who would go on to commit violent crimes, they also disproportionately assessed Black defendants as high risk compared to white defendants, even when ProPublica controlled for other factors like criminal history and age. Newer LLM systems introduce entirely new concerns, particularly a propensity to make up information out of whole cloth — a phenomenon known as hallucination. Hallucinations have been documented in legal research tools like LexisNexis and Westlaw, which have integrated generative AI in an effort to help lawyers and judges find case law more efficiently.

Despite these risks, at least one prominent judge has promoted the use of LLMs: Judge Kevin Newsom, who sits on the 11th Circuit Court of Appeals. In 2024, Newsom issued a “modest proposal” in a concurring opinion, which he recognized “many will reflexively condemn as heresy.” Newsom’s pitch was for judges to consider that generative AI tools — when assessed alongside other sources — could help them analyze the ordinary meaning of words central to a case.

Newsom’s test case was a dispute that hinged partly on whether installing an in-ground trampoline could be considered “landscaping,” entitling it to coverage under an insurance policy. Newsom, a self-described textualist, wanted to understand the ordinary meaning of the word “landscaping.” He found myriad dictionary definitions lackluster. Photos of the in-ground trampoline didn’t strike him as “particularly ‘landscaping’-y,” but this unscientific gut feeling bothered the jurist whose entire philosophy is based on a strict adherence to the meaning of words. Then, “in a fit of frustration,” Newsom said to his law clerk, “I wonder what ChatGPT thinks about all this.”


The generative AI response, Newsom found, articulated the missing pieces he couldn’t quite put into words. He asked the chatbot for the “ordinary meaning” of landscaping, and its answer broadly described “the process of altering the visible features of an area of land, typically a yard, garden or outdoor space, for aesthetic or practical purposes,” a response Newsom said was “less nutty than I had feared” — and squared with his existing impressions. When he asked both ChatGPT and Google’s Gemini (then Bard) whether installing an in-ground trampoline could be considered landscaping, ChatGPT said yes, and Google’s agent laid out the criteria under which the description would fit.

Other factors in the case ended up mooting the need to land on a definition of landscaping, but the experiment left a lasting impression on Newsom. He acknowledged potential downsides of the technology for judicial use, including its tendency to hallucinate, the fact that it doesn’t account for “offline speech” outside of its training set, and the potential for future litigants to try to game it. But he doubted those were total “deal-killers” for his proposal that LLM outputs be considered one of several data points a judge uses to interpret language.

Newsom’s pithy opinion sounds quite simple. After all, shouldn’t a system trained on a boatload of human language have a highly representative view of how different words are used in everyday life? As Newsom pointed out, textualists already tend to read multiple dictionary definitions to understand the ordinary meaning of words relevant to a case, and “the choice among dictionary definitions involves a measure of discretion.” Judges also rarely explain why they chose one definition over another, he wrote, but under his proposal, judges should include both their own queries and the generative AI outputs to show how they arrived at a conclusion.
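
For a sense of what that documentation might look like in practice, here is a rough, hypothetical sketch in Python of a clerk querying an LLM for the ordinary meaning of a term and saving the exact prompt and output for the record. It uses OpenAI’s public Python client; the model name and the record format are illustrative assumptions, not a workflow Newsom or any court has endorsed.

```python
# Hypothetical sketch: ask an LLM for the "ordinary meaning" of a term and keep
# a verbatim record of both the query and the output, in the spirit of Newsom's
# suggestion that judges disclose how they arrived at a conclusion.
# The model name and record format are illustrative assumptions.
import json
from datetime import datetime, timezone

from openai import OpenAI  # pip install openai; requires an API key


client = OpenAI()


def ordinary_meaning_query(term: str, model: str = "gpt-4o-mini") -> dict:
    prompt = f'What is the ordinary meaning of "{term}"?'
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Preserve the exact query and output so they can be weighed alongside
    # dictionary definitions and other interpretive sources.
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "query": prompt,
        "output": response.choices[0].message.content,
    }


if __name__ == "__main__":
    record = ordinary_meaning_query("landscaping")
    print(json.dumps(record, indent=2))
```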

But recent academic research suggests that some assumptions underlying Newsom’s reasoning are flawed. There’s a “mistaken assumption … that ChatGPT or Claude are a lookup engine for American English, and that completely glosses over how these models are actually trained and tuned to provide the kind of output that Judge Newsom is getting on the platform,” says Stanford’s Ho, who co-authored a 2024 article on the subject in the Minnesota Journal of Law, Science & Technology. A model’s output can be influenced by, for example, the regional language quirks of the people who help fine-tune it, which is thought to be the reason behind ChatGPT’s strangely frequent use of the term “delve.”

Ho, with Princeton University assistant professor Peter Henderson, led a team that examined the ways corpus linguistics, or the analysis of a large amount of text, can sometimes obscure the meaning of language that judges might otherwise rely on — and “may import through the back door what at least some judges would expressly refute in the front door.” That could include drawing on foreign law that several Supreme Court justices have said is not appropriate to use to interpret the US Constitution, or reflecting “elite rhetoric” rather than the ordinary meaning of words or phrases.

Newsom admits that LLM training data can “run the gamut from the highest-minded to the lowest, from Hemmingway [sic] novels and Ph.D. dissertations to gossip rags and comment threads.” But he assumes that since “they cast their nets so widely, LLMs can provide useful statistical predictions about how, in the main, ordinary people ordinarily use words and phrases in ordinary life.”

His faith in the LLMs’ transparency might be premature. “[M]odels present researchers with a wide range of discretionary choices that can be highly consequential and hidden from judicial understanding,” Ho and Henderson wrote. Though models can make a show of explaining themselves, even their creators don’t fully know how they arrive at their outputs, which can sometimes change. “I don’t think we’re anywhere close at the present time to a point where some of these tools could be relied upon to explain how they reached the decision that they made,” says Paul Grimm, who served as a federal judge for 25 years and until recently was a law professor at Duke University, where he wrote about AI in the judicial system.

It’s tempting to think that LLMs have true understanding because of their often nuanced answers. For example, Newsom says, they can “‘understand’ context” because they can tell whether a “bat” refers to the animal or to the kind that hits a baseball. But this leaves out important attributes that contribute to true understanding. While large language models are quite good at predicting language, they can’t actually think. As Cognitive Resonance founder Benjamin Riley explained recently in The Verge, “We use language to think, but that does not make language the same as thought.” A Michigan judge recently cited the article to justify sanctions against a party that used ChatGPT to write an erroneous legal filing.

“[T]he proliferation of LLMs may ultimately exacerbate, rather than eradicate, existing inequalities in access to legal services”

Then there’s the issue of AI making stuff up. Newsom agrees that generative AI’s tendency to hallucinate is one of “the most serious objections to using LLMs in the search for ordinary meaning.” He counters that the technology is rapidly improving, and that human lawyers also skew facts, intentionally or not. But as it stands, there’s still ample evidence of hallucinations in even the most meticulous generative AI systems. In a 2024 paper in the Journal of Legal Analysis, researchers found that hallucinations of legal facts were “widespread” among the four LLMs they tested. The result, they wrote, is that “the risks are highest for those who would benefit from LLMs most—under-resourced or pro se litigants,” meaning those who opt to represent themselves in court. That led the researchers to “echo concerns that the proliferation of LLMs may ultimately exacerbate, rather than eradicate, existing inequalities in access to legal services.”

The two leading legal research tools, LexisNexis and Westlaw, have taken steps that they say should drastically reduce hallucinations within their systems. But when the same researchers later examined them in a 2025 paper in the Journal of Empirical Legal Studies, they found “the hallucination problem persists at significant levels,” despite improvements over the generalized tools. Both legal tools use a system called retrieval-augmented generation (RAG), in which the system first retrieves information from a database, then feeds it into an LLM to generate a response to the user’s prompt. But the researchers found that RAG could still be flawed, and that unique quirks of legal writing made it particularly susceptible to misinterpretation by the AI models. For example, the concept of case law is that an overall body of rulings on a topic builds upon itself and forms precedent — but that’s not as easy to pull as a single ruling in a single case. To make things even more complicated, that precedent is constantly changing as new rulings come in, and how systems handle that process is “unclear and undocumented,” Ho tells The Verge. “Thus, deciding what to retrieve can be challenging in a legal setting,” the researchers write.
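
For readers unfamiliar with the technique, here is a deliberately tiny sketch of the retrieve-then-generate shape that RAG describes. It is a generic toy with a handful of made-up passages and a naive keyword-overlap retriever, not the architecture of LexisNexis’s or Westlaw’s products; real systems search millions of documents with embeddings and far more sophisticated ranking, and the model name below is an illustrative assumption.

```python
# Toy retrieval-augmented generation (RAG) sketch. Real legal research tools
# use embedding search over huge document collections; this only shows the
# retrieve-then-generate shape described above. All passages are invented.
from openai import OpenAI  # pip install openai; requires an API key

# Stand-in "database" of passages (entirely hypothetical content).
PASSAGES = [
    "Example case A: an in-ground structure may count as an improvement to land.",
    "Example case B: insurance exclusions are construed narrowly against the insurer.",
    "Example case C: landscaping includes grading, planting, and fixed installations.",
]


def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank passages by naive word overlap with the query and return the top k."""
    words = set(query.lower().split())
    scored = sorted(
        PASSAGES,
        key=lambda p: len(words & set(p.lower().split())),
        reverse=True,
    )
    return scored[:k]


def answer(query: str) -> str:
    """Build a prompt from the retrieved passages and hand it to the LLM."""
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the question using ONLY the passages below. "
        "If they do not answer it, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}"
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    print(answer("Does installing an in-ground trampoline count as landscaping?"))
```

The researchers’ point about precedent maps onto even this toy: if the store is stale, or the retriever surfaces a dissent rather than a controlling holding, the generation step will confidently build on the wrong material.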

Both Westlaw owner Thomson Reuters and LexisNexis say their offerings have changed significantly since the study was originally published in 2024. LexisNexis Legal & Professional Chief Product Officer Jeff Pfeifer said in a statement that they’ve “significantly advanced how our AI systems are designed, evaluated, and deployed” since the research was published, and that it combines RAG with other information to “reduce the risk of unsupported answers.” Thomson Reuters said in a 2024 blog post that since the tool the researchers evaluated “was not built for, nor intended to be used for primary law legal research, it understandably did not perform well in this environment.” Westlaw’s head of product management Mike Dahn said in a statement that the technology referenced isn’t available in its platform anymore, and its newer AI research offering “is significantly more powerful and accurate than earlier AI iterations.”

Newsom posits that hallucinations from AI are a bigger issue when asking a question that has a specific answer, rather than seeking the ordinary meaning of a phrase. But some research suggests seeing an authoritative-sounding response from an LLM can contribute to confirmation bias.

Newsom was not deterred by pushback to his proposal. He issued “a sequel of sorts” in another concurring opinion months later, where he admitted to being “spooked” by the realization that LLMs could sometimes issue “subtly different answers to the exact same question.” But he ultimately concluded that the slight variations actually seemed reflective of those in real-life speech patterns, reinforcing the tools’ reliability for understanding language. “Again, just my two cents,” he wrote. “I remain happy to be shouted down.”

What’s human about judging?

Whenever a new technology is proposed to update a system as important as the legal process, there’s valid concern that it will perpetuate biases. But human judges, obviously, can bring their own flaws to the table. An infamous 2011 study found, for example, that judges made more favorable parole rulings at the beginning of the day and after a lunch break, rather than right before. “We’re completely comfortable with the idea that human judges are humans and they make mistakes,” McCormack says. “What if we could really at least eliminate most of those with a technology that shows its work? That’s a game changer.”

McCormack’s organization has seen a version of this at work through its AI Arbitrator. The tool summarizes issues and proposes a decision based on its training and the facts at hand, then lets a human arbitrator look at its results and make a final call. The idea is to let parties resolve simple disputes quickly and for a lower cost, while giving attorneys and arbitrators time to work on more cases or focus on ones that require a human touch.

“We’re completely comfortable with the idea that human judges are humans and they make mistakes”

Arbitration is different from a formal court proceeding in important ways, though aspects of the process look very similar. It’s a form of alternative dispute resolution that lets two parties resolve an issue without going to court. Parties sometimes opt for arbitration because they see it as a more flexible or lower-cost option, or want to avoid the more public nature of a formal lawsuit. Sometimes, a party is forced into arbitration due to a clause in their contract, but when that’s not the case, it’s up to the individuals or businesses to go that route, unlike a court case where one side is compelled to be there. The decisions by an arbitrator — often a retired judge, legal professional, or expert in a specific field — can be binding or nonbinding, depending on what the parties agreed to.

The AI Arbitrator is currently only available for documents-only cases in the construction industry — things like a dispute between a contractor and a building owner based on their contract. Both parties agree to use the system and submit their positions along with relevant documents to back them up. The AI Arbitrator summarizes the submissions and organizes a list of claims and counterclaims, creates a timeline of the case based on all the filings, and lays out the key issues of the case, like whether there was a valid contract in place or whether that contract was adequately fulfilled. At that stage, both sides have the chance to give feedback on whether the AI got these details right or left anything out.

That feedback, alongside the AI summaries, then gets handed to a human arbitrator — the first of several points where they enter the loop. The arbitrator reads the material and clicks through a series of screens where they can validate or edit each of the key issues in the case. The AI Arbitrator then provides an analysis for each issue, on which the human arbitrator can add feedback. The AI Arbitrator drafts a final award based on this analysis, including a rationale for the judgment. It references AAA handbooks with material from human arbitrators describing how they evaluate different parts of a case. The human arbitrator can edit and validate the AI-generated award, and then, finally, sign off on it — concluding the process.
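
To make that sequence of AI steps and human checkpoints easier to follow, here is a hypothetical outline in Python of a staged, human-in-the-loop pipeline of the kind the AAA describes. Every name, data structure, and stubbed function below is an assumption based on the description above, not the organization’s actual code.

```python
# Hypothetical outline of a staged human-in-the-loop arbitration pipeline,
# based only on the AAA's public description. All names and stubs below are
# assumptions, not the AAA's actual implementation.
from dataclasses import dataclass, field


@dataclass
class CaseRecord:
    submissions: list[str]                          # each party's position and documents
    summary: str = ""
    issues: list[str] = field(default_factory=list)
    analyses: dict[str, str] = field(default_factory=dict)
    draft_award: str = ""


def ai_summarize(case: CaseRecord) -> None:
    """AI drafts a summary, timeline, and list of claims/counterclaims (stubbed)."""
    case.summary = "[AI-generated summary and timeline]"
    case.issues = ["Was there a valid contract?", "Was the contract adequately fulfilled?"]


def party_feedback(case: CaseRecord) -> None:
    """Both sides confirm the AI understood the facts, or flag what it missed."""
    for issue in case.issues:
        print(f"Parties review: {issue}")


def human_arbitrator_review(case: CaseRecord) -> None:
    """A human validates or edits each issue, then the AI's per-issue analysis."""
    for issue in case.issues:
        case.analyses[issue] = "[AI analysis, validated or edited by the arbitrator]"


def draft_and_sign(case: CaseRecord) -> str:
    """AI drafts the award with a rationale; only a human signature concludes it."""
    case.draft_award = "[AI-drafted award with rationale]"
    return case.draft_award + "\nSigned: [human arbitrator]"


if __name__ == "__main__":
    case = CaseRecord(submissions=["Contractor's claim", "Owner's counterclaim"])
    ai_summarize(case)
    party_feedback(case)
    human_arbitrator_review(case)
    print(draft_and_sign(case))
```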

Not everyone will feel comfortable using AI to decide on the outcome of their dispute. But some might find the time and cost savings attractive, and be reassured that a human ultimately checks the work and makes the final decision. To the extent that a human arbitrator might disagree with the AI Arbitrator’s ultimate judgment, the AAA says, they’re about as likely to disagree with another human arbitrator about it.

A human arbitrator in the AI-led system gets neatly packaged summaries of documents and arguments with parties’ feedback on those summaries, while in the completely human-led process, they’d have to pore over perhaps hundreds of pages of documentation just as a starting point. The kinds of cases the AI Arbitrator works on typically take a human arbitrator 60 to 75 days to resolve, the group says, and while the tool only launched recently, it projects that disputes using the AI Arbitrator will take 30 to 45 days, and produce at least a 35 percent cost savings.


McCormack has found that the AI Arbitrator has an additional benefit: parties like how the tool makes them feel heard. Its design — which asks each side to confirm that it has understood all the relevant facts and allows them to provide additional feedback — lets people speak up if they feel like something is being lost or glossed over in arbitration. It’s an element of the technology she says she initially “underappreciated.” “I used to talk to judges all the time about how these parties just want to make sure you hear them,” she says. “That literally matters more than anything else, that they have a chance to tell you what happened.”

Reaching a fair outcome, of course, is a non-negotiable element of arbitration. But there is plenty of research about the importance of procedural justice, or ensuring that people perceive the process itself as fair and trustworthy — which can result in them gaining more trust in the legitimacy of the law.

A 2022 article in the Harvard Journal of Law & Technology (published before the rise of ChatGPT) suggests people aren’t necessarily opposed to AI judges, even if they still prefer humans. In the study, participants were asked about their perception of the fairness of a hypothetical AI judge. The participants said they viewed hypothetical proceedings before human judges as more fair than those before AI judges. But overall, they said being allowed to speak before an AI judge would be more procedurally fair than having no opportunity to speak at all. That suggests, the authors wrote, that the perceived fairness gap between human and AI judges may be at least partially offset “by introducing into AI adjudication procedural elements that might be absent from current processes, such as a hearing or an interpretable decision.”

Judges, like workers in every industry, are being made to figure out exactly what about their jobs requires a human touch

Throughout the history of the judicial system, hearing out plaintiffs and defendants and doling out justice have been considered deeply human tasks. But as AI begins to excel at many jobs that have taken humans long hours to complete, judges, like workers in every industry, are being made to figure out exactly what about their jobs requires a human touch. In his 2023 end-of-year report, US Supreme Court Chief Justice John Roberts wrote about the role he saw AI playing in the judicial system in the future. He saw some judicial activities as uniquely human: determining how sincere defendants are during sentencing, or wading through “fact-specific gray areas” to decide if a lower court “abused its discretion.” He predicted that “human judges will be around for a while. But with equal confidence I predict that judicial work—particularly at the trial level—will be significantly affected by AI.”

McCormack says there are certain disputes that “should always be resolved in courthouses and in public”: criminal cases and cases brought by citizens against the government. But for many civil disputes, she says, AI could play an important role in giving more people access to justice by making the process more efficient.

Grimm, the former judge and Duke professor, says that by the time he retired from the court, “I had been many years working seven days a week, and I was working as hard as I could and I still wished I had been more prepared than I could have been.” He rattled off a list of things AI could be useful for: outlining issues that parties expect the judge to rule on, summarizing long testimony transcripts, making a list of the facts both parties agree on based on reams of court filings, and perhaps, after a judge has written their opinion, revising it for a 12th grade reading level so that it’s more accessible to the public.

“If you want a more efficient judiciary … the easy answer is not AI. It’s appoint more federal judges”

But AI isn’t necessarily the best solution for a persistently understaffed judiciary, and it’s certainly not the only one. Cody Venzke, senior policy counsel at the American Civil Liberties Union (ACLU) National Political Advocacy Division, agrees there could be a role for the technology in certain administrative tasks, but says the issues of judicial burnout largely shouldn’t be resolved with it. “If you want a more efficient judiciary where judges can spend more time on each case, where they can do things like — God forbid — have a jury trial, the easy answer is not AI,” he says. “It’s appoint more federal judges.”

Grimm and Venzke agree that judges should never be simply checking AI’s work. “I hope that there’s never a time when the judge just tells the AI to come up with an opinion that they read and sign,” Grimm says. The line, to Grimm, is about who — or what — is influencing whom. Take using the tool to draft an opinion a judge is on the fence about, then gauging their own reaction: “I think that comes too close to the line of letting the AI get to the answer first.” That could result in confirmation bias, where the judge then downplays contrary evidence, he says. Even using AI to draft two opposing outcomes of a case to decide which is better strikes him as too risky.

“AI tools do not take an oath”

Grimm’s reasoning is based both on the facts of how generative AI tools are designed and on the unique quality of human societal ethics. “These tools are not designed to get the right answer,” he says. “They’re designed to respond to prompts and inquiries and predict what the response should be, based upon the inquiry and the data that they were tested on.” An AI tool could cite language from a real court case, for example, but it might be from a dissent, which doesn’t hold the same legal weight. But an equally important point, he says, is that “AI tools do not take an oath.”

Venzke says he’d be among the last people to praise the current judicial system as perfect. “But it’s worth underscoring that AI is not superhuman intelligence,” he says. “It’s super efficient summarizing of human knowledge.” Sometimes, even that summarizing falls flat. Venzke described a time he tried to do legal research about two neighbors’ rights to access a lake through an easement, where one was trying to build a dock. Because there was no clear ruling on such a matter in the state he was looking at, he found generative AI returned largely irrelevant results. It took him a few hours to come up with the answer on his own, mostly by interpreting the law from Supreme Court rulings and considering how other states had ruled in similar matters — something he says the technology is still not very good at consolidating effectively on its own.

It’s tempting to think a carefully calibrated machine could come out with the “right” answer in a legal case more often than not. But Grimm says thinking about such decisions as right and wrong obscures the nature of the legal system. “Oftentimes legal issues could go either way,” he says. “That’s why you can get dissent on the Supreme Court … It’s too simplistic to say, well, judges have biases.”

Still, some early research suggests that despite a largely skeptical view toward AI in judicial decision-making, some people see a potential upside over the status quo. In a 2025 paper published in the MDPI journal Behavioral Sciences, researchers from the University of Nevada, Reno set out to study how views of AI use in the judicial system might vary across racial groups. They asked participants how they felt about a judge who relied only on their own expertise, only on an AI system that uses algorithms to make a bail or sentencing determination (the tools described sound like a non-generative AI system), or a combination of the two. While participants overall viewed judges who relied only on their expertise, rather than AI, more favorably on bail and sentencing decisions, Black participants tended to perceive the AI-assisted version as more fair than their white and Hispanic counterparts did, “suggesting they may perceive AI as a tool that could enhance fairness by limiting judicial discretion.”

At the same time, research has found that judges already tend to use algorithmic tools — some of which have documented racial bias issues — to reinforce their own decisions. In a study published in 2024 in the journal Social Problems, Northwestern University researcher Sino Esthappan found that when judges were given algorithmic assessments of the risk that defendants would fail to return to court if released from jail, they mostly used them to justify the rulings they wanted to make anyway. In another 2024 analysis, researchers from Tulane, Penn State, and Auburn University found that while the AI recommendations seemed to help “balance out” judges’ tendency to dole out harsher punishments to male versus female defendants, “the AI may trigger judges’ racial biases.”

The researchers in that study had faith that “AI’s recommendations can help judges refocus and make more objective judgments.” In the cases where judges agreed with and followed through on the AI’s recommendations to offer alternative punishments to a defendant, the researchers found the lowest recidivism rates, compared to scenarios where the two were misaligned. “When the two are on the same page, judges sentence the riskiest and the least risky offenders to incarceration and alternative punishments, respectively.” As a result, the researchers recommended that judges “minimize intrinsic bias by pausing and reconsidering when their decisions deviate from AI’s recommendations.”

Even AI optimists express little desire to get rid of human judges. Hallucinations remain a persistent problem, and the tools’ value remains limited when every detail must be painstakingly checked.

“When you’re talking about something as rights-impacting as a judicial process, you don’t want 95, 99 percent accuracy. You need to be excruciatingly close to 100 percent accuracy,” Venzke says. “And until or if AI systems get to that point, they really don’t have a place to be operating, especially operating independently, in the judicial process.”

“The legal profession has been unbelievably successful at avoiding any disruption for 250 years in America”

The overall goal is to leave human judges more time to work on the cases that deserve their fullest attention, while giving the largest number of people access to justice in a time-efficient way. “The legal profession has been unbelievably successful at avoiding any disruption for 250 years in America,” McCormack says. “We’ve undergone four industrial revolutions and never updated the operating system. And when our legal system was established, there was a completely different market and the one-to-one service model, everybody had a lawyer for a dispute, was the way things worked. And that’s just not true anymore and hasn’t been true for a number of decades now.”

McCormack says colleagues who were resistant to AI even a year ago are beginning to accept it. “I would not be surprised, I don’t know if it’s in five years, or 20 years, or 40 years, if we look back and think that it was hilarious that we thought humans had to oversee all of these disputes.”
