@mapcar
What is dodgy about the BBC 'study' methodology?
1. It's not a double-blind study.
You get journalists assessing the accuracy of tools that are ALREADY taking their jobs (see the Australian Murdoch media).
Ideally, they should be assessing AI-written and human-written stories without knowing which is which.
It's exactly like the police investigating police corruption.
2. I could not find anywhere whether they used the commercial, paid versions of the #AI engines or the sideshow-attraction free public ones. They referred to them simply as "assistants", did not state which versions were used, and in two cases did not even state which LLM model was being tested. That does not speak well of their journalistic rigour, much less their preparation. I'm not even sure the journalists were aware there are significant performance differences between the commercial and free versions. It's like asking a sideshow clown for an economic projection (honka honka).
3. The "lifting of the blocks" on the websites for the duration of the test is another piece of naivete, or a malicious misrepresentation. #LLM models LEARN during training (the hint is in the name). Just lifting the gate for the duration of the test is absolutely not going to fold the website into the model. In fact, two of the engines in the test I am familiar with (o4 and Sonnet) did not even do live searches of the internet in February. And it takes on the order of 500,000 kilowatt-hours of energy to compute the multidimensional vector representations for a model.
4. Prompt engineering. Once again, naivete. Just as with googling, the quality of the response is related to the quality of the query. Virtually all of the questions are ones a first-grader might ask, e.g. "Is vaping bad for you?". And I see people with letters before and after their names quoting this study as proof that "AI is bad".
Presumably, they operate at a higher level than that when querying their own sources. You could instead ask: "What is the latest body of research on the health effects of vaping? Provide pros and cons, show the controversy, and tabulate the results by credibility."
The models tune their output to the prompt: ask a simplistic question, get a simplistic response.
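The difference between the two prompts above can be made concrete even without calling any model. The sketch below is purely illustrative: the `explicit_constraints` helper is a made-up proxy for prompt specificity, not anything the BBC study (or any LLM vendor) actually uses.

```python
# Hypothetical illustration of point 4: the same question, posed two ways.
# Neither the prompts nor the scoring heuristic comes from the BBC study.

naive = "Is vaping bad for you?"

engineered = (
    "What is the latest body of research on the health effects of vaping? "
    "Provide pros and cons, show the controversy, "
    "and tabulate the results by credibility."
)

def explicit_constraints(prompt: str) -> int:
    """Crude proxy: count the explicit instructions a prompt gives the model."""
    cues = ("latest", "pros and cons", "controversy", "tabulate", "credibility")
    return sum(cue in prompt.lower() for cue in cues)

print(explicit_constraints(naive))       # 0 constraints: the model picks the depth
print(explicit_constraints(engineered))  # 5 constraints: the model has something to tune to
```

The naive prompt gives the model nothing to anchor on, so it defaults to a generic, simplified answer; the engineered prompt pins down scope, structure, and sourcing before the model generates a word.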
5. The quality of the scoring. There is no consistency in the scoring. Each reviewer chooses how they FEEL about mapping their impression onto the quantitative scale: one may rate a response at 7, another at 2.
Since we're assessing the ACCURACY of the LLMs, maybe we should assess the accuracy of the assessment too? No?
6. Many of the flagged errors are laughable. In the vaping one, the reviewer's comment is "NHS recommends not smoking" (presumably pointing this out as an error), against a response (to a simpleton question) of "Vaping may be bad for you".
Literally all of the "inaccuracies" are trivial like that, for a kindergarten-level question.
7. Journalists write STORIES (you know, the thing LLMs do), and largely inaccurate stories at that (the very thing LLMs are accused of producing). They are the least qualified to assess their competition.
In closing:
This widely quoted study was produced by folks whose jobs are threatened, to appeal to folks who are largely unwilling, and outright hostile, to the idea of learning a nascent technology.