I wanted to extract some crime statistics broken by the type of crime and different populations, all of course normalized by the population size. I got a nice set of tables summarizing the data for each year that I requested.
When I shared these summaries I was told this is entirely unreliable due to hallucinations. So my question to you is how common of a problem this is?
I compared results from Chat GPT-4, Copilot and Grok and the results are the same (Gemini says the data is unavailable, btw :)
LLMs are totally unreliable for research. They are just probable token generators.
Especially if your looking for new data that nobody has talked about before, then your just going to get convincing hallucinations, like talking to a slightly drunk professor at a loud bar who can't ever admit they don't know something.
Example: ask a llm this "what open source software developer died in the September 11th attacks?"
It will give you names, and when you try to verify those names, you'll find out those people didn't die. It's just generating probable tokens
Tried the example, got 2 names that did die in the attacks, but they sure as hell weren't developers or anywhere near the open source sphere. Also love the classic "that's not correct" with the AI response being "ah yes, of course". Shit has absolutely 0 reflection. I mean it makes sense, people usually have doubts in their head BEFORE they write something down. The training data completely skips the thought process, LLMs can't learn to doubt.
Definitely. The thing you might want to consider as well is what you are using it for. Is it professional? Not reliable enough. Is it to try to understand things a bit better? Well, it's hard to say if it's reliable enough, but it's heavily biased just as any source might be, so you have to take that into account.
I don't have the experience to tell you how to suss out its biases. Sometimes, you can push it in one direction or another with your wording. Or with follow-up questions. Hallucinations are a thing but not the only concern. Cherrypicking, lack of expertise, the bias of the company behind the llm, what data the llm was trained on, etc.
I have a hard time understanding what a good way to double-check your llm is. I think this is a skill we are currently learning, as we have been learning how to sus out the bias in a headline or an article based on its author, publication, platform, etc. But for llms, it feels fuzzier right now. For certain issues, it may be less reliable than others as well. Anyways, that's my ramble on the issue. Wish I had a better answer, if only I could ask someone smarter than me.
Oh, here's gpt4o's take.
When considering the accuracy and biases of large language models (LLMs) like GPT, there are several key factors to keep in mind:
1. Training Data and Biases
Source of Data: LLMs are trained on vast amounts of data from the internet, books, articles, and other text sources. The quality and nature of this data can greatly influence the model's output. Biases present in the training data can lead to biased outputs. For example, if the data contains biased or prejudiced views, the model may unintentionally reflect these biases in its responses.
Historical and Cultural Biases: Since data often reflects historical contexts and cultural norms, models might reproduce or amplify existing stereotypes and biases related to gender, race, religion, or other social categories.
2. Accuracy and Hallucinations
Factual Inaccuracies: LLMs do not have an understanding of facts; they generate text based on patterns observed during training. They may provide incorrect or misleading information if the topic is not well represented in their training data or if the data is outdated.
Hallucinations: LLMs can "hallucinate" details, meaning they can generate plausible-sounding information that is entirely fabricated. This can occur when the model attempts to fill in gaps in its knowledge or when asked about niche or obscure topics.
3. Context and Ambiguity
Understanding Context: While LLMs can generate contextually appropriate responses, they might struggle with nuanced understanding, especially in cases where subtle differences in wording or context significantly change the meaning. Ambiguity in a prompt or query can lead to varied interpretations and outputs.
Context Window Limitations: LLMs have a fixed context window, meaning they can only "remember" a certain amount of preceding text. This limitation can affect their ability to maintain context over long conversations or complex topics.
4. Updates and Recency
Outdated Information: Because LLMs are trained on static datasets, they may not have up-to-date information about recent events, scientific discoveries, or new societal changes unless explicitly fine-tuned or updated.
5. Mitigating Biases and Ensuring Accuracy
Awareness and Critical Evaluation: Users should be aware of potential biases and inaccuracies and approach the output critically, especially when discussing sensitive or fact-based topics.
Diverse and Balanced Data: Developers can mitigate biases by training models on more diverse and balanced datasets and employing techniques such as debiasing algorithms or fine-tuning with carefully curated data.
Human Oversight and Expertise: Where high accuracy is critical (e.g., in legal, medical, or scientific contexts), human oversight is necessary to verify the information provided by LLMs.
6. Ethical Considerations
Responsible Use: Users should consider the ethical implications of using LLMs, especially in contexts where biased or inaccurate information could cause harm or reinforce stereotypes.
In summary, while LLMs can provide valuable assistance in generating text and answering queries, their accuracy is not guaranteed, and their outputs may reflect biases present in their training data. Users should use them as tools to aid in tasks, but not as infallible sources of truth. It is essential to apply critical thinking and, when necessary, consult additional reliable sources to verify information.
I work on a project where we are trying to analyse financial data. We use Claude and llama and they are really good. We needed a few months to achieve 87% reliability.
For our application that's probably almost enough. For an application that needs 100% all the time, every time, that quite a lot away.
If you're using an LLM, you should limit the output via a grammar to something like json, jsonl, or csv so you can load it into scripts and validate that the generated data matches the source data. Though at that point you might as well just parse the raw data and do it yourself. If I were you, I'd honestly use something like pandas/polars or even excel to get it done reliably without people bashing you for using the forbidden technology even if you can 100% confirm that the data is real and not hallucinated.
I also wouldn't use any cloud LLM solution like OpenAI, Gemini, Grok, etc. Since those can change and are really hard to validate and give you little to no control of the model. I'd recommend using a local solution like running an open weight model like Mistral Nemo 2407 Instruct locally using llama.cpp or vLLM since the entire setup will not change unless you manually go in and change something. We use a custom finetuned version of Mixtral 8x7B Instruct at work in a research setting and it works very well for our purposes (translation and summarization) despite what critics think.
Tl;dr Use pandas/polars if you want 100% reliable (Human error not accounted). LLMs require lots of work to get reliable output from
Edit: There's lots of misunderstanding for LLMs. You're not supposed to use the bare LLM for any tasks except extremely basic ones that could be done by hand better. You need to build a system around them for your specific purpose. Using a raw LLM without a Retrieval Augmented Generation (RAG) system and complaining about hallucinations is like using the bare ass Linux kernel and complaining that you can't use it as a desktop OS. Of course an LLM will hallucinate like crazy if you give it no data. If someone told you that you have to write a 3 page paper on the eating habits of 14th century monarchs in Europe and locked you in a room with absolutely nothing except 3 pages of paper and a pencil, you'd probably write something not completely accurate. However, if you got access to the internet and a few databases, you could write something really good and accurate. LLMs are exceptionally good at summarization and translation. You have to give them data to work with first.
If generation temperature is non-zero (which it often is), there is inherent randomness to the output. So even if the first number in a statistic should be 1, sometimese it will just randomly pick any other plausible number. Even if the network always picks the correct token as the highest probability, it's basically doing a coin toss for every token to make answers more creative.
That's on top of hoping the LLM has even seen that data during training AND managed to memorize it during training AND that the networks just happens to be able to reproduce the correct data given your prompt (it might not be able to for a different prompt).
If you want any reliability at all, you need to use RAG AND also you yourself have to double check all the references it quotes (if it even has that capability).
Even if it has all the necessary information to answer correctly in it's context window, it can still answer incorrectly.
None of the current models are anywhere close to producing trustworthy output 100% of the time.
The least unreliable LLM I've found by far is perplexity, in the Pro mode. (By the way, if you want to try it out. You get a few free uses a day).
The reason is because the Pro mode doesn't retrieve and spit out information from its internal memory bank, but instead, it uses that information to launch multiple search queries, then summarises the pages it finds, and then gives you that information.
Other LLMs try to answer "from memory" and then add some links at the bottom for fact checking but usually Perplexity's answers come straight from the web so they're usually quite good.
However, I still check (depending on how critical the task is) that the tidbit of information has one or two links next to it, that the links talk about the right thing, and I verify the data myself if it's actually critical that it gets it right. I use it as a beefier search engine, and it works great because it limits the possible hallucinations to the summarisation of pages. But it doesn't eliminate the possibility completely so you still need to do some checking.