Devs are aware. This was a quick-and-dirty prototype, and they already knew the issues with using ChatGPT; they did it to get something working ASAP. In an interview (in Danish), the devs acknowledged this and said they're moving toward an LLM developed in France (I forget the name, but it's irrelevant to the point that they're dropping ChatGPT).
Plus, I don't want some random server burning through a couple hundred watt-hours when scanning the barcode and running it against a database would not just suffice but also be more accurate.
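To make the point concrete, here's a minimal sketch of what that alternative looks like: a barcode is just a key, so the whole "query" is a dictionary/database lookup. The barcodes and product fields below are made up for illustration.

```python
# Hypothetical sketch: barcode -> product record lookup.
# A real app would query a product database (e.g. an SQL table keyed
# by barcode); a dict stands in for that here. All data is invented.
PRODUCTS = {
    "5701234567890": {"name": "Rye bread", "vegan": True},
    "5700987654321": {"name": "Milk chocolate", "vegan": False},
}

def lookup(barcode: str):
    """Return the product record for a scanned barcode, or None if unknown."""
    return PRODUCTS.get(barcode)

print(lookup("5701234567890"))
print(lookup("0000000000000"))  # unknown barcode -> None
```

A lookup like this is exact (no hallucination possible) and costs effectively nothing per scan, which is the whole contrast with running an LLM server-side.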
Phi-4 is the only model I'm aware of that was deliberately trained to refuse to answer instead of hallucinating. It's mind-blowing to me that that isn't standard; everyone is trying to maximize benchmarks at all costs.
I wonder if diffusion LLMs will hallucinate less, since they inherently have error correction built into their inference process.