They didn't ask it to produce incorrect output, and the prompts aren't leading it to an incorrect answer. It does highlight an important limitation of LLMs, which is that they don't think; they just produce words based on probability.
However, it's wrong to think that an LLM is useless just because it's limited. It's important to understand the flaws so we can make them less common through how we use the tool.
For example, you can ask it to think everything through step by step. By writing out its intermediate steps, it puts more detail into its own context window, which can reduce mistakes. In this case it could write out the letters one at a time, each with a number next to it, so that the letters and the count sit together in the context when it answers the question. You could even tell it to write programs to assist itself: have it generate a letter-counting program, run that, and report the result the program produces.
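To make that concrete, here's a rough sketch of the kind of helper a model could write for itself. The function names and the example word are my own placeholders, not something from the original prompts. The first function is the letter-counting program; the second produces the numbered letter listing described above.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of one letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


def numbered_letters(word: str) -> str:
    """List each letter next to its 1-based position, one per line."""
    return "\n".join(f"{i}. {ch}" for i, ch in enumerate(word, start=1))


if __name__ == "__main__":
    # "banana" is only an illustrative word, not the one from the thread.
    print(numbered_letters("banana"))
    print(count_letter("banana", "a"))  # prints 3
```

The point is that the counting is done by ordinary deterministic code, so the model only has to get the program right, not the arithmetic.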
People can point out flaws in the technology all they want, but smarter people are going to see the potential and figure out how to work around the flaws.
Yeah, which is why I get so aggravated when someone says that prompt engineering is pointless or not a real skill. It's a rapidly evolving discipline with lots of active research.
I pointed out general strategies to make it more accurate without supervision. Getting LLMs to be reliable enough to use without supervision will be a matter of adding multiple layers of safeguards.