Sadly that sort of thing got so common where I work that I'll run the tests three times before considering looking into the error message to see if it is something I broke.
From time to time we take some days just to fix tests with inconsistent results, but there's always more popping up.
That's why I kinda don't like Python and JavaScript anymore. Every time I want types for a library it's gonna take me time to get it working. For every serious project I do, I use a strongly typed language.
Just create a al Inter rule that rejects Any types and a pre-commit hook that refuses the commit if the linter fails. Sometimes the brute force approach is the best way to teach