Every test case, input, and result is public and reproducible. 155 tests across 7 categories. Full data on GitHub.