Title: The minimax risk in testing uniformity under missing ball alternatives
Abstract: We study the problem of "uniformity testing," i.e., testing the goodness of fit of categorical data to the uniform distribution over the categories. As the class of alternative hypotheses, we consider the removal of an $\ell_p$ ball of radius $\epsilon$ around the uniform rate sequence, for $p \leq 2$. When the number of samples $n$ and the number of categories $N$ go to infinity while $\epsilon$ is small, we show that the minimax risk $R_\epsilon^*$ in testing based on the sample's histogram (number of absent categories, singletons, collisions, ...) asymptotes to
$2\Phi(-n N^{2-2/p} \epsilon^2/\sqrt{8N})$, where $\Phi$ is the standard normal CDF. As it turns out, the minimax test relies on collisions in the very-small-sample limit but otherwise behaves like the chi-squared test. Empirical studies over a range of problem parameters show that our estimate is accurate in finite samples and that the minimax test is significantly better than the chi-squared test or a test that uses only collisions.
Our result settles several known challenges in uniformity and "identity" testing from the past two decades. In particular, it allows comparing the many estimators previously proposed for this problem at the level of constants, rather than merely at the rate of convergence of the risk or the scaling order of the sample complexity.
Our analysis develops several new methods by adapting techniques previously used by Ingster and Suslina for Gaussian signal detection.
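As an illustration, the asymptotic risk formula from the abstract can be evaluated numerically. Below is a minimal sketch (the function name and parameter choices are ours, not from the paper) that computes $2\Phi(-n N^{2-2/p} \epsilon^2/\sqrt{8N})$ using the identity $2\Phi(-x) = \mathrm{erfc}(x/\sqrt{2})$:

```python
import math

def asymptotic_minimax_risk(n: float, N: float, eps: float, p: float = 2.0) -> float:
    """Evaluate the asymptotic minimax risk 2*Phi(-n * N^(2-2/p) * eps^2 / sqrt(8N)),
    where Phi is the standard normal CDF. Hypothetical helper for illustration only.
    """
    x = n * N ** (2.0 - 2.0 / p) * eps ** 2 / math.sqrt(8.0 * N)
    # 2*Phi(-x) = erfc(x / sqrt(2)), avoiding a dependency on scipy
    return math.erfc(x / math.sqrt(2.0))
```

Note that at $\epsilon = 0$ the formula gives $2\Phi(0) = 1$ (the null and alternative are indistinguishable), and the risk decreases monotonically as $n$ or $\epsilon$ grows, consistent with the asymptote stated above.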