This package contains Matlab implementations of various kernel-based statistical hypothesis tests for the two-sample problem, as described in GreEtAl07, GreEtAl09, and GreEtAl12.

We propose to test whether distributions P and Q are different on the basis of samples drawn from each of them, by finding a smooth function (the witness function) which is large on the points drawn from P, and small (as negative as possible) on the points from Q. We use as our test statistic the difference between the mean function values on the two samples, or maximum mean discrepancy (MMD): when this is large, the samples are likely from different distributions. Smoothness is enforced by restricting the witness function to a unit ball in a reproducing kernel Hilbert space. The MMD is an instance of an integral probability metric.
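For reference, the quantities described above can be written out as follows. This is a sketch in standard notation rather than a quotation from the papers: the symbols $\mathcal{F}$ (the unit ball in a reproducing kernel Hilbert space with kernel $k$), $m$ and $n$ (the sample sizes), and $x_i$, $y_j$ (the sample points from P and Q) are introduced here for illustration.

$$
\mathrm{MMD}[\mathcal{F}, P, Q] = \sup_{f \in \mathcal{F}} \left( \mathbf{E}_{x \sim P} f(x) - \mathbf{E}_{y \sim Q} f(y) \right)
$$

$$
\mathrm{MMD}^2[\mathcal{F}, P, Q] = \mathbf{E}_{x, x'} k(x, x') - 2\, \mathbf{E}_{x, y} k(x, y) + \mathbf{E}_{y, y'} k(y, y')
$$

$$
f^{*}(t) \propto \mathbf{E}_{x \sim P} k(x, t) - \mathbf{E}_{y \sim Q} k(y, t)
$$

$$
\mathrm{MMD}_b^2 = \frac{1}{m^2} \sum_{i,j=1}^{m} k(x_i, x_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j) + \frac{1}{n^2} \sum_{i,j=1}^{n} k(y_i, y_j)
$$

The first line is the definition as a supremum over the unit ball, the second is the equivalent expression in terms of kernel expectations (with $x'$ and $y'$ independent copies of $x$ and $y$), the third is the witness function up to normalisation, and the last is the biased empirical estimate obtained by replacing expectations with sample means.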

Four strategies may be used to calculate the test threshold:

• Bootstrap (mmdTestBoot.m) uses bootstrap resampling on the aggregated data to obtain a test threshold, as described in GreEtAl12. This option is recommended for first-time users: it is computationally costly, but does not require checks for possible failure conditions.
• Fast, consistent test (mmdTestSpec.m) uses the eigenvalues of the joint kernel matrix over both samples to obtain a consistent estimate of the null distribution, as described in GreEtAl09. Before using this option, check that there are sufficient data available to avoid truncation of the eigenspectrum (see Figure 2, Section 4 in GreEtAl09 for details).
• Moment matching using Pearson curves (mmdTestPears.m) fits Pearson curves to the first three moments (and uses a lower bound on the fourth): see Section 5 of GreEtAl12. This is the slowest of the tests, and can fail under some circumstances, since it is not guaranteed to be consistent (see Figure 1, Section 4 in GreEtAl09). That said, it can be more accurate than the bootstrap at small sample sizes (roughly speaking, fewer than 100 points from each of P and Q, although this depends on the distributions; see Section 8.2 of GreEtAl12). Requires the Matlab statistics toolbox.
• Gamma test (mmdTestGamma.m) uses a Gamma approximation to the null distribution, as described in Eq. 8 of GreEtAl09. Very fast to run, but it has no guarantees of consistency, and can fail in some circumstances (see Figure 1, Section 4 in GreEtAl09).
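To make the resampling idea behind the first strategy concrete, here is a minimal sketch in plain Python. It is not the code in mmdTestBoot.m and does not reproduce its interface: the function names, the Gaussian kernel with fixed bandwidth `sigma`, and the default of 200 shuffles are all assumptions made for illustration. The sketch pools the two samples, repeatedly reassigns the pooled points to two groups at random (a permutation of the aggregated data, used here as a simple stand-in for the resampling step), and takes the (1 - alpha) quantile of the resulting biased MMD² values as the test threshold.

```python
import math
import random


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_biased(K, idx_x, idx_y):
    """Biased MMD^2 estimate from a precomputed kernel matrix K.

    idx_x and idx_y are the row/column indices assigned to each sample.
    """
    m, n = len(idx_x), len(idx_y)
    kxx = sum(K[i][j] for i in idx_x for j in idx_x) / (m * m)
    kyy = sum(K[i][j] for i in idx_y for j in idx_y) / (n * n)
    kxy = sum(K[i][j] for i in idx_x for j in idx_y) / (m * n)
    return kxx + kyy - 2.0 * kxy


def mmd_resampling_test(X, Y, alpha=0.05, n_shuffles=200, sigma=1.0, seed=0):
    """Hypothetical helper: two-sample test with a resampling threshold.

    Returns (observed statistic, threshold, reject flag).
    """
    rng = random.Random(seed)
    Z = list(X) + list(Y)  # aggregated data
    m, N = len(X), len(Z)

    # Kernel matrix over the pooled sample, computed once and reused.
    K = [[gaussian_kernel(Z[i], Z[j], sigma) for j in range(N)] for i in range(N)]

    observed = mmd2_biased(K, list(range(m)), list(range(m, N)))

    # Null distribution: statistic under random reassignment of pooled points.
    idx = list(range(N))
    null_stats = []
    for _ in range(n_shuffles):
        rng.shuffle(idx)
        null_stats.append(mmd2_biased(K, idx[:m], idx[m:]))
    null_stats.sort()

    # Empirical (1 - alpha) quantile of the resampled statistics.
    q = int(math.ceil((1.0 - alpha) * n_shuffles)) - 1
    threshold = null_stats[max(0, min(n_shuffles - 1, q))]
    return observed, threshold, observed > threshold
```

For example, `mmd_resampling_test(X, Y)` with `X` and `Y` given as lists of tuples returns the observed statistic, the threshold, and whether the null hypothesis P = Q is rejected at level `alpha`. The sketch needs on the order of N² kernel evaluations plus N² additions per shuffle, which illustrates why this strategy is described as computationally costly.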

An earlier version of this test was proposed in BorEtAl06. The current test estimates the null distribution more accurately, and should be used in preference to the earlier algorithm.

## Old Code

An earlier version of the code may be downloaded here.

This code contains an additional test option: a large deviation bound, which provides a test with non-asymptotic, distribution-free guarantees of performance. In practice, the resulting test is generally too conservative, and does less well than any of the approaches above. It is included here to permit the reproduction of results in GreEtAl07.

The archive contains two files: mmd.m is the main code, and U4thmoment.c contains additional optimised C code for one of the test options. While the algorithm runs in standalone form, it is also possible to use it with the Spider machine learning toolbox. The code was written by Malte Rasch.

## References

[GreEtAl12] Gretton, A., K. Borgwardt, M. Rasch, B. Schoelkopf and A. Smola: A Kernel Two-Sample Test. JMLR 2012.

[GreEtAl09] Gretton, A., K. Borgwardt, M. Rasch, B. Schoelkopf and A. Smola: A Fast, Consistent Kernel Two-Sample Test. NIPS 2009.

[GreEtAl07] Gretton, A., K. Borgwardt, M. Rasch, B. Schoelkopf and A. Smola: A Kernel Method for the Two-Sample-Problem. NIPS 2006.

[BorEtAl06] Borgwardt, K., A. Gretton, M. Rasch, H.-P. Kriegel, B. Schoelkopf and A. Smola: Integrating structured biological data by Kernel Maximum Mean Discrepancy. Bioinformatics 22(14), 1-9 (2006).

## Contact

arthur.gretton@gmail.com