Datasets

Access to example datasets and distributions.

realkd.datasets.noisy_parity(n, d=3, variance=0.25, as_df=True, random_seed=None)

Generates observations from a mixture model of Gaussian clusters centered at the nodes of the hypercube \(\{-1, 1\}^d\), with each observation labelled according to the parity of its cube node.

That is,

\begin{align*} C &\sim \mathrm{Unif}(\{-1, 1\}^d)\\ X | C &\sim \mathrm{Norm}(C, \sigma^2 I_d)\\ Y | C &= \prod_{i=1}^d C_i \end{align*}
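
The sampling scheme can be sketched directly in NumPy. This is a minimal illustration, not realkd's actual implementation: the function name sample_noisy_parity is hypothetical, it assumes cluster centers in \(\{-1, 1\}^d\) as stated in the description, and it treats the variance argument as \(\sigma^2\). It will not necessarily reproduce the exact numbers that noisy_parity returns for a given seed.

```python
import numpy as np


def sample_noisy_parity(n, d=3, variance=0.25, random_seed=None):
    """Hypothetical sketch of the generative model (not realkd's code)."""
    rng = np.random.default_rng(random_seed)
    # C ~ Unif({-1, 1}^d): draw one hypercube node per observation
    c = rng.choice([-1, 1], size=(n, d))
    # X | C ~ Norm(C, sigma^2 I_d): isotropic Gaussian noise around the node
    x = c + rng.normal(scale=np.sqrt(variance), size=(n, d))
    # Y | C = prod_i C_i: parity label in {-1, 1}
    y = c.prod(axis=1)
    return x, y


x, y = sample_noisy_parity(1000, random_seed=0)
```

Here x is a plain array of shape (n, d) and y an integer array of labels, which corresponds to what the description of as_df=False suggests.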

For example:

>>> x, y = noisy_parity(10, random_seed=0)
>>> x
         x1        x2        x3
0  0.633866  0.727871  0.841850
1 -0.794185 -0.478743 -1.064267
2 -0.316768 -1.332597 -0.824245
3  1.451735  1.047006  0.628250
4  0.539137  0.771137  1.110098
5  0.495191  0.895412  0.920387
6  1.270423  1.107330 -0.822314
7  0.673086  0.935193 -0.608012
8 -0.253284  0.370467  1.756962
9 -0.327062  1.390656  1.132228
>>> y
0    1
1   -1
2   -1
3    1
4    1
5    1
6   -1
7   -1
8   -1
9   -1
dtype: int64

Parameters
  • n – number of observations

  • d – dimension of data

  • variance – variance of the clusters

  • as_df – whether to wrap return value in pandas dataframe/series

  • random_seed – seed passed to np.random.default_rng

Returns

tuple (x, y), where x is the dataframe/matrix of observations and y is the corresponding series/array of labels in \(\{-1, 1\}\)