Sampling procedures for testing models (testing)

class Orange.evaluation.testing.Results(data=None, *, nmethods=None, nrows=None, nclasses=None, domain=None, row_indices=None, folds=None, score_by_folds=True, learners=None, models=None, failed=None, actual=None, predicted=None, probabilities=None, store_data=None, store_models=None)

Class for storing predictions in model testing.

data

Data used for testing.

Type

Optional[Table]

models

A list of induced models.

Type

Optional[List[Model]]

row_indices

Indices of rows in data that were used in testing, stored as a numpy vector of length nrows. The values actual[i], predicted[m, i] and probabilities[m, i] all refer to the i-th test instance: the instance itself is data[row_indices[i]], its actual class is actual[i], and the prediction by the m-th method is predicted[m, i].

Type

np.ndarray

nrows

The number of test instances (including duplicates); nrows equals the length of row_indices and actual, and the second dimension of predicted and probabilities.

Type

int

actual

True values of the target variable, stored in a vector of length nrows.

Type

np.ndarray

predicted

Predicted values of the target variable, stored in an array of shape (number-of-methods, nrows).

Type

np.ndarray

probabilities

Predicted probabilities (for discrete target variables), stored in an array of shape (number-of-methods, nrows, number-of-classes).

Type

Optional[np.ndarray]

folds

A list of indices (or slice objects) corresponding to the testing data subsets. row_indices[folds[i]] contains the row indices used in fold i, so data[row_indices[folds[i]]] is the corresponding testing data.

Type

List[Slice or List[int]]
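The index relationships between row_indices, actual, predicted and folds can be illustrated with a small, made-up example. The sketch below uses plain Python lists in place of the numpy arrays described above; the values and the helper accuracy_by_fold are hypothetical and are not part of Orange.

```python
# Plain lists stand in for the numpy arrays stored in Results.
# Two methods were tested on six rows, split into two folds.
row_indices = [4, 0, 2, 5, 1, 3]          # rows of `data` used for testing
actual = [1, 0, 0, 1, 1, 0]               # actual[i] belongs to data[row_indices[i]]
predicted = [[1, 0, 1, 1, 1, 0],          # predicted[m][i]: method m, test instance i
             [0, 0, 0, 1, 0, 0]]
folds = [slice(0, 3), slice(3, 6)]        # positions in row_indices for each fold


def accuracy_by_fold(actual, predicted, folds):
    """Return scores[m][f]: accuracy of method m on fold f (hypothetical helper)."""
    scores = []
    for method_predictions in predicted:
        per_fold = []
        for fold in folds:
            truth = actual[fold]
            guess = method_predictions[fold]
            hits = sum(t == g for t, g in zip(truth, guess))
            per_fold.append(hits / len(truth))
        scores.append(per_fold)
    return scores


scores = accuracy_by_fold(actual, predicted, folds)
# Rows of the original data that were tested in each fold.
fold_rows = [row_indices[fold] for fold in folds]
```

Because folds index into row_indices (and hence into actual and predicted), per-fold scores can be computed without touching the original data table.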

get_augmented_data(model_names, include_attrs=True, include_predictions=True, include_probabilities=True)

Return the test data table augmented with meta attributes containing predictions, probabilities (if the task is classification) and fold indices.

Parameters
  • model_names (list of str) – names of models

  • include_attrs (bool) – if set to False, original attributes are removed

  • include_predictions (bool) – if set to False, predictions are not added

  • include_probabilities (bool) – if set to False, probabilities are not added

Returns

data augmented with predictions, probabilities and fold indices

Return type

Orange.data.Table

split_by_model()

Split evaluation results by models.

The method generates instances of Results, each containing the data for a single model.
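Conceptually, splitting by model amounts to slicing the method axis of predicted and probabilities while sharing actual. The following is only a sketch of that idea with plain lists and dictionaries; the function split_by_model_sketch is hypothetical, and it assumes that each per-model result keeps a method axis of length 1.

```python
def split_by_model_sketch(actual, predicted, probabilities=None):
    """Yield one dict per method (hypothetical stand-in for per-model Results).

    Assumption: each part keeps the leading method axis, with length 1.
    """
    for m in range(len(predicted)):
        yield {
            "actual": actual,                     # shared by all parts
            "predicted": predicted[m:m + 1],      # shape (1, nrows)
            "probabilities": (None if probabilities is None
                              else probabilities[m:m + 1]),
        }


actual = [1, 0, 1]
predicted = [[1, 0, 0], [1, 1, 1]]
parts = list(split_by_model_sketch(actual, predicted))
```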

class Orange.evaluation.testing.CrossValidation(k=10, stratified=True, random_state=0, store_data=False, store_models=False, warnings=None)

K-fold cross validation

k

number of folds (default: 10)

Type

int

random_state

Seed for the random number generator (default: 0). If set to None, a different seed is used each time.

Type

int

stratified

Flag deciding whether to perform stratified cross-validation. If True but the class sizes don’t allow it, non-stratified validation is used instead and a list of warning messages is added to the Results.

Type

bool

get_indices(data)

Return a list of arrays of indices of test data instances.

For example, in k-fold CV, the result is a list with k elements, each containing approximately len(data) / k nonoverlapping indices into data.

This method is abstract and must be implemented in derived classes unless they provide their own implementation of the __call__ method.

Parameters

data (Orange.data.Table) – test data

Returns

a list of arrays of indices into data

Return type

indices (list of np.ndarray)
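The shape of the value returned by get_indices can be illustrated with a minimal, non-stratified sketch. It is not Orange's implementation: the function kfold_test_indices is hypothetical, works on a row count rather than an Orange.data.Table, and returns plain lists instead of numpy arrays.

```python
import random


def kfold_test_indices(n_rows, k=10, random_state=0):
    """Return k lists of test indices that together cover range(n_rows).

    Hypothetical, non-stratified sketch; fold sizes differ by at most one.
    """
    indices = list(range(n_rows))
    # A fixed seed gives the same folds on every call; None gives a new split.
    random.Random(random_state).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [sorted(indices[i::k]) for i in range(k)]


folds = kfold_test_indices(23, k=5)
```

Each row appears in exactly one test fold, so every fold holds approximately n_rows / k indices.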

class Orange.evaluation.testing.CrossValidationFeature(feature=None, store_data=False, store_models=False, warnings=None)

Cross validation with folds according to values of a feature.

feature

the feature defining the folds

Type

Orange.data.Variable

get_indices(data)

Return a list of arrays of indices of test data instances.

For example, in k-fold CV, the result is a list with k elements, each containing approximately len(data) / k nonoverlapping indices into data.

This method is abstract and must be implemented in derived classes unless they provide their own implementation of the __call__ method.

Parameters

data (Orange.data.Table) – test data

Returns

a list of arrays of indices into data

Return type

indices (list of np.ndarray)
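When folds are defined by a feature, each distinct value of that feature yields one fold. A minimal sketch of this grouping follows; the function indices_by_feature_value is hypothetical and takes a plain list of feature values rather than an Orange.data.Table.

```python
def indices_by_feature_value(values):
    """Group row indices by feature value; one test fold per distinct value.

    Hypothetical sketch: `values` holds the fold-defining feature's value
    for each row.
    """
    groups = {}
    for row, value in enumerate(values):
        groups.setdefault(value, []).append(row)
    # Dicts preserve insertion order, so folds follow first appearance.
    return list(groups.values())


fold_feature = ["a", "b", "a", "c", "b"]
folds = indices_by_feature_value(fold_feature)
```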

class Orange.evaluation.testing.LeaveOneOut(*, store_data=False, store_models=False)

Leave-one-out testing

get_indices(data)

Return a list of arrays of indices of test data instances.

For example, in k-fold CV, the result is a list with k elements, each containing approximately len(data) / k nonoverlapping indices into data.

This method is abstract and must be implemented in derived classes unless they provide their own implementation of the __call__ method.

Parameters

data (Orange.data.Table) – test data

Returns

a list of arrays of indices into data

Return type

indices (list of np.ndarray)

static prepare_arrays(data, indices)

Prepare folds, row_indices and actual.

The method is used by __call__. While functional, it may be overridden in subclasses for speed-ups.

Parameters
  • data (Orange.data.Table) – data used for testing

  • indices (list of vectors) – indices of data instances in each test sample

Returns
  • folds (np.ndarray) – see class documentation

  • row_indices (np.ndarray) – see class documentation

  • actual (np.ndarray) – see class documentation
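How folds, row_indices and actual relate to the per-fold indices can be sketched as follows. This is an illustrative stand-in, not the library's code: prepare_arrays_sketch is a hypothetical function that takes a plain list of target values instead of an Orange.data.Table and returns lists instead of numpy arrays.

```python
def prepare_arrays_sketch(targets, indices):
    """Build folds, row_indices and actual from per-fold test indices.

    Hypothetical sketch: `targets[r]` is the target value of row r and
    `indices` is the list returned by a get_indices-style method.
    """
    folds = []
    row_indices = []
    for fold_rows in indices:
        start = len(row_indices)
        row_indices.extend(fold_rows)
        # Each fold is a contiguous slice of row_indices.
        folds.append(slice(start, len(row_indices)))
    actual = [targets[row] for row in row_indices]
    return folds, row_indices, actual


targets = ["a", "b", "c", "d"]
folds, row_indices, actual = prepare_arrays_sketch(targets, [[2, 0], [1, 3]])

# Leave-one-out is the special case with one single-row fold per instance.
loo_indices = [[row] for row in range(len(targets))]
loo_folds, loo_rows, loo_actual = prepare_arrays_sketch(targets, loo_indices)
```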

class Orange.evaluation.testing.ShuffleSplit(n_resamples=10, train_size=None, test_size=0.1, stratified=True, random_state=0, store_data=False, store_models=False)

Test by repeated random sampling

n_resamples

number of repetitions

Type

int

test_size

If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. By default, the value is set to 0.1. The default will change in scikit-learn version 0.21: it will remain 0.1 only if train_size is unspecified, otherwise it will complement the specified train_size. (From the documentation of sklearn.model_selection.StratifiedShuffleSplit.)

Type

float, int, None

train_size

If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None (the default), the value is automatically set to the complement of the test size. (From the documentation of sklearn.model_selection.StratifiedShuffleSplit.)

Type

float, int, None

stratified

Flag deciding whether to perform stratified sampling.

Type

bool

random_state

Seed for the random number generator (default: 0). If set to None, a different seed is used each time.

Type

int

get_indices(data)

Return a list of arrays of indices of test data instances.

For example, in k-fold CV, the result is a list with k elements, each containing approximately len(data) / k nonoverlapping indices into data.

This method is abstract and must be implemented in derived classes unless they provide their own implementation of the __call__ method.

Parameters

data (Orange.data.Table) – test data

Returns

a list of arrays of indices into data

Return type

indices (list of np.ndarray)
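The interplay of test_size and train_size (float as a proportion, int as an absolute count, None as the complement of the other) can be sketched as below. The helper resolve_split_sizes is hypothetical, and the rounding directions (up for the test split, down for the train split) are an assumption rather than a statement about the underlying library.

```python
import math


def resolve_split_sizes(n_rows, test_size=0.1, train_size=None):
    """Turn test_size/train_size into absolute counts (hypothetical helper)."""

    def to_count(size, rounder):
        if size is None:
            return None
        if isinstance(size, float):
            # Proportion of the data; rounding direction is an assumption.
            return int(rounder(size * n_rows))
        return size  # already an absolute number of rows

    n_test = to_count(test_size, math.ceil)
    n_train = to_count(train_size, math.floor)
    if n_test is None and n_train is None:
        raise ValueError("specify test_size, train_size, or both")
    if n_test is None:
        n_test = n_rows - n_train      # complement of the train size
    if n_train is None:
        n_train = n_rows - n_test      # complement of the test size
    if n_train + n_test > n_rows:
        raise ValueError("train and test sizes exceed the number of rows")
    return n_train, n_test


sizes_from_proportion = resolve_split_sizes(100, test_size=0.25)
sizes_from_train_only = resolve_split_sizes(100, test_size=None, train_size=0.5)
sizes_from_counts = resolve_split_sizes(100, test_size=10, train_size=20)
```

When both sizes are given and sum to less than the number of rows, the remaining rows are simply left out of that resample.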

class Orange.evaluation.testing.TestOnTestData(*, store_data=False, store_models=False)

Test on separately provided test data

Note that the class has a different signature for __call__.

class Orange.evaluation.testing.TestOnTrainingData(*, store_data=False, store_models=False)

Test on training data

Orange.evaluation.testing.sample(table, n=0.7, stratified=False, replace=False, random_state=None)

Samples data instances from a data table. Returns the sample and a table containing the instances of the input data table that are not in the sample. The function relies on several sampling functions from scikit-learn.

table : data table

A data table from which to sample.

n : float, int (default = 0.7)

If float, should be between 0.0 and 1.0 and represents the proportion of data instances in the resulting sample. If int, n is the number of data instances in the resulting sample.

stratified : bool, optional (default = False)

If true, sampling will try to consider class values and match distribution of class values in train and test subsets.

replace : bool, optional (default = False)

Sample with replacement.

random_state : int or RandomState

Pseudo-random number generator state used for random sampling.
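The two return values (the sample and the instances left out of it) can be illustrated on row indices alone. The sketch below is not the library's implementation: sample_rows is a hypothetical function that ignores stratification, uses Python's random module instead of scikit-learn, and assumes that a float n is rounded to the nearest whole number of rows.

```python
import random


def sample_rows(n_rows, n=0.7, replace=False, random_state=None):
    """Return (sampled row indices, indices of rows not in the sample).

    Hypothetical, non-stratified sketch operating on row indices only.
    """
    rng = random.Random(random_state)
    # Float n is a proportion (rounding is an assumption); int n is a count.
    size = int(round(n * n_rows)) if isinstance(n, float) else n
    all_rows = range(n_rows)
    if replace:
        # With replacement, the same row may be drawn more than once.
        chosen = [rng.randrange(n_rows) for _ in range(size)]
    else:
        chosen = rng.sample(all_rows, size)
    chosen_set = set(chosen)
    remainder = [row for row in all_rows if row not in chosen_set]
    return chosen, remainder


chosen, remainder = sample_rows(10, n=0.7, random_state=1)
boot, out_of_bag = sample_rows(10, n=10, replace=True, random_state=1)
```

Without replacement the two parts partition the rows; with replacement the second part holds the rows that were never drawn.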