Regression (regression)

Linear Regression

Linear regression is a statistical regression method that predicts the value of a continuous response (class) variable from the values of several predictors. The model assumes that the response variable is a linear combination of the predictors; the task of linear regression is therefore to fit the unknown coefficients.

Example

>>> import Orange
>>> from Orange.regression.linear import LinearRegressionLearner
>>> mpg = Orange.data.Table('auto-mpg')
>>> linear = LinearRegressionLearner()
>>> model = linear(mpg[40:110])
>>> print(model)
LinearModel LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> mpg[20]
Value('mpg', 25.0)
>>> model(mpg[0])
Value('mpg', 24.6)
class Orange.regression.linear.LinearRegressionLearner(preprocessors=None, fit_intercept=True)[source]

A wrapper for sklearn.linear_model._base.LinearRegression. The following is its documentation:

Ordinary least squares Linear Regression.

LinearRegression fits a linear model with coefficients w = (w1, ..., wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.

class Orange.regression.linear.RidgeRegressionLearner(alpha=1.0, fit_intercept=True, copy_X=True, max_iter=None, tol=0.001, solver='auto', preprocessors=None)[source]

A wrapper for sklearn.linear_model._ridge.Ridge. The following is its documentation:

Linear least squares with l2 regularization.

Minimizes the objective function:

||y - Xw||^2_2 + alpha * ||w||^2_2

This model solves a regression model where the loss function is the linear least squares function and regularization is given by the l2-norm. Also known as Ridge Regression or Tikhonov regularization. This estimator has built-in support for multi-variate regression (i.e., when y is a 2d-array of shape (n_samples, n_targets)).

Read more in the User Guide.
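A minimal usage sketch (not part of the scikit-learn documentation above); it assumes the bundled housing dataset and follows the same learner/model pattern as the linear regression example:

>>> from Orange.data import Table
>>> from Orange.regression.linear import RidgeRegressionLearner
>>> housing = Table('housing')
>>> ridge = RidgeRegressionLearner(alpha=1.0)  # alpha sets the strength of the l2 penalty
>>> model = ridge(housing)
>>> predictions = model(housing[:5])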

class Orange.regression.linear.LassoRegressionLearner(alpha=1.0, fit_intercept=True, precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, preprocessors=None)[source]

A wrapper for sklearn.linear_model._coordinate_descent.Lasso. The following is its documentation:

Linear Model trained with L1 prior as regularizer (aka the Lasso).

The optimization objective for Lasso is:

(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

Technically the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio=1.0 (no L2 penalty).

Read more in the User Guide.
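As an illustrative sketch (not part of the documentation above), a larger alpha drives more coefficients to exactly zero; assuming the wrapped scikit-learn estimator is exposed as skl_model, as the LinearModel(skl_model) constructor below suggests, the fitted weights can be inspected directly:

>>> import numpy as np
>>> from Orange.data import Table
>>> from Orange.regression.linear import LassoRegressionLearner
>>> housing = Table('housing')
>>> model = LassoRegressionLearner(alpha=1.0)(housing)
>>> # count the coefficients the l1 penalty pushed to zero
>>> n_zero = int(np.sum(model.skl_model.coef_ == 0))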

class Orange.regression.linear.SGDRegressionLearner(loss='squared_error', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=5, tol=0.001, shuffle=True, epsilon=0.1, n_jobs=1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, class_weight=None, warm_start=False, average=False, preprocessors=None)[source]

A wrapper for sklearn.linear_model._stochastic_gradient.SGDRegressor. The following is its documentation:

Linear model fitted by minimizing a regularized empirical loss with SGD.

SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate).

The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.

This implementation works with data represented as dense numpy arrays of floating point values for the features.

Read more in the User Guide.
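A hedged sketch of the wrapper's use (the parameter values are illustrative, not recommendations):

>>> from Orange.data import Table
>>> from Orange.regression.linear import SGDRegressionLearner
>>> housing = Table('housing')
>>> sgd = SGDRegressionLearner(learning_rate='invscaling', eta0=0.01, power_t=0.25)
>>> model = sgd(housing)
>>> predictions = model(housing[:5])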

class Orange.regression.linear.LinearModel(skl_model)[source]

Polynomial

Polynomial model is a wrapper that constructs polynomial features of a specified degree and learns a model on them.

class Orange.regression.linear.PolynomialLearner(learner=LinearRegressionLearner(), degree=2, preprocessors=None, include_bias=True)[source]

Generate polynomial features and learn a prediction model

Parameters:
  • learner (LearnerRegression) -- learner to be fitted on the transformed features

  • degree (int) -- degree of the polynomial used

  • preprocessors (List[Preprocessor]) -- preprocessors to be applied on the data before learning
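An illustrative sketch (assuming the bundled housing dataset) that expands the features to degree 2 and fits ridge regression on them:

>>> from Orange.data import Table
>>> from Orange.regression.linear import PolynomialLearner, RidgeRegressionLearner
>>> housing = Table('housing')
>>> poly = PolynomialLearner(learner=RidgeRegressionLearner(alpha=1.0), degree=2)
>>> model = poly(housing)
>>> predictions = model(housing[:5])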

Mean

Mean model predicts the same value (usually the distribution mean) for all data instances. Its accuracy can serve as a baseline for other regression models.

The model learner (MeanLearner) computes the mean of the given data or distribution. The model is stored as an instance of MeanModel.

Example

>>> from Orange.data import Table
>>> from Orange.regression import MeanLearner
>>> data = Table('auto-mpg')
>>> learner = MeanLearner()
>>> model = learner(data)
>>> print(model)
MeanModel(23.51457286432161)
>>> model(data[:4])
array([ 23.51457286,  23.51457286,  23.51457286,  23.51457286])
class Orange.regression.MeanLearner(preprocessors=None)[source]

Fit a regression model that returns the average response (class) value.

fit_storage(data)[source]

Construct a MeanModel by computing the mean value of the given data.

Parameters:

data (Orange.data.Table) -- data table

Returns:

regression model, which always returns mean value

Return type:

MeanModel

Random Forest

class Orange.regression.RandomForestRegressionLearner(n_estimators=10, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, preprocessors=None)[source]

A wrapper for sklearn.ensemble._forest.RandomForestRegressor. The following is its documentation:

A random forest regressor.

A random forest is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

For a comparison between tree-based ensemble models, see the scikit-learn example comparing random forests and histogram-based gradient boosting.

Read more in the User Guide.
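A minimal usage sketch (the parameter values are illustrative, not recommendations):

>>> from Orange.data import Table
>>> from Orange.regression import RandomForestRegressionLearner
>>> housing = Table('housing')
>>> forest = RandomForestRegressionLearner(n_estimators=50, random_state=0)
>>> model = forest(housing)
>>> predictions = model(housing[:5])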

Simple Random Forest

class Orange.regression.SimpleRandomForestLearner(n_estimators=10, min_instances=2, max_depth=1024, max_majority=1.0, skip_prob='sqrt', seed=42)[source]

A random forest regressor, optimized for speed. Trees in the forest are constructed with SimpleTreeLearner classification trees.

Parameters:
  • n_estimators (int, optional (default = 10)) -- Number of trees in the forest.

  • min_instances (int, optional (default = 2)) -- Minimal number of data instances in leaves. When growing the tree, new nodes are not introduced if they would result in leaves with fewer instances than min_instances. Instance count is weighted.

  • max_depth (int, optional (default = 1024)) -- Maximal depth of tree.

  • max_majority (float, optional (default = 1.0)) -- Maximal proportion of majority class. When this is exceeded, induction stops (only used for classification).

  • skip_prob (string, optional (default = "sqrt")) --

    Data attribute will be skipped with probability skip_prob.

    • if float, then skip attribute with this probability.

    • if "sqrt", then skip_prob = 1 - sqrt(n_features) / n_features

    • if "log2", then skip_prob = 1 - log2(n_features) / n_features

  • seed (int, optional (default = 42)) -- Random seed.

fit_storage(data)[source]

The default implementation of fit_storage calls fit. Derived classes must define either fit_storage or fit.
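A minimal usage sketch (illustrative parameter values):

>>> from Orange.data import Table
>>> from Orange.regression import SimpleRandomForestLearner
>>> housing = Table('housing')
>>> forest = SimpleRandomForestLearner(n_estimators=10, skip_prob='sqrt')
>>> model = forest(housing)
>>> predictions = model(housing[:5])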

Regression Tree

Orange includes two implementations of regression trees: a home-grown one and one from scikit-learn. The former properly handles multinomial and missing values, and the latter is faster.

class Orange.regression.TreeLearner(*args, binarize=False, min_samples_leaf=1, min_samples_split=2, max_depth=None, **kwargs)[source]

Tree inducer with proper handling of nominal attributes and binarization.

The inducer can handle missing values of attributes and the target. For discrete attributes with more than two possible values, each value can get a separate branch (binarize=False, default), or values can be grouped into two groups (binarize=True).

The tree growth can be limited by the required number of instances for internal nodes and for leaves, and by the maximal depth of the tree.

If the tree is not binary, it can contain zero-branches.

Parameters:
  • binarize -- if True the inducer will find optimal split into two subsets for values of discrete attributes. If False (default), each value gets its branch.

  • min_samples_leaf -- the minimal number of data instances in a leaf

  • min_samples_split -- the minimal number of data instances required to split a node into subgroups

  • max_depth -- the maximal depth of the tree

Return type:

instance of OrangeTreeModel

fit_storage(data)[source]

The default implementation of fit_storage calls fit. Derived classes must define either fit_storage or fit.
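A minimal usage sketch of the home-grown tree (illustrative parameter values):

>>> from Orange.data import Table
>>> from Orange.regression import TreeLearner
>>> housing = Table('housing')
>>> tree = TreeLearner(binarize=False, min_samples_leaf=5, max_depth=8)
>>> model = tree(housing)
>>> predictions = model(housing[:5])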

class Orange.regression.SklTreeRegressionLearner(criterion='squared_error', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features=None, random_state=None, max_leaf_nodes=None, preprocessors=None)[source]

A wrapper for sklearn.tree._classes.DecisionTreeRegressor. The following is its documentation:

A decision tree regressor.

Read more in the User Guide.

Neural Network

class Orange.regression.NNRegressionLearner(hidden_layer_sizes=(100,), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, preprocessors=None)[source]

A wrapper for Orange.regression.neural_network.MLPRegressorWCallback. The following is its documentation:

Multi-layer Perceptron regressor.

This model optimizes the squared error using LBFGS or stochastic gradient descent.

New in version 0.18.
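A minimal usage sketch (illustrative network size and iteration budget):

>>> from Orange.data import Table
>>> from Orange.regression import NNRegressionLearner
>>> housing = Table('housing')
>>> nn = NNRegressionLearner(hidden_layer_sizes=(50, 50), max_iter=500)
>>> model = nn(housing)
>>> predictions = model(housing[:5])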

Gradient Boosted Trees

class Orange.regression.gb.GBRegressor(loss='squared_error', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, presort='deprecated', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0, preprocessors=None)[source]

A wrapper for sklearn.ensemble._gb.GradientBoostingRegressor. The following is its documentation:

Gradient Boosting for regression.

This estimator builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.

sklearn.ensemble.HistGradientBoostingRegressor is a much faster variant of this algorithm for intermediate datasets (n_samples >= 10_000).

Read more in the User Guide.
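A minimal usage sketch (illustrative parameter values):

>>> from Orange.data import Table
>>> from Orange.regression.gb import GBRegressor
>>> housing = Table('housing')
>>> gb = GBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
>>> model = gb(housing)
>>> predictions = model(housing[:5])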

class Orange.regression.catgb.CatGBRegressor(iterations=None, learning_rate=None, depth=None, l2_leaf_reg=None, model_size_reg=None, rsm=None, loss_function=None, border_count=None, feature_border_type=None, per_float_feature_quantization=None, input_borders=None, output_borders=None, fold_permutation_block=None, od_pval=None, od_wait=None, od_type=None, nan_mode=None, counter_calc_method=None, leaf_estimation_iterations=None, leaf_estimation_method=None, thread_count=None, random_seed=None, use_best_model=None, verbose=False, logging_level=None, metric_period=None, ctr_leaf_count_limit=None, store_all_simple_ctr=None, max_ctr_complexity=None, has_time=None, allow_const_label=None, classes_count=None, class_weights=None, one_hot_max_size=None, random_strength=None, name=None, ignored_features=None, train_dir='/home/docs/.cache/Orange/3.36.2', custom_loss=None, custom_metric=None, eval_metric=None, bagging_temperature=None, save_snapshot=None, snapshot_file=None, snapshot_interval=None, fold_len_multiplier=None, used_ram_limit=None, gpu_ram_part=None, allow_writing_files=False, final_ctr_computation_mode=None, approx_on_full_history=None, boosting_type=None, simple_ctr=None, combinations_ctr=None, per_feature_ctr=None, task_type=None, device_config=None, devices=None, bootstrap_type=None, subsample=None, sampling_unit=None, dev_score_calc_obj_block_size=None, max_depth=None, n_estimators=None, num_boost_round=None, num_trees=None, colsample_bylevel=None, random_state=None, reg_lambda=None, objective=None, eta=None, max_bin=None, scale_pos_weight=None, gpu_cat_features_storage=None, data_partition=None, metadata=None, early_stopping_rounds=None, cat_features=None, grow_policy=None, min_data_in_leaf=None, min_child_samples=None, max_leaves=None, num_leaves=None, score_function=None, leaf_estimation_backtracking=None, ctr_history_unit=None, monotone_constraints=None, feature_weights=None, penalties_coefficient=None, first_feature_use_penalties=None, model_shrink_rate=None, model_shrink_mode=None, langevin=None, diffusion_temperature=None, posterior_sampling=None, boost_from_average=None, text_features=None, tokenizers=None, dictionaries=None, feature_calcers=None, text_processing=None, preprocessors=None)[source]

A wrapper for catboost.core.CatBoostRegressor. The following is its documentation:

Implementation of the scikit-learn API for CatBoost regression.

class Orange.regression.xgb.XGBRegressor(max_depth=None, learning_rate=None, n_estimators=100, verbosity=None, objective='reg:squarederror', booster=None, tree_method=None, n_jobs=None, gamma=None, min_child_weight=None, max_delta_step=None, subsample=None, colsample_bytree=None, colsample_bylevel=None, colsample_bynode=None, reg_alpha=None, reg_lambda=None, scale_pos_weight=None, base_score=None, random_state=None, missing=nan, num_parallel_tree=None, monotone_constraints=None, interaction_constraints=None, importance_type='gain', gpu_id=None, validate_parameters=None, preprocessors=None)[source]

A wrapper for xgboost.sklearn.XGBRegressor. The following is its documentation:

Implementation of the scikit-learn API for XGBoost regression. See the XGBoost documentation on its scikit-learn estimator interface for more information.

class Orange.regression.xgb.XGBRFRegressor(max_depth=None, learning_rate=None, n_estimators=100, verbosity=None, objective='reg:squarederror', booster=None, tree_method=None, n_jobs=None, gamma=None, min_child_weight=None, max_delta_step=None, subsample=None, colsample_bytree=None, colsample_bylevel=None, colsample_bynode=None, reg_alpha=None, reg_lambda=None, scale_pos_weight=None, base_score=None, random_state=None, missing=nan, num_parallel_tree=None, monotone_constraints=None, interaction_constraints=None, importance_type='gain', gpu_id=None, validate_parameters=None, preprocessors=None)[source]

A wrapper for xgboost.sklearn.XGBRFRegressor. The following is its documentation:

scikit-learn API for XGBoost random forest regression. See the XGBoost documentation on its scikit-learn estimator interface for more information.
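A minimal usage sketch for either XGBoost wrapper (assumes the optional xgboost package is installed; parameter values are illustrative):

>>> from Orange.data import Table
>>> from Orange.regression.xgb import XGBRegressor
>>> housing = Table('housing')
>>> booster = XGBRegressor(n_estimators=100, max_depth=4)
>>> model = booster(housing)
>>> predictions = model(housing[:5])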

Curve Fit

class Orange.regression.curvefit.CurveFitLearner(expression: Callable | Expression | str, parameters_names: List[str] | None = None, features_names: List[str] | None = None, available_feature_names: List[str] | None = None, functions: List[str] | None = None, sanitizer: Callable | None = None, env: Dict[str, Any] | None = None, p0: List | Dict | None = None, bounds: Tuple | Dict = (-inf, inf), preprocessors=None)[source]

Fit a function to data. It uses scipy.optimize.curve_fit to find the optimal values of the parameters.

Parameters:
  • expression (callable or str) -- A modeling function. If callable, it must take the independent variable as the first argument and the parameters to fit as separate remaining arguments. If a string, a lambda function is created using the expression, available_feature_names, functions and env attributes. The expression should be a string if the model is to be pickled.

  • parameters_names (list of str) -- List of parameter names. Only needed when the expression is callable.

  • features_names (list of str) -- List of feature names. Only needed when the expression is callable.

  • available_feature_names (list of str) -- List of all available feature names. Only needed when the expression is a string. Needed to distinguish between parameters and features when translating the expression into the lambda.

  • functions (list of str) -- List of all available functions. Only needed when the expression is a string. Needed to distinguish between parameters and functions when translating the expression into the lambda.

  • sanitizer (callable) -- Function for sanitizing names.

  • env (dict) -- An environment to capture in the lambda's closure.

  • p0 (list of floats, optional) -- Initial guess for the parameters.

  • bounds (2-tuple of array_like, optional) -- Lower and upper bounds on parameters.

  • preprocessors (tuple of Orange preprocessors, optional) -- The preprocessors that will be used when data is passed to the learner.

Examples

>>> import numpy as np
>>> from Orange.data import Table
>>> from Orange.regression import CurveFitLearner
>>> data = Table("housing")
>>> # example with callable expression
>>> cfun = lambda x, a, b, c: a * np.exp(-b * x[:, 0] * x[:, 1]) + c
>>> learner = CurveFitLearner(cfun, ["a", "b", "c"], ["CRIM", "LSTAT"])
>>> model = learner(data)
>>> pred = model(data)
>>> coef = model.coefficients
>>> # example with str expression
>>> sfun = "a * exp(-b * CRIM * LSTAT) + c"
>>> names = [a.name for a in data.domain.attributes]
>>> learner = CurveFitLearner(sfun, available_feature_names=names,
...                           functions=["exp"])
>>> model = learner(data)
>>> pred = model(data)
>>> coef = model.coefficients
preprocessors = [HasClass(), RemoveNaNColumns(), Impute()]

A sequence of data preprocessors to apply on data prior to fitting the model

fit_storage(data: Table) → CurveFitModel[source]

The default implementation of fit_storage calls fit. Derived classes must define either fit_storage or fit.