Regression

From the interface point of view, regression methods in Orange are very similar to classification. Both are intended for supervised data mining and require class-labeled data. Just like in classification, regression is implemented with learners and regression models (regressors). Regression learners are objects that accept data and return regressors. Regression models are given data instances and predict the value of the continuous class:

import Orange

# induce a linear regression model from the housing price data
data = Orange.data.Table("housing")
learner = Orange.regression.linear.LinearRegressionLearner()
model = learner(data)

# compare predicted and observed values on the first three instances
print "pred obs"
for d in data[:3]:
    print "%.1f %.1f" % (model(d), d.get_class())

Handful of Regressors

Let us start with regression trees. Below is an example script that builds a regression tree from the housing price data and prints it out in textual form:

import Orange

data = Orange.data.Table("housing.tab")
tree = Orange.regression.tree.TreeLearner(data, m_pruning=2.0, min_instances=20)
print tree.to_string()

The script outputs the tree:

RM<=6.941: 19.9
RM>6.941
|    RM<=7.437
|    |    CRIM>7.393: 14.4
|    |    CRIM<=7.393
|    |    |    DIS<=1.886: 45.7
|    |    |    DIS>1.886: 32.7
|    RM>7.437
|    |    TAX<=534.500: 45.9
|    |    TAX>534.500: 21.9
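
The size of the tree is controlled by the learner's arguments. As a side note not from the original text, the sketch below varies min_instances, the parameter used above that stops the splitting of small data subsets, to show that larger values yield smaller trees:

import Orange

data = Orange.data.Table("housing")
for min_inst in (10, 50, 100):
    tree = Orange.regression.tree.TreeLearner(data, m_pruning=2.0,
                                              min_instances=min_inst)
    print "=== min_instances=%d ===" % min_inst
    print tree.to_string()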

Below, a few other regressors are initialized and their predictions compared on a few test instances sampled from the housing price data set:

import random
import Orange

# split the data into a small random test sample and a training set;
# results will vary with the sample
data = Orange.data.Table("housing")
test = Orange.data.Table(random.sample(data, 5))
train = Orange.data.Table([d for d in data if d not in test])

lin = Orange.regression.linear.LinearRegressionLearner(train)
lin.name = "lin"
rf = Orange.ensemble.forest.RandomForestLearner(train)
rf.name = "rf"
tree = Orange.regression.tree.TreeLearner(train)
tree.name = "tree"

models = [lin, rf, tree]

print "y    " + " ".join("%-4s" % l.name for l in models)
for d in test[:3]:
    print "%.1f" % (d.get_class()),
    print " ".join("%4.1f" % model(d) for model in models)

It looks like the housing prices are not that hard to predict (the test instances are sampled at random, so your numbers will differ):

y    lin  rf   tree
12.7 11.3 15.3 19.1
13.8 20.2 14.1 13.1
19.3 20.8 20.7 23.3

Cross Validation

Just like for classification, the same evaluation module (Orange.evaluation) is available for regression. Its testing submodule includes procedures such as cross-validation and leave-one-out testing, and functions in the scoring submodule can assess the accuracy of the models from the test results:

import Orange

data = Orange.data.Table("housing.tab")

lin = Orange.regression.linear.LinearRegressionLearner()
lin.name = "lin"
rf = Orange.ensemble.forest.RandomForestLearner()
rf.name = "rf"
tree = Orange.regression.tree.TreeLearner(m_pruning=2.0)
tree.name = "tree"

learners = [lin, rf, tree]

res = Orange.evaluation.testing.cross_validation(learners, data, folds=5)
rmse = Orange.evaluation.scoring.RMSE(res)

print "Learner  RMSE"
for learner, score in zip(learners, rmse):
    print "{0:8} {1:.2f}".format(learner.name, score)

Random forest has the lowest root mean squared error:

Learner  RMSE
lin      4.83
rf       3.73
tree     5.10
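
RMSE is just one of the available scores. Below is a hedged sketch that reports a few others from the same cross-validation run; it assumes MAE and R2 are present in Orange.evaluation.scoring, as the Orange 2.7 scoring documentation lists them alongside RMSE:

import Orange

data = Orange.data.Table("housing")
learners = [Orange.regression.linear.LinearRegressionLearner(),
            Orange.regression.tree.TreeLearner(m_pruning=2.0)]
res = Orange.evaluation.testing.cross_validation(learners, data, folds=5)

# MAE and R2 are assumed available, per the Orange 2.7 scoring docs
for name, scorer in [("RMSE", Orange.evaluation.scoring.RMSE),
                     ("MAE", Orange.evaluation.scoring.MAE),
                     ("R2", Orange.evaluation.scoring.R2)]:
    print name.ljust(5), " ".join("%.2f" % s for s in scorer(res))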