This is documentation for Orange 2.7. For the latest documentation, see Orange 3.
Selection (selection)¶
The feature selection module contains several utility functions for selecting features based on their scores, normally obtained in classification or regression problems. A typical example is the function select, which returns a subset of the highest-scored features:
import Orange
voting = Orange.data.Table("voting")
n = 3
ma = Orange.feature.scoring.score_all(voting)
best = Orange.feature.selection.top_rated(ma, n)
print 'Best %d features:' % n
for s in best:
    print s
The script outputs:
Best 3 features:
physician-fee-freeze
el-salvador-aid
synfuels-corporation-cutback
The module also includes a learner that incorporates feature subset selection.
New in version 2.7.1: select, select_above_threshold and select_relief now preserve the domain’s meta attributes and class_vars.
Functions for feature subset selection¶
- static selection.top_rated(scores, n, highest_best=True)¶
Return n top-rated features from the list of scores.
Parameters: - scores (list) – A list such as the one returned by score_all()
- n (int) – Number of features to select.
- highest_best (bool) – If True, features with higher scores are preferred.
Return type: list
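The selection logic of top_rated can be sketched in plain Python (an illustrative re-implementation only, not the Orange code; it assumes scores is a list of (feature name, score) pairs like the one produced by score_all()):

```python
def top_rated_sketch(scores, n, highest_best=True):
    """Return the names of the n best-scored features.

    scores is assumed to be a list of (feature_name, score) pairs.
    """
    # Sort by score; the highest score comes first when highest_best is True.
    ordered = sorted(scores, key=lambda fs: fs[1], reverse=highest_best)
    # Keep only the feature names of the n top-rated features.
    return [name for name, _ in ordered[:n]]
```

For example, top_rated_sketch([("a", 0.1), ("b", 0.9), ("c", 0.5)], 2) yields ["b", "c"].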
- static selection.above_threshold(scores, threshold=0.0)¶
Return the features (without their scores) whose scores are at or above the specified threshold.
Parameters: - scores (list) – A list such as the one returned by score_all()
- threshold (float) – Threshold for selection.
Return type: list
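The thresholding itself is a simple filter over the score list; a minimal sketch, again assuming (feature name, score) pairs rather than Orange's own score objects:

```python
def above_threshold_sketch(scores, threshold=0.0):
    """Return names of features whose score is at or above threshold.

    scores is assumed to be a list of (feature_name, score) pairs;
    note that the comparison is inclusive (>=), matching the description
    "above or equal to a specified threshold".
    """
    return [name for name, score in scores if score >= threshold]
```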
- static selection.select(data, scores, n)¶
Construct and return a new data table that includes the class and only the n best features from the list scores.
Parameters: - data (Orange.data.Table) – a data table
- scores (list) – a list such as the one returned by score_all
- n (int) – number of features to select
Return type: Orange.data.Table
- static selection.select_above_threshold(data, scores, threshold=0.0)¶
Construct and return a new data table that includes the class and those features from the list returned by score_all whose scores are at or above the given threshold.
Parameters: - data (Orange.data.Table) – a data table
- scores (list) – a list such as the one returned by score_all
- threshold (float) – threshold for selection
Return type: Orange.data.Table
- static selection.select_relief(data, measure=Orange.feature.scoring.Relief(k=20, m=10), margin=0)¶
Iteratively remove the worst-scored feature until no feature has a score below the margin. The filter procedure was originally designed for measures such as Relief, which are context-dependent, i.e., removal of features may change the scores of the remaining features. The scores are therefore recomputed in each iteration.
Parameters: - data (Orange.data.Table) – a data table
- measure (Orange.feature.scoring.Score) – a feature scorer
- margin (float) – margin for removal
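The iterative removal described above can be sketched as follows. This is an illustrative outline only, not the Orange implementation: the hypothetical score_fn stands in for a context-dependent measure such as Relief, so it receives both the feature and the current feature set and is re-invoked after every removal:

```python
def select_relief_sketch(features, score_fn, margin=0.0):
    """Iteratively drop the worst-scored feature while its score is
    below margin, rescoring the remaining set after every removal.

    score_fn(feature, current_features) is a hypothetical scorer that may
    depend on which other features are still present (as Relief does).
    """
    features = list(features)
    while features:
        # Rescore every remaining feature in the current context.
        scored = [(f, score_fn(f, features)) for f in features]
        worst, worst_score = min(scored, key=lambda fs: fs[1])
        if worst_score >= margin:
            break  # no feature scores below the margin any more
        features.remove(worst)
    return features
```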
Learning with feature subset selection¶
- class Orange.feature.selection.FilteredLearner(base_learner, filter=FilterAboveThreshold(), name=filtered)¶
A feature selection wrapper around a base learner. When provided with data, this learner applies the given feature selection method and then calls the base learner.
Here is an example of how to build a wrapper around naive Bayesian learner and use it on a data set:
nb = Orange.classification.bayes.NaiveBayesLearner()
learner = Orange.feature.selection.FilteredLearner(nb,
    filter=Orange.feature.selection.FilterBestN(n=5), name='filtered')
classifier = learner(data)
- class Orange.feature.selection.FilteredClassifier(**kwds)¶
A classifier returned by FilteredLearner.
Class wrappers for selection functions¶
- class Orange.feature.selection.FilterAboveThreshold(data=None, measure=Orange.feature.scoring.Relief(k=20, m=50), threshold=0.0)¶
A wrapper around select_above_threshold; the constructor stores the parameters of the feature selection procedure, which are applied when the instance is called with the actual data.
Parameters: - measure (Orange.feature.scoring.Score) – a feature scorer
- threshold (float) – threshold for selection. Defaults to 0.
- __call__(data)¶
Return a data table with the features whose scores are above the given threshold.
Parameters: data (Orange.data.Table) – data table
Below are a few examples of how to use this class:
>>> filter = Orange.feature.selection.FilterAboveThreshold(threshold=.15)
>>> new_data = filter(data)
>>> new_data = Orange.feature.selection.FilterAboveThreshold(data)
>>> new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1)
>>> new_data = Orange.feature.selection.FilterAboveThreshold(data, threshold=.1,
...     measure=Orange.feature.scoring.Gini())
- class Orange.feature.selection.FilterBestN(data=None, measure=Orange.feature.scoring.Relief(k=20, m=50), n=5)¶
A wrapper around select; the constructor stores the filter parameters that are applied when the function is called.
Parameters: - measure (Orange.feature.scoring.Score) – a feature scorer
- n (int) – number of features to select
- class Orange.feature.selection.FilterRelief(data=None, measure=Orange.feature.scoring.Relief(k=20, m=50), margin=0)¶
A class wrapper around select_relief; the constructor stores the filter parameters that are applied when the instance is called.
Parameters: - measure (Orange.feature.scoring.Score) – a feature scorer
- margin (float) – margin for Relief scoring
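All three wrappers above share the same pattern: parameters are stored at construction time and the actual selection happens when the instance is called with data. A minimal sketch of this pattern, using plain Python structures (a hypothetical score_fn and data as a list of dicts) instead of Orange's scorers and tables:

```python
class FilterBestNSketch(object):
    """Illustrative sketch of the class-wrapper pattern used by
    FilterBestN and friends: store the parameters now, filter later."""

    def __init__(self, score_fn, n=5):
        # score_fn is a hypothetical scorer: data -> [(feature_name, score)]
        self.score_fn = score_fn
        self.n = n

    def __call__(self, data):
        # Score the features of the given data set and pick the n best.
        scores = self.score_fn(data)
        ordered = sorted(scores, key=lambda fs: fs[1], reverse=True)
        keep = set(name for name, _ in ordered[:self.n])
        # Keep only the selected columns (data modeled as a list of dicts).
        return [dict((k, v) for k, v in row.items() if k in keep)
                for row in data]
```

Construct once, then apply to any compatible data set, e.g. FilterBestNSketch(my_scorer, n=2)(data).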
Examples
The following script defines a new naive Bayes classifier that selects the five best features from the data set before learning. The new classifier is wrapped in a special class (see the Learners in Python lesson in the Orange Tutorial). The script compares this filtered learner with one that uses the complete set of features.
import Orange

class BayesFSS(object):
    def __new__(cls, examples=None, **kwds):
        learner = object.__new__(cls)
        if examples:
            learner.__init__(**kwds)
            return learner(examples)
        else:
            return learner

    def __init__(self, name='Naive Bayes with FSS', N=5):
        self.name = name
        self.N = N

    def __call__(self, table, weight=None):
        ma = Orange.feature.scoring.score_all(table)
        filtered = Orange.feature.selection.select(table, ma, self.N)
        model = Orange.classification.bayes.NaiveLearner(filtered)
        return BayesFSS_Classifier(classifier=model, N=self.N, name=self.name)

class BayesFSS_Classifier:
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=Orange.classification.Classifier.GetValue):
        return self.classifier(example, resultType)

# test the above wrapper on a data set
voting = Orange.data.Table("voting")
learners = (Orange.classification.bayes.NaiveLearner(name='Naive Bayes'),
            BayesFSS(name="with FSS"))
results = Orange.evaluation.testing.cross_validation(learners, voting)

# output the results
print "Learner      CA"
for i in range(len(learners)):
    print "%-12s %5.3f" % (learners[i].name, Orange.evaluation.scoring.CA(results)[i])
Interestingly, though not unexpectedly, feature subset selection helps. This is the output we get:
Learner CA
Naive Bayes 0.903
with FSS 0.940
We can do all of the above by wrapping the learner using FilteredLearner, thus creating an object assembled from a data filter and a base learner. When given a data table, this learner uses the attribute filter to construct a new data set and the base learner to construct a corresponding classifier. Attribute filters should be of a type like FilterAboveThreshold or FilterBestN, which can be initialized with arguments and later presented with data, returning a new, reduced data set.
The following code fragment replaces the bulk of the code from the previous example, and compares the naive Bayesian classifier to the same classifier when only the single most important attribute is used.
nb = Orange.classification.bayes.NaiveLearner()
fl = Orange.feature.selection.FilteredLearner(nb,
    filter=Orange.feature.selection.FilterBestN(n=1), name='filtered')
learners = (Orange.classification.bayes.NaiveLearner(name='bayes'), fl)
Now, let's decide to retain three features and observe how many times each attribute was used. Remember, 10-fold cross-validation constructs ten instances of each classifier, and each time FilteredLearner runs, a different set of features may be selected. Orange.evaluation.testing.cross_validation stores the classifiers in the results variable, and FilteredLearner returns a classifier that can report which features it used, so the code to do all this is quite short.
print "\nNumber of times attributes were used in cross-validation:"
attsUsed = {}
for i in range(10):
    for a in results.classifiers[i][1].atts():
        if a.name in attsUsed.keys():
            attsUsed[a.name] += 1
        else:
            attsUsed[a.name] = 1
for k in attsUsed.keys():
    print "%2d x %s" % (attsUsed[k], k)
Running selection-filtered-learner.py with three features selected each time a learner is run gives the following result:
Learner CA
bayes 0.903
filtered 0.956
Number of times features were used in cross-validation:
3 x el-salvador-aid
6 x synfuels-corporation-cutback
7 x adoption-of-the-budget-resolution
10 x physician-fee-freeze
4 x crime
References¶
- K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Proc. 9th Int'l Conf. on Machine Learning, pages 249-256, Aberdeen, 1992. Morgan Kaufmann Publishers.
- I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In F. Bergadano and L. De Raedt, editors, Proc. European Conf. on Machine Learning (ECML-94), pages 171-182. Springer-Verlag, 1994.
- R. Kohavi, G. John: Wrappers for Feature Subset Selection, Artificial Intelligence, 97 (1-2), pages 273-324, 1997