This is documentation for Orange 2.7. For the latest documentation, see Orange 3.

Imputation (imputation)

Imputation replaces missing feature values with appropriate values, for instance with the minimal values of features:

import Orange
bridges = Orange.data.Table("bridges")

imputer = Orange.feature.imputation.MinimalConstructor()
imputer = imputer(bridges)

print "Example with missing values"
print bridges[10]
print "Imputed values:"
print imputer(bridges[10])

imputed_bridges = imputer(bridges)
print imputed_bridges[10]

The output of this code is:

Example with missing values
['A', 1853, 'RR', ?, 2, 'N', 'DECK', 'WOOD', '?', 'S', 'WOOD']
Imputed values:
['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']
['A', 1853, 'RR', 804, 2, 'N', 'DECK', 'WOOD', 'SHORT', 'S', 'WOOD']

Imputers

Orange.feature.imputation.Constructor is the abstract root of a hierarchy of classes that accept training data and construct an instance of a class derived from Orange.feature.imputation.Imputer. When an Imputer is called with an Instance, it returns a new instance with the missing values imputed (leaving the original instance intact). When an Imputer is called with a Table, it returns a new data table with imputed values.
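The two-stage pattern described above can be sketched in plain Python, independent of Orange. This is only an illustration under assumed conventions: rows are lists, None marks a missing value, and FirstSeenConstructor and ListImputer are hypothetical names, not Orange classes.

```python
class ListImputer(object):
    """Hypothetical imputer: fills None entries from a list of defaults."""

    def __init__(self, defaults):
        self.defaults = defaults

    def __call__(self, data):
        # A table (list of rows) yields a new table; a single row yields a new row.
        if data and isinstance(data[0], list):
            return [self(row) for row in data]
        # Build a new list so that the original row is left intact.
        return [d if v is None else v for v, d in zip(data, self.defaults)]


class FirstSeenConstructor(object):
    """Hypothetical constructor: uses the first known value of each column."""

    def __call__(self, table):
        defaults = [None] * len(table[0])
        for row in table:
            for i, v in enumerate(row):
                if defaults[i] is None and v is not None:
                    defaults[i] = v
        return ListImputer(defaults)


# Made-up data: train the constructor once, then reuse the imputer.
table = [["A", None], ["B", 7], [None, 9]]
imputer = FirstSeenConstructor()(table)
row = table[0]
imputed_row = imputer(row)      # new row, original untouched
imputed_table = imputer(table)  # new table
```

The constructor is the only part that sees the training data; the imputer it returns carries the learned defaults and can be applied to any later row.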

class Orange.feature.imputation.Constructor
impute_class

Indicates whether to impute the class value. Defaults to True.

Simple imputation

Simple imputation always imputes the same value for a particular feature, disregarding the values of other features.

class Orange.feature.imputation.Defaults
defaults

A data instance (Instance) with the default values that are imputed in place of missing values. Features whose values are left unspecified are not imputed. The instances to be imputed must be from the same Domain as defaults.

__init__(domain)

Construct a new instance of Defaults and set defaults to a data instance from the given domain with all values undefined.

The following example constructs an imputer that sets the unknown bridge lengths to 1234 and leaves all other values as they are:

imputer = Orange.feature.imputation.Defaults(bridges.domain)
imputer.defaults["LENGTH"] = 1234

__init__(values)

Construct a new instance of the class and set the defaults to the given values. The constructor does not copy the data instance, so if the instance is not constructed specifically for the imputer, the caller should make a copy (e.g. by calling Orange.feature.imputation.Defaults(Orange.data.Instance(inst)) and not Orange.feature.imputation.Defaults(inst)).

Instances of Orange.feature.imputation.Defaults are returned by MinimalConstructor, MaximalConstructor and AverageConstructor.

For continuous features, they impute the smallest, the largest or the average value encountered in the training instances, respectively. For discrete features, they impute the lowest value (the one with index 0, e.g. attr.values[0]), the highest value (attr.values[-1]), or the most common value encountered in the data, respectively. If the values of discrete features are ordered according to their impact on the class (for example, possible values for symptoms of some disease can be ordered according to their seriousness), the minimal and maximal imputers will represent optimistic and pessimistic imputations.
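The selection rules above can be written out as a small plain-Python sketch. It does not use Orange; simple_default is a hypothetical helper, None marks a missing value, and the sample data is made up for illustration.

```python
from collections import Counter


def simple_default(column, values=None, kind="min"):
    """Return the value a simple imputer would use for one feature.

    column -- observed values, with None marking a missing value
    values -- ordered list of possible values for a discrete feature,
              or None for a continuous feature
    kind   -- "min", "max" or "avg"
    """
    known = [v for v in column if v is not None]
    if values is None:
        # Continuous feature: smallest, largest or average observed value.
        if kind == "min":
            return min(known)
        if kind == "max":
            return max(known)
        return sum(known) / float(len(known))
    # Discrete feature: first value, last value or most common observed value.
    if kind == "min":
        return values[0]
    if kind == "max":
        return values[-1]
    return Counter(known).most_common(1)[0][0]


# Made-up sample columns.
lengths = [1000, None, 1200, 1400]
spans = ["MEDIUM", "MEDIUM", None, "LONG"]
span_values = ["SHORT", "MEDIUM", "LONG"]
```

Note that for discrete features the "min" and "max" choices depend only on the declared order of values, not on the data, which is why the ordering of values matters for optimistic and pessimistic imputation.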

Random imputation

class Orange.feature.imputation.Random

Imputes random values. The corresponding constructor is RandomConstructor.

impute_class

Tells whether to impute the class values or not. Defaults to True.

deterministic

If true (defaults to False), the random generator is initialized for each instance using the instance’s hash value as a seed. As a result, identical instances are always imputed with the same (random) values.
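The effect of the deterministic flag can be illustrated with a plain-Python sketch. It only mimics the idea of seeding from the instance's hash and is not Orange's implementation; random_impute and the choices argument are hypothetical.

```python
import random


def random_impute(row, choices, deterministic=False):
    """Fill None entries of row with random picks from choices[i]."""
    if deterministic:
        # Seed a private generator from the row's content, so that equal
        # rows always receive the same imputed values.
        rng = random.Random(hash(tuple(row)))
    else:
        rng = random.Random()
    return [rng.choice(choices[i]) if v is None else v
            for i, v in enumerate(row)]


# Made-up row with one missing value and its candidate values.
row = [3, None, 5]
choices = [[0], [10, 20, 30], [0]]
first = random_impute(row, choices, deterministic=True)
second = random_impute(list(row), choices, deterministic=True)
```

Here first and second are equal because both rows hash to the same seed; with deterministic=False, repeated calls may differ.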

Model-based imputation

class Orange.feature.imputation.ModelConstructor

Model-based imputers learn to predict a feature’s value from the values of other features. ModelConstructor is given two learning algorithms and constructs a classifier for each attribute. The constructed imputer Model stores a list of classifiers that are used for imputation.

learner_discrete, learner_continuous

Learners for discrete and for continuous attributes. If either of them is missing, the attributes of the corresponding type will not get imputed.

use_class

Tells whether the imputer can use the class attribute. Defaults to False. It is useful in more complex designs in which one imputer is used on learning instances, where it uses the class value, and a second imputer on testing instances, where class is not available.

class Orange.feature.imputation.Model
models

A list of classifiers, each corresponding to one attribute to be imputed. The class_var of each model should equal the corresponding attribute of the instances. If an element is None, the corresponding attribute’s values are not imputed.
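How such a list of per-attribute models is applied can be sketched in plain Python. This is an assumption-laden illustration rather than Orange code: rows are lists, None marks a missing value, and each model is any callable that takes the row and returns a value.

```python
def impute_with_models(row, models):
    """Return a copy of row with missing entries predicted by models.

    models[i] is a callable taking the whole row and returning a value
    for attribute i, or None if attribute i should not be imputed.
    """
    result = list(row)
    for i, model in enumerate(models):
        if row[i] is None and model is not None:
            result[i] = model(row)
    return result


# Made-up models: attribute 0 is never imputed, the others get constants.
models = [None, lambda row: 2.0, lambda row: "THROUGH"]
all_missing = impute_with_models([None, None, None], models)
partly_known = impute_with_models(["A", 4.0, None], models)
```

Known values are never overwritten, and an attribute whose model is None keeps its missing value.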

Examples

Examples are taken from imputation-complex.py. The following imputer predicts the missing attribute values using classification and regression trees with a minimum of 20 examples in a leaf.

imputer = Orange.feature.imputation.ModelConstructor()
imputer.learner_continuous = imputer.learner_discrete = Orange.classification.tree.TreeLearner(min_subset=20)
imputer = imputer(bridges)

A common setup, where different learning algorithms are used for discrete and continuous features, is to use NaiveLearner for discrete and MeanLearner (which just remembers the average) for continuous attributes:

imputer = Orange.feature.imputation.ModelConstructor()
imputer.learner_continuous = Orange.regression.mean.MeanLearner()
imputer.learner_discrete = Orange.classification.bayes.NaiveLearner()
imputer = imputer(bridges)

To construct a user-defined Model:

imputer = Orange.feature.imputation.Model()
imputer.models = [None] * len(bridges.domain)
imputer.models[bridges.domain.index("LANES")] = Orange.classification.ConstantClassifier(2.0)
tord = Orange.classification.ConstantClassifier(Orange.data.Value(bridges.domain["T-OR-D"], "THROUGH"))
imputer.models[bridges.domain.index("T-OR-D")] = tord

First, models is initialized to a list of empty models (None), one per feature. The continuous feature “LANES” is imputed with the value 2 using ConstantClassifier. A float must be given, because integer values are interpreted as indexes of discrete features. The discrete feature “T-OR-D” is imputed using a ConstantClassifier that is given the value “THROUGH” as an argument.

Feature “LENGTH” is computed with a regression tree induced from “MATERIAL”, “SPAN” and “ERECTED” (feature “LENGTH” is used as the class attribute here). The domain is initialized by giving a list of feature names and, as an additional argument, the domain in which Orange will look up the features.

len_domain = Orange.data.Domain(["MATERIAL", "SPAN", "ERECTED", "LENGTH"], bridges.domain)
len_data = Orange.data.Table(len_domain, bridges)
len_tree = Orange.classification.tree.TreeLearner(len_data, min_subset=20)
imputer.models[bridges.domain.index("LENGTH")] = len_tree
print len_tree

This is what the inferred tree should look like:

SPAN=SHORT: 1158
SPAN=LONG: 1907
SPAN=MEDIUM
|    ERECTED<1908.500: 1325
|    ERECTED>=1908.500: 1528

Wooden bridges and walkways are short, while the others are mostly medium. This could be encoded in feature “SPAN” using ClassifierByLookupTable, which is faster than the Python function used here:

span_var = bridges.domain["SPAN"]
def compute_span(ex, rw):
    if ex["TYPE"] == "WOOD" or ex["PURPOSE"] == "WALK":
        return Orange.data.Value(span_var, "SHORT")
    else:
        return Orange.data.Value(span_var, "MEDIUM")

imputer.models[bridges.domain.index("SPAN")] = compute_span

If compute_span is written as a class, it must behave like a classifier: it accepts an instance and returns a value. The second argument tells what the caller expects the classifier to return: a value, a distribution or both. Currently, Model always expects values and the argument can be ignored.
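A class-based version of the same rule might look like the following plain-Python sketch. It is not tied to Orange: SpanRule is a hypothetical name, instances are plain dictionaries, and plain strings are returned where the function above returns Orange.data.Value objects.

```python
class SpanRule(object):
    """Hypothetical classifier-like object for the SPAN rule."""

    def __init__(self, short_value="SHORT", default_value="MEDIUM"):
        self.short_value = short_value
        self.default_value = default_value

    def __call__(self, instance, what=None):
        # `what` is accepted so the call signature matches a classifier,
        # but it is ignored because only a value is ever returned.
        if instance["TYPE"] == "WOOD" or instance["PURPOSE"] == "WALK":
            return self.short_value
        return self.default_value


rule = SpanRule()
wooden = rule({"TYPE": "WOOD", "PURPOSE": "RR"})
walkway = rule({"TYPE": "SUSPEN", "PURPOSE": "WALK"}, None)
other = rule({"TYPE": "SUSPEN", "PURPOSE": "RR"})
```

Writing the rule as a class makes it possible to keep configuration, such as the two returned values, on the object instead of in global variables.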

Missing values as special values

Missing values sometimes have a special meaning. Caution is needed when using such values in decision models. When the decision not to measure something (for example, performing a laboratory test on a patient) is based on the expert’s knowledge of the class value, such missing values clearly should not be used in models.

class Orange.feature.imputation.AsValueConstructor

It constructs AsValue that converts the instance into the new domain.

Constructs a new domain in which each discrete feature is replaced with a new feature that has one more value: “NA”. The new feature computes its values on the fly, copying the normal values from the old one and replacing the unknowns with “NA”.

For continuous attributes, it constructs a two-valued discrete attribute with values “def” and “undef”, telling whether the value is defined or not. The feature’s name will equal the original’s with “_def” appended. The original continuous feature will remain in the domain and its unknowns will be replaced by averages.

class Orange.feature.imputation.AsValue
domain

The domain with the new features constructed by AsValueConstructor.

defaults

Default values for continuous features.

The following code shows what the imputer actually does to the domain:

imputer = Orange.feature.imputation.AsValueConstructor(bridges)
original = bridges[19]
imputed = imputer(bridges[19])
print original.domain
print
print imputed.domain
print

for i in original.domain:
    print "%s: %s -> %s" % (original.domain[i].name, original[i], imputed[i.name]),
    if original.domain[i].var_type == Orange.feature.Type.Continuous:
        print "(%s)" % imputed[i.name + "_def"]
    else:
        print
print

The script’s output looks like this:

[RIVER, ERECTED, PURPOSE, LENGTH, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

[RIVER, ERECTED_def, ERECTED, PURPOSE, LENGTH_def, LENGTH, LANES_def, LANES, CLEAR-G, T-OR-D, MATERIAL, SPAN, REL-L, TYPE]

RIVER: M -> M
ERECTED: 1874 -> 1874 (def)
PURPOSE: RR -> RR
LENGTH: ? -> 1567 (undef)
LANES: 2 -> 2 (def)
CLEAR-G: ? -> NA
T-OR-D: THROUGH -> THROUGH
MATERIAL: IRON -> IRON
SPAN: ? -> NA
REL-L: ? -> NA
TYPE: SIMPLE-T -> SIMPLE-T

The two instances have the same attributes, with imputed having a few additional ones. Comparing original.domain[0] == imputed.domain[0] will result in False. While the names are the same, they represent different features. Writing imputed[i] would fail, since imputed has no attribute i, but it has an attribute with the same name. Using i.name to index the attributes of imputed will work, yet it is not fast. If used frequently, it is better to compute the index with imputed.domain.index(i.name).

For continuous features, there is an additional feature whose name has the suffix “_def”, which is accessible by i.name+"_def". The value of the first continuous feature “ERECTED” remains 1874, and the additional attribute “ERECTED_def” has value “def”. The undefined value in “LENGTH” is replaced by the average (1567) and the new attribute has value “undef”. The undefined discrete attribute “CLEAR-G” (and all other undefined discrete attributes) is assigned the value “NA”.
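The transformation described above can be mimicked in plain Python to make the rules explicit. This sketch does not build an Orange domain: as_value is a hypothetical helper, instances are dictionaries, None marks a missing value, and the averages are passed in rather than computed.

```python
def as_value(row, continuous, averages):
    """Encode missing values as special values.

    row        -- dict mapping feature name to value (None = missing)
    continuous -- set of names of continuous features
    averages   -- dict mapping continuous feature names to their averages
    """
    out = {}
    for name, value in row.items():
        if name in continuous:
            # Indicator feature plus the original, with gaps set to the average.
            out[name + "_def"] = "undef" if value is None else "def"
            out[name] = averages[name] if value is None else value
        else:
            # Discrete feature: unknowns become the extra value "NA".
            out[name] = "NA" if value is None else value
    return out


# A few features of the instance shown above; 1567 is the average length.
row = {"ERECTED": 1874, "LENGTH": None, "CLEAR-G": None, "MATERIAL": "IRON"}
encoded = as_value(row, {"ERECTED", "LENGTH"}, {"LENGTH": 1567})
```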

Using imputers

Imputation is also used by learning algorithms and other methods that are not capable of handling unknown values.

Imputer as a component

Learners that cannot handle missing values should provide a slot for an imputer constructor. An example of such a class is LogRegLearner with the attribute imputer_constructor, which imputes the average value by default. When given learning instances, LogRegLearner will pass them to imputer_constructor to get an imputer and use it to impute the missing values in the learning data. The imputed data is then used by the actual learning algorithm. When a classifier LogRegClassifier is constructed, the imputer is stored in its attribute imputer. During classification, the same imputer is used for imputation of missing values in (testing) instances.

Details may vary from algorithm to algorithm, but this is how the imputation is generally used. When writing user-defined learners, it is recommended to use imputation according to the described procedure.

The choice of the imputer depends on the problem domain. In this example the minimal value of each feature is imputed:

import Orange

lr = Orange.classification.logreg.LogRegLearner()
imputer = Orange.feature.imputation.MinimalConstructor

imlr = Orange.feature.imputation.ImputeLearner(base_learner=lr,
    imputer_constructor=imputer)

voting = Orange.data.Table("voting")
res = Orange.evaluation.testing.cross_validation([lr, imlr], voting)
CAs = Orange.evaluation.scoring.CA(res)

print "Without imputation: %5.3f" % CAs[0]
print "With imputation: %5.3f" % CAs[1]

The output of this code is:

Without imputation: 0.945
With imputation: 0.954

Note

Just one instance of LogRegLearner is constructed and then used twice in each fold. The first time it is given the original instances as they are and returns an instance of LogRegClassifier. The second time it is called by imlr, and the LogRegClassifier it returns gets wrapped into an ImputeClassifier. There is only one learner, which produces two different classifiers in each round of testing.

Wrappers for learning

In a learning/classification process, imputation is needed on two occasions. Before learning, the imputer needs to process the training instances. Afterwards, the imputer is called for each instance to be classified. For example, in cross validation, imputation should be done on training folds only. Imputing the missing values on all data and subsequently performing cross-validation will give overly optimistic results.
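The difference between imputing per fold and imputing all data up front can be seen in a small plain-Python sketch. It is not Orange code: the helper names are hypothetical, a single numeric column stands in for a data table, None marks a missing value, and the interleaved fold split is chosen only for brevity.

```python
def fit_mean(values):
    """Learn the imputation value (the mean) from known values only."""
    known = [v for v in values if v is not None]
    return sum(known) / float(len(known))


def impute(values, default):
    """Return a copy of values with None replaced by default."""
    return [default if v is None else v for v in values]


def impute_per_fold(values, n_folds):
    """Yield (train, test) pairs imputed with the training-fold mean only."""
    for k in range(n_folds):
        train = [v for i, v in enumerate(values) if i % n_folds != k]
        test = [v for i, v in enumerate(values) if i % n_folds == k]
        default = fit_mean(train)  # the test fold never influences this
        yield impute(train, default), impute(test, default)


# Made-up column with one missing value.
values = [1.0, None, 3.0, 5.0]
folds = list(impute_per_fold(values, 2))
```

In the second fold the missing test value is filled with 2.0, the mean of that fold's training values, rather than 3.0, the mean of the whole column. Using the whole-column mean would leak information from test instances into training.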

Most of Orange’s learning algorithms do not use imputers because they can appropriately handle the missing values. Bayesian classifier, for instance, simply skips the corresponding attributes in the formula, while classification/regression trees have components for handling the missing values in various ways. A wrapper is provided for learning algorithms that require imputed data.

class Orange.feature.imputation.ImputeLearner

Wraps a learner and performs data imputation before learning.

This learner returns either an instance of ImputeLearner or, if called with Table, an instance of a classifier.

base_learner

A wrapped learner.

imputer_constructor

An instance of a class derived from Constructor or a class with the same call operator.

dont_impute_classifier

If set and a table is given, the classifier is not wrapped into an imputer. This can be done if the classifier can handle missing values.

The learner is best illustrated by its code - here’s its complete __call__ method:

def __call__(self, data, weight=0):
    trained_imputer = self.imputer_constructor(data, weight)
    imputed_data = trained_imputer(data, weight)
    base_classifier = self.base_learner(imputed_data, weight)
    if self.dont_impute_classifier:
        return base_classifier
    else:
        return ImputeClassifier(base_classifier, trained_imputer)

During learning, ImputeLearner will first construct the imputer. It will then impute the data and call the given base_learner to construct a classifier. For instance, base_learner could be a learner for logistic regression and the result would be a logistic regression model. If the classifier can handle unknown values (that is, if dont_impute_classifier is set), it is returned as is; otherwise it is wrapped into ImputeClassifier, which holds the base classifier and the imputer used to impute the missing values in (testing) data.

class Orange.feature.imputation.ImputeClassifier

Objects of this class are returned by ImputeLearner when given data.

base_classifier

A wrapped classifier.

imputer

An imputer for imputation of unknown values.

__call__()

This class’s constructor accepts and stores two arguments, the classifier and the imputer. The call operator for classification looks like this:

def __call__(self, ex, what=orange.GetValue):
    return self.base_classifier(self.imputer(ex), what)

It imputes the missing values by calling the imputer and passes the imputed instance to the base classifier.

Note

In this setup the imputer is trained on the training data. Even during cross validation, the imputer will be trained on the right data. In the classification phase, the imputer will be used to impute testing data.

Code of ImputeLearner and ImputeClassifier

The learner is constructed with Orange.feature.imputation.ImputeLearner(base_learner=<someLearner>, imputer_constructor=<someImputerConstructor>). When given a data table, it trains the imputer, imputes the data, induces a base_classifier with the base_learner and constructs an ImputeClassifier that stores the base_classifier and the imputer. For classification, the missing values are imputed and the classifier’s prediction is returned.

This is slightly simplified code, omitting the handling of non-essential technical issues that are unrelated to imputation:

class ImputeLearner(orange.Learner):
    def __new__(cls, data = None, weightID = 0, **keyw):
        self = orange.Learner.__new__(cls, **keyw)
        self.__dict__.update(keyw)
        if data:
            return self.__call__(data, weightID)
        else:
            return self

    def __call__(self, data, weight=0):
        trained_imputer = self.imputer_constructor(data, weight)
        imputed_data = trained_imputer(data, weight)
        base_classifier = self.base_learner(imputed_data, weight)
        return ImputeClassifier(base_classifier, trained_imputer)

class ImputeClassifier(orange.Classifier):
    def __init__(self, base_classifier, imputer):
        self.base_classifier = base_classifier
        self.imputer = imputer

    def __call__(self, i, what=orange.GetValue):
        return self.base_classifier(self.imputer(i), what)

Write your own imputer

Imputation classes provide the Python-callback functionality. The simplest way to write custom imputation constructors or imputers is to write a Python function that behaves like the built-in Orange classes. For imputers, it is enough to write a function that gets an instance as an argument. Imputation for data tables will then use that function.

Special imputation procedures or separate procedures for various attributes, as demonstrated in the description of ModelConstructor, are achieved by encoding them in a constructor that accepts a data table and the id of the weight meta-attribute, and returns the imputer. The benefit of implementing an imputer constructor is that you can use it as a component for learners (for example, in logistic regression) or wrappers, and that way properly use the classifier in testing procedures.
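The shape of such a function-based constructor can be sketched in plain Python. This is an illustration under assumptions, not code that plugs into Orange as is: rows are lists of numbers, None marks a missing value, median_constructor is a hypothetical name, and the weight id is accepted but ignored.

```python
def median_constructor(table, weight_id=0):
    """Hypothetical constructor: learn column medians, return an imputer.

    weight_id mirrors the second argument a constructor receives; this
    sketch treats all rows as equally weighted and ignores it.
    """
    medians = []
    for i in range(len(table[0])):
        known = sorted(row[i] for row in table if row[i] is not None)
        # Upper median of the known values, or None if the column is empty.
        medians.append(known[len(known) // 2] if known else None)

    def imputer(row):
        # Return a new row; the original is left intact.
        return [m if v is None else v for v, m in zip(row, medians)]

    return imputer


# Made-up training table with gaps in both columns.
table = [[1, None], [3, 10], [None, 20], [5, 30]]
imputer = median_constructor(table)
filled = imputer([None, None])
kept = imputer([4, None])
```

Because the medians are captured in the closure, the returned imputer can be stored alongside a classifier and reused on testing instances without access to the training table.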