Variable Descriptors (variable)

Every variable is associated with a descriptor that stores its name and other properties. Descriptors serve three main purposes:

  • conversion of values from textual format (e.g. when reading files) to the internal representation and back (e.g. when writing files or printing out);

  • identification of variables: two variables from different datasets are considered to be the same if they have the same descriptor;

  • conversion of values between domains or datasets, for instance from continuous to discrete data, using a pre-computed transformation.

Descriptors are most often constructed when loading the data from files.

>>> from Orange.data import Table
>>> iris = Table("iris")

>>> iris.domain.class_var
DiscreteVariable('iris')
>>> iris.domain.class_var.values
['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

>>> iris.domain[0]
ContinuousVariable('sepal length')
>>> iris.domain[0].number_of_decimals
1

Some variables are derived from others. For instance, discretizing a continuous variable gives a new, discrete variable. The new variable can compute its values from the original one.

>>> from Orange.preprocess import DomainDiscretizer
>>> discretizer = DomainDiscretizer()
>>> d_iris = discretizer(iris)
>>> d_iris[0]
DiscreteVariable('D_sepal length')
>>> d_iris[0].values
['<5.2', '[5.2, 5.8)', '[5.8, 6.5)', '>=6.5']

See Derived variables for a detailed explanation.

Constructors

Orange maintains lists of existing descriptors for variables. This facilitates the reuse of descriptors: if two datasets refer to the same variables, they should be assigned the same descriptors so that, for instance, a model trained on one dataset can make predictions for the other.

Variable descriptors are seldom constructed in user scripts. When needed, this can be done by calling the constructor directly or by calling the class method make. The difference is that make returns an existing descriptor if one with the same name exists and matches the other conditions, such as having the prescribed list of discrete values for DiscreteVariable:

>>> from Orange.data import ContinuousVariable
>>> age = ContinuousVariable.make("age")
>>> age1 = ContinuousVariable.make("age")
>>> age2 = ContinuousVariable("age")
>>> age is age1
True
>>> age is age2
False

The first line finds no existing descriptor for a continuous variable named "age", so it returns a new one. The second reuses the first descriptor. The last creates a new one since the constructor is invoked directly.

The distinction does not matter in most cases, but it is important when loading the data from different files. Orange uses the make constructor when loading data.
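The difference between make and the plain constructor can be illustrated without Orange. The following is a minimal sketch, assuming a registry keyed by name only; the class SketchVariable and its _registry attribute are hypothetical and are not Orange's implementation, which also checks other conditions such as the list of values.

```python
class SketchVariable:
    """Hypothetical stand-in for a descriptor; not Orange's implementation."""

    _registry = {}  # maps a name to the first descriptor made with that name

    def __init__(self, name):
        # Calling the constructor directly always creates a fresh object
        # and never consults the registry.
        self.name = name

    @classmethod
    def make(cls, name):
        # Reuse the descriptor registered under this name, if any;
        # otherwise construct one and remember it.
        if name not in cls._registry:
            cls._registry[name] = cls(name)
        return cls._registry[name]


age = SketchVariable.make("age")
age1 = SketchVariable.make("age")
age2 = SketchVariable("age")
print(age is age1)  # True
print(age is age2)  # False
```

Because identity, not just the name, decides whether two variables are the same, only the objects returned by make are interchangeable across datasets in this sketch.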

Base class

class Orange.data.Variable(name='', compute_value=None, *, sparse=False)[source]

The base class for variable descriptors contains the variable's name and some basic properties.

name

The name of the variable.

unknown_str

A set of values that represent unknowns in conversion from textual formats. Default is {"?", ".", "", "NA", "~", None}.

compute_value

A function for computing the variable's value when converting from another domain which does not contain this variable. The function will be called with a data set (Orange.data.Table) and has to return an array of computed values for all its instances. The base class defines a static method compute_value, which returns Unknown. Non-primitive variables must redefine it to return None.

sparse

A flag about sparsity of the variable. When set, the variable suggests it should be stored in a sparse matrix.

source_variable

An optional descriptor of the source variable - if any - from which this variable is derived and computed via compute_value.

attributes

A dictionary with user-defined attributes of the variable.

classmethod is_primitive(var=None)[source]

True if the variable's values are stored as floats. Non-primitive variables can appear in the data only as meta attributes.

static str_val(val)

Return a textual representation of variable's value val. Argument val must be a float (for primitive variables) or an arbitrary Python object (for non-primitives).

Derived classes must overload the function.

to_val(s)[source]

Convert the given argument to a value of the variable. The argument can be a string, a number or None. For primitive variables, the base class provides a method that returns Unknown if s is found in unknown_str, and raises an exception otherwise. For non-primitive variables it returns the argument itself.

Derived classes of primitive variables must overload the function.

Parameters:

s (str, float or None) -- value, represented as a number, string or None

Return type:

float or object
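The base-class behaviour of to_val described above can be sketched in a few lines. This is an illustration only: the function base_to_val and its primitive flag are hypothetical, the set of unknown strings is the default quoted for unknown_str, and the sketch assumes that Unknown is represented as NaN.

```python
import math

UNKNOWN = float("nan")  # assumption: unknown values are encoded as NaN
UNKNOWN_STRINGS = {"?", ".", "", "NA", "~", None}  # default quoted above


def base_to_val(s, primitive=True):
    """Hypothetical sketch of the base-class behaviour described above."""
    if not primitive:
        # Non-primitive variables hand the argument back unchanged.
        return s
    if s in UNKNOWN_STRINGS:
        return UNKNOWN
    # The base class cannot interpret anything else; derived classes
    # of primitive variables are expected to handle it.
    raise ValueError(f"cannot convert {s!r}")


print(math.isnan(base_to_val("?")))          # True
print(base_to_val("abc", primitive=False))   # abc
```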

val_from_str_add(s)[source]

Convert the given string to a value of the variable. The method is similar to to_val except that it only accepts strings and that it adds new values to the variable's domain where applicable.

The base class method calls to_val.

Parameters:

s (str) -- symbolic representation of the value

Return type:

float or object

Continuous variables

class Orange.data.ContinuousVariable(name='', number_of_decimals=None, compute_value=None, *, sparse=False)[source]

Descriptor for continuous variables.

number_of_decimals

The number of decimals when the value is printed out (default: 3).

adjust_decimals

A flag regulating whether the number_of_decimals is being adjusted by to_val.

Initially, number_of_decimals is set to 3 and adjust_decimals is set to 2. When val_from_str_add is called for the first time with a string as an argument, number_of_decimals is set to the number of decimals in the string and adjust_decimals is set to 1. In subsequent calls of to_val, the number of decimals is increased if the string argument has a larger number of decimals.

If the number_of_decimals is set manually, adjust_decimals is set to 0 to prevent changes by to_val.
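The three states of adjust_decimals can be followed in a small stand-alone sketch. The class DecimalsTracker is hypothetical and only mirrors the bookkeeping described above; it counts the characters after the decimal point and, as a simplification, does not handle exponent notation.

```python
class DecimalsTracker:
    """Hypothetical sketch of the bookkeeping described above."""

    def __init__(self):
        self._number_of_decimals = 3
        self.adjust_decimals = 2  # 2: no string has been seen yet

    @property
    def number_of_decimals(self):
        return self._number_of_decimals

    @number_of_decimals.setter
    def number_of_decimals(self, n):
        # A manual setting switches further adjustment off.
        self._number_of_decimals = n
        self.adjust_decimals = 0

    def val_from_str_add(self, s):
        if self.adjust_decimals:
            # Count the characters after the decimal point, if there is one.
            ndec = len(s.strip().partition(".")[2])
            if self.adjust_decimals == 2 or ndec > self._number_of_decimals:
                self._number_of_decimals = ndec
            self.adjust_decimals = 1  # 1: only increases are allowed now
        return float(s)


var = DecimalsTracker()
var.val_from_str_add("5.1")    # first string: set to 1 decimal
var.val_from_str_add("4.25")   # more decimals: raised to 2
var.val_from_str_add("7")      # fewer decimals: unchanged
print(var.number_of_decimals)  # 2
```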

classmethod make(name, *args, **kwargs)

Return an existing continuous variable with the given name, or construct and return a new one.

classmethod is_primitive(var=None)

True if the variable's values are stored as floats. Non-primitive variables can appear in the data only as meta attributes.

str_val(val: float)

Return the value as a string with the prescribed number of decimals.

to_val(s)[source]

Convert a value, given as an instance of an arbitrary type, to a float.

val_from_str_add(s)[source]

Convert a value from a string and adjust the number of decimals if adjust_decimals is non-zero.

Discrete variables

class Orange.data.DiscreteVariable(name='', values=(), compute_value=None, *, sparse=False)[source]

Descriptor for symbolic, discrete variables. Values of discrete variables are stored as floats; the numbers correspond to indices in the list of values.

values

A list of variable's values.

classmethod make(name, *args, **kwargs)

Return an existing discrete variable with the given name, or construct and return a new one.

classmethod is_primitive(var=None)

True if the variable's values are stored as floats. Non-primitive variables can appear in the data only as meta attributes.

str_val(val)

Return a textual representation of the value (self.values[int(val)]) or "?" if the value is unknown.

Parameters:

val (float (should be whole number)) -- value

Return type:

str

to_val(s)[source]

Convert the given argument to a value of the variable (float). If the argument is numeric, its value is returned without checking whether it is integer and within bounds. Unknown is returned if the argument is one of the representations for unknown values. Otherwise, the argument must be a string and the method returns its index in values.

Parameters:

s -- values, represented as a number, string or None

Return type:

float

val_from_str_add(s)[source]

Similar to to_val, except that it accepts only strings and that it adds the value to the list if it does not exist yet.

Parameters:

s (str) -- symbolic representation of the value

Return type:

float
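How str_val, to_val and val_from_str_add relate to the list of values can be shown with a stand-alone sketch. The class DiscreteSketch is hypothetical and not Orange's implementation; it assumes that unknown values are encoded as NaN and uses the default set of unknown strings quoted for unknown_str.

```python
import math

UNKNOWN_STRINGS = {"?", ".", "", "NA", "~", None}


class DiscreteSketch:
    """Hypothetical sketch; values are float indices into `values`."""

    def __init__(self, values=()):
        self.values = list(values)

    def str_val(self, val):
        return "?" if math.isnan(val) else self.values[int(val)]

    def to_val(self, s):
        if isinstance(s, (int, float)):
            return float(s)  # numbers pass through without a range check
        if s in UNKNOWN_STRINGS:
            return float("nan")
        return float(self.values.index(s))  # ValueError if s is not listed

    def val_from_str_add(self, s):
        if s in UNKNOWN_STRINGS:
            return float("nan")
        if s not in self.values:
            self.values.append(s)  # extend the list on first sight
        return float(self.values.index(s))


iris_class = DiscreteSketch(["Iris-setosa", "Iris-versicolor"])
print(iris_class.to_val("Iris-versicolor"))           # 1.0
print(iris_class.val_from_str_add("Iris-virginica"))  # 2.0
print(iris_class.str_val(2.0))                        # Iris-virginica
print(iris_class.str_val(iris_class.to_val("?")))     # ?
```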

String variables

class Orange.data.StringVariable(name='', compute_value=None, *, sparse=False)[source]

Descriptor for string variables. String variables can only appear as meta attributes.

classmethod make(name, *args, **kwargs)

Return an existing string variable with the given name, or construct and return a new one.

classmethod is_primitive(var=None)

True if the variable's values are stored as floats. Non-primitive variables can appear in the data only as meta attributes.

static str_val(val)[source]

Return a string representation of the value.

to_val(s)[source]

Return the value as a string. If it is already a string, the same object is returned.

val_from_str_add(s)

Return the value as a string. If it is already a string, the same object is returned.

Time variables

Time variables are continuous variables with value 0 on the Unix epoch, 1 January 1970 00:00:00.0 UTC. Positive numbers represent dates after the epoch, and negative numbers represent dates before it. Due to a limitation of Python's datetime module, only dates in 1 A.D. or later are supported. Note that Orange's Table stores datetime values as Unix timestamps (seconds from 1970-01-01), thus Table.from_numpy expects values in this format.

Orange's TimeVariable supports storing either date, time, or a combination of both:

  • TimeVariable("Timestamp", have_date=True) stores only date information -- it is analogous to datetime.date

  • TimeVariable("Timestamp", have_time=True) stores only time information (without date) -- it is analogous to datetime.time

  • TimeVariable("Timestamp", have_time=True, have_date=True) stores date and time -- it is analogous to datetime.datetime

When the parse method is used to parse datetimes from a string, it is not necessary to set the have_time and have_date attributes since they will be inferred from the datetimes.

class Orange.data.TimeVariable(*args, have_date=0, have_time=0, **kwargs)[source]

TimeVariable is a continuous variable with Unix epoch (1970-01-01 00:00:00+0000) as the origin (0.0). Later dates are positive real numbers (equivalent to Unix timestamp, with microseconds in the fraction part), and the dates before it map to the negative real numbers.

Due to a limitation of Python's datetime module, only dates with year >= 1 (A.D.) are supported.

If time is specified without a date, Unix epoch is assumed.

If time is specified without a UTC offset, local time is assumed.

parse(datestr)[source]

Return datestr, a datetime provided in one of ISO 8601 formats, parsed as a real number. Value 0 marks the Unix epoch, positive values are the dates after it, negative before.

If date is unspecified, epoch date is assumed.

If time is unspecified, 00:00:00.0 is assumed.

If timezone is unspecified, local time is assumed.
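The mapping from a datetime string to a real number can be reproduced with the standard library alone. This is a minimal sketch, not the parse method itself: the function parse_iso is hypothetical, it accepts only the subset of ISO 8601 understood by datetime.fromisoformat, and it does not handle strings that contain a time without a date.

```python
from datetime import datetime


def parse_iso(datestr):
    """Hypothetical sketch: ISO 8601 string -> seconds since the Unix epoch."""
    # fromisoformat fills in 00:00:00 when only a date is given.
    dt = datetime.fromisoformat(datestr)
    # For a datetime without a UTC offset, timestamp() assumes local time,
    # which mirrors the rule stated above.
    return dt.timestamp()


print(parse_iso("1970-01-01T00:00:00+00:00"))  # 0.0
print(parse_iso("1970-01-02T00:00:00+00:00"))  # 86400.0
print(parse_iso("1969-12-31T23:59:59+00:00"))  # -1.0
```

The examples carry an explicit UTC offset so that the results do not depend on the time zone of the machine running them.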

Derived variables

The compute_value mechanism is used throughout Orange to compute preprocessing on the training data and to apply the same transformations to the testing data without hassle.

The compute_value method is usually invoked behind the scenes when converting between domains. Such conversions are typically implemented within the provided wrappers and cross-validation schemes.

Derived variables in Orange

Orange saves variable transformations into the domain as compute_value functions. If Orange was not using compute_value, we would have to manually transform the data:

>>> import Orange
>>> from Orange.data import Domain, ContinuousVariable
>>> data = Orange.data.Table("iris")
>>> train = data[::2]  # every second row, starting with the first
>>> test = data[1::2]  # the remaining rows

We will create a new data set with a single feature, "petals", that will be a sum of petal lengths and widths:

>>> petals = ContinuousVariable("petals")
>>> derived_train = train.transform(Domain([petals],
...                                 data.domain.class_vars))
>>> derived_train.X = train[:, "petal width"].X + \
...                   train[:, "petal length"].X

We have set Table's X directly. Next, we build and evaluate a classification tree:

>>> learner = Orange.classification.TreeLearner()
>>> from Orange.evaluation import CrossValidation, TestOnTestData
>>> res = CrossValidation(derived_train, [learner], k=5)
>>> Orange.evaluation.scoring.CA(res)[0]
0.88
>>> res = TestOnTestData(derived_train, test, [learner])
>>> Orange.evaluation.scoring.CA(res)[0]
0.3333333333333333

A classification tree shows good accuracy with cross validation, but not on separate test data, because Orange cannot reconstruct the "petals" feature for the test data; we would have to reconstruct it ourselves. But if we define compute_value and therefore store the transformation in the domain, Orange can transform both training and test data:

>>> petals = ContinuousVariable("petals",
...    compute_value=lambda data: data[:, "petal width"].X + \
...                               data[:, "petal length"].X)
>>> derived_train = train.transform(Domain([petals],
...                                 data.domain.class_vars))
>>> res = TestOnTestData(derived_train, test, [learner])
>>> Orange.evaluation.scoring.CA(res)[0]
0.9733333333333334

All preprocessors in Orange use compute_value.
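The idea of storing the transformation with the variable can be reduced to a few lines of plain Python. The sketch below is a simplification under stated assumptions: Var and transform are hypothetical names, a table is modelled as a dictionary of columns, and compute_value receives that dictionary instead of an Orange.data.Table.

```python
class Var:
    """Hypothetical descriptor: a name plus an optional compute_value."""

    def __init__(self, name, compute_value=None):
        self.name = name
        self.compute_value = compute_value


def transform(columns, target_vars):
    """Build the target columns from a dict that maps names to value lists."""
    result = {}
    for var in target_vars:
        if var.name in columns:
            result[var.name] = list(columns[var.name])     # copy as is
        else:
            result[var.name] = var.compute_value(columns)  # derive it
    return result


petals = Var(
    "petals",
    compute_value=lambda cols: [
        w + l for w, l in zip(cols["petal width"], cols["petal length"])
    ],
)

train = {"petal length": [1.5, 4.5], "petal width": [0.25, 1.5]}
test = {"petal length": [1.25, 6.0], "petal width": [0.25, 2.5]}

# The same descriptor derives the feature for both tables.
print(transform(train, [petals]))  # {'petals': [1.75, 6.0]}
print(transform(test, [petals]))   # {'petals': [1.5, 8.5]}
```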

Example with discretization

The following example converts features to discrete:

>>> iris = Orange.data.Table("iris")
>>> iris_1 = iris[::2]
>>> discretizer = Orange.preprocess.DomainDiscretizer()
>>> d_iris_1 = discretizer(iris_1)

The code loads a dataset and creates a new table from every second instance. The discretizer is then applied to this table and uses the same data to set the discretization intervals.

The discretized variable "D_sepal length" stores a function that converts continuous values into discrete ones:

>>> d_iris_1[0]
DiscreteVariable('D_sepal length')
>>> d_iris_1[0].compute_value
<Orange.feature.discretization.Discretizer at 0x10d5108d0>

The function is used for converting the remaining data (as automatically happens within model validation in Orange):

>>> iris_2 = iris[1::2]  # previously unselected
>>> d_iris_2 = iris_2.transform(d_iris_1.domain)
>>> d_iris_2[0]
[<5.2, [2.8, 3), <1.6, <0.2 | Iris-setosa]

The code transforms previously unused data into the discrete domain d_iris_1.domain. Behind the scenes, the values for the destination domain that are not yet in the source domain (iris_2.domain) are computed with the destination variables' compute_value.
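What such a stored function might look like can be sketched with the standard library. The names equal_width_points and ThresholdDiscretizer are hypothetical, and equal-width intervals are an assumption chosen for brevity; the discretization method that Orange uses by default is not shown here. The cut points are computed once from the training values and then reused for data that was not seen before.

```python
from bisect import bisect_right


def equal_width_points(values, n):
    """Cut points that split the range of `values` into n equal intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n
    return [lo + i * step for i in range(1, n)]


class ThresholdDiscretizer:
    """Hypothetical compute_value callable holding pre-computed cut points."""

    def __init__(self, column, points):
        self.column = column
        self.points = points

    def __call__(self, columns):
        # Index of the interval each value falls into: 0 for
        # x < points[0], 1 for points[0] <= x < points[1], and so on.
        return [float(bisect_right(self.points, x))
                for x in columns[self.column]]


train = {"sepal length": [4.0, 5.0, 8.0]}
points = equal_width_points(train["sepal length"], 4)
discretize = ThresholdDiscretizer("sepal length", points)

unseen = {"sepal length": [4.5, 5.0, 7.5, 9.0]}
print(points)              # [5.0, 6.0, 7.0]
print(discretize(unseen))  # [0.0, 1.0, 3.0, 3.0]
```

Values outside the training range, such as 9.0 above, simply fall into the last interval because the cut points are not recomputed.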

Optimization for repeated computation

Some transformations share parts of the computation across variables. For example, PCA uses all input features to compute the PCA transform. If each output PCA component were implemented with an ordinary compute_value, the PCA transform would be recomputed for each PCA component. To avoid repeated computation, set compute_value to a subclass of SharedComputeValue.

class Orange.data.util.SharedComputeValue(compute_shared, variable=None)[source]

A base class that separates compute_value computation for different variables into shared and specific parts.

Parameters:
  • compute_shared (Callable[[Orange.data.Table], object]) -- A callable that performs computation that is shared between multiple variables. Variables sharing computation need to set the same instance.

  • variable (Orange.data.Variable) -- The original variable on which this compute value is set. Optional.

compute(data, shared_data)[source]

Given precomputed shared data, perform variable-specific part of computation and return new variable values. Subclasses need to implement this function.

The following example creates normalized features that divide values by row sums and then transforms the data. In the example, the function row_sum is called only once; if we did not use SharedComputeValue, row_sum would be called four times, once for each feature.

import Orange

iris = Orange.data.Table("iris")

def row_sum(data):
    return data.X.sum(axis=1, keepdims=True)

class DivideWithMean(Orange.data.util.SharedComputeValue):

    def __init__(self, var, fn):
        super().__init__(fn)
        self.var = var

    def compute(self, data, shared_data):
        return data[:, self.var].X / shared_data

divided_attributes = [
    Orange.data.ContinuousVariable(
        "Divided " + attr.name,
        compute_value=DivideWithMean(attr, row_sum)
    ) for attr in iris.domain.attributes]

divided_domain = Orange.data.Domain(
    divided_attributes,
    iris.domain.class_vars
)

divided_iris = iris.transform(divided_domain)
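The saving can be made visible with a stand-alone sketch that counts calls to the shared function. Everything here is hypothetical and simplified: SharedSketch, DivideBy and transform are illustrative names, a table is a list of rows, and the cache keyed by the shared callable only approximates how a transformation might reuse the shared result.

```python
class SharedSketch:
    """Hypothetical base: splits work into a shared and a specific part."""

    def __init__(self, compute_shared):
        self.compute_shared = compute_shared

    def compute(self, data, shared_data):
        raise NotImplementedError


class DivideBy(SharedSketch):
    def __init__(self, column, compute_shared):
        super().__init__(compute_shared)
        self.column = column

    def compute(self, data, shared_data):
        # Variable-specific part: divide one column by the shared values.
        return [row[self.column] / s for row, s in zip(data, shared_data)]


def transform(data, compute_values):
    cache = {}  # one entry per distinct compute_shared callable
    columns = []
    for cv in compute_values:
        if cv.compute_shared not in cache:
            cache[cv.compute_shared] = cv.compute_shared(data)
        columns.append(cv.compute(data, cache[cv.compute_shared]))
    return columns


calls = {"row_sum": 0}


def row_sum(data):
    calls["row_sum"] += 1
    return [sum(row) for row in data]


data = [[1.0, 1.0, 2.0, 4.0], [2.0, 2.0, 2.0, 2.0]]
divided = transform(data, [DivideBy(i, row_sum) for i in range(4)])
print(divided[3])        # [0.5, 0.25]
print(calls["row_sum"])  # 1
```

All four compute values hold the same row_sum function, which is why the cache is hit three times; passing four different callables would defeat the sharing.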