Data table (Table)

Class Orange.data.Table holds a list of data instances of type Orange.data.Instance. All instances belong to the same domain (Orange.data.Domain).

Data tables are usually loaded from a file (see Loading and saving data):

import Orange
data = Orange.data.Table("titanic")

Data tables can also be created programmatically, as in the code below.

Table supports most list-like operations: getting, setting and removing data instances, as well as the methods append and extend. When setting an item, the new value must be either an instance of the correct type or a Python list of appropriate length and content, which is converted into a data instance of the corresponding domain. Retrieving data instances returns references, not copies: changing the retrieved instance changes the data in the table. Slicing returns ordinary Python lists containing references to data instances, not a new Orange.data.Table.

Following Python convention, an empty data table evaluates to False.
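The reference semantics described above can be illustrated without Orange. The sketch below is only an analogy: plain nested Python lists stand in for a table and its instances, and none of the names belong to the Orange API.

```python
# Plain-Python analogy (not Orange code): a list of lists stands in for a
# table, and each inner list stands in for a data instance.
table = [["young", "no"], ["old", "yes"], ["old", "no"]]

# Indexing returns a reference, so editing the retrieved row edits the table.
row = table[0]
row[1] = "yes"

# Slicing builds a new ordinary list whose elements are still references
# to the same rows.
first_two = table[:2]
first_two[1][0] = "young"

# An empty container is falsy, which is the convention the data table follows.
empty_table = []
table_is_empty = not empty_table
```

After running this, `table[0]` is `["young", "yes"]` and `table[1][0]` is `"young"`, because neither indexing nor slicing copied the rows.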

class Orange.data.Table

domain: The domain to which the instances belong. This attribute is read-only.

owns_instances: True if the table contains the data instances and False if it contains references to instances owned by another table.

owner: The actual owner of the data when owns_instances is False; None otherwise.

version: An integer that is increased when instances are added or removed from the table. It does not detect changes of the data.

random_generator: Random generator that is used by method random_instance. If the method is called and random_generator is None, a new generator is constructed with random seed 0 and stored here for future use.

attribute_load_status: If the table was loaded from a file, this list of flags tells whether the feature descriptors were reused and how they matched. See descriptor reuse for details.

meta_attribute_load_status: A dictionary holding the same information for meta attributes, with keys corresponding to their ids and values to load statuses.

__init__(filename[, create_new_on])

Read data from the given file. If the name includes the extension it must be one of the known file formats (see Loading and saving data). If no extension is given, the directory is searched for any file with recognized extensions. If the file is not found, Orange will also search the directories specified in the environment variable ORANGE_DATA_PATH.
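The lookup order described above can be sketched in plain Python. This is not Orange's implementation: the function name `find_data_file`, the extension list and the use of `os.pathsep` to split ORANGE_DATA_PATH are all assumptions made for illustration.

```python
import os

# Hypothetical list of recognized extensions; the real list is given in
# "Loading and saving data" and may differ.
KNOWN_EXTENSIONS = [".tab", ".csv", ".txt"]


def find_data_file(name, extensions=KNOWN_EXTENSIONS, environ=os.environ):
    """Return the path of the first matching file, or None if none is found."""
    # Search the given location first, then the ORANGE_DATA_PATH directories
    # (splitting on os.pathsep is an assumption of this sketch).
    extra = environ.get("ORANGE_DATA_PATH", "")
    directories = [""] + [d for d in extra.split(os.pathsep) if d]
    # A name with a known extension is used as is; otherwise every known
    # extension is tried in turn.
    has_extension = os.path.splitext(name)[1].lower() in extensions
    candidates = [name] if has_extension else [name + ext for ext in extensions]
    for directory in directories:
        for candidate in candidates:
            path = os.path.join(directory, candidate)
            if os.path.isfile(path):
                return path
    return None
```

With this sketch, `find_data_file("titanic")` would first look for `titanic.tab`, `titanic.csv` and `titanic.txt` in the current directory and then in each directory listed in ORANGE_DATA_PATH.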

The optional flag create_new_on decides when variable descriptors are reused. See descriptor reuse for more details.

Parameters:
    filename (str) – the name of the file
    create_new_on (int) – flag specifying when to reuse existing descriptors

__init__(domain)

Construct an empty data table with the given domain.

import Orange

cards = [3, 3, 2, 3, 4, 2]
values = ["1", "2", "3", "4"]

features = [Orange.feature.Discrete(name, values=values[:card])
              for name, card in zip("abcdef", cards)]
classattr = Orange.feature.Discrete("y", values=["0", "1"])
domain = Orange.data.Domain(features + [classattr])
data = Orange.data.Table(domain)

The example continues.

Parameters:	domain (Orange.data.Domain) – domain descriptor

__init__(instances[, references])

Construct a new data table containing the given data instances. These can be given either as another Table or as a list whose elements are instances of Orange.data.Instance or lists of values.

If the optional second argument is True, the first argument must be a Table. The new table will contain references to data stored in the given table. If the second argument is omitted or False, data instances are copied.

Parameters:
    instances (Table or list) – data instances
    references (bool) – if `True`, the new table contains references

__init__(domain, instances)

Construct a new data table with the given domain and initialize it with the given instances. Instances can be given as a Table (if the domains do not match, the instances are converted) or as a list containing either instances of Orange.data.Instance or lists.

This constructor can also be used for conversion from numpy arrays. The argument instances can be a numpy array. The number of variables in the domain must match the number of columns.

Parameters:
    domain (Orange.data.Domain) – domain descriptor
    instances (Table or list or numpy.array) – data instances

The following example fills the data table created above with some data from a list.

loe = [["3", "1", "1", "2", "1", "1", "1"],
       ["3", "1", "1", "2", "2", "1", "0"],
       ["3", "3", "1", "2", "2", "1", "1"]
      ]

d2 = Orange.data.Table(domain, loe)

The following example shows initializing a data table from numpy array.

import numpy
d = Orange.data.Domain([Orange.feature.Continuous('a%i' % x) for x in range(5)])
a = numpy.array([[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]])
t = Orange.data.Table(d, a)

__init__(tables)

Construct a table by combining data instances from a list of tables. All tables must have the same length. Domains are combined so that each (ordinary) feature appears only once in the resulting table. The class attribute is the last class attribute in the list of tables, while all other class attributes are added as ordinary features. For instance, if three tables are merged but the last one is class-less, the class attribute for the new table will come from the second table. Meta attributes for the new domain are merged based on ids: if the same attribute appears under two ids, it is added twice. If, conversely, the same id is used for two different attributes in two tables, an exception is raised. As instances are merged, an exception is raised if a feature or a meta attribute that appears in multiple tables does not have the same value in all of them; the feature is allowed to have a missing value in one or more (or all) tables.

Note that this is not SQL's join operator: it does not look for matching rows between the tables but merges them row by row.

Parameters:	tables (list of instances of `Table`) – tables to be merged into the new table
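The row-by-row merging rule can be sketched in plain Python. This is an illustration only, not Orange's implementation: rows are modelled as dictionaries, `None` stands for a missing value, and the names `merge_rows` and `merge_tables` are hypothetical.

```python
def merge_rows(rows):
    """Merge dicts that describe the same instance in several tables."""
    merged = {}
    for row in rows:
        for name, value in row.items():
            if value is None:
                # A missing value never conflicts; keep whatever is known.
                merged.setdefault(name, None)
            elif merged.get(name) is None:
                merged[name] = value
            elif merged[name] != value:
                raise ValueError("conflicting values for %r" % name)
    return merged


def merge_tables(tables):
    """Merge tables row by row; all tables must have the same length."""
    if len(set(len(table) for table in tables)) > 1:
        raise ValueError("tables differ in length")
    # zip pairs the i-th rows of all tables; no matching on keys is attempted.
    return [merge_rows(rows) for rows in zip(*tables)]


table1 = [{"a1": 1, "a2": 2}, {"a1": 5, "a2": 6}]
table2 = [{"a1": 1, "a3": 2.5}, {"a1": 5, "a3": None}]
merged_rows = merge_tables([table1, table2])
```

Here `merged_rows[0]` is `{"a1": 1, "a2": 2, "a3": 2.5}`. Changing `a1` in the second table to a different value would raise an error, because the shared feature would no longer agree across tables.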

For example, suppose the file merge1.tab contains:

a1    a2    m1    m2
f     f     f     f
            meta  meta
1     2     3     4
5     6     7     8
9     10    11    12

and merge2.tab contains:

a1    a3    m1     m3
f     f     f      f
            meta   meta
1     2.5   3      4.5
5     6.5   7      8.5
9     10.5  11     12.5

The two tables can be loaded, merged and printed out by the following script.

import Orange

data1 = Orange.data.Table("merge1.tab")
data2 = Orange.data.Table("merge2.tab")

merged = Orange.data.Table([data1, data2])

print "Domain 1: ", data1.domain
print "Domain 2: ", data2.domain
print "Merged:   ", merged.domain
print
for i in range(len(data1)):
    print "  ", data1[i]
    print " +", data2[i]
    print "->", merged[i]
    print

This is what the output looks like:

Domain 1:  [a1, a2], {-2:m1, -3:m2}
Domain 2:  [a1, a3], {-2:m1, -4:m3}
Merged:    [a1, a2, a3], {-2:m1, -3:m2, -4:m3}

   [1, 2], {"m1":3, "m2":4}
 + [1, 2.5], {"m1":3, "m3":4.5}
-> [1, 2, 2.5], {"m1":3, "m2":4, "m3":4.5}

   [5, 6], {"m1":7, "m2":8}
 + [5, 6.5], {"m1":7, "m3":8.5}
-> [5, 6, 6.5], {"m1":7, "m2":8, "m3":8.5}

   [9, 10], {"m1":11, "m2":12}
 + [9, 10.5], {"m1":11, "m3":12.5}
-> [9, 10, 10.5], {"m1":11, "m2":12, "m3":12.5}

Merging succeeds since the values of a1 and m1 are the same for all matching instances from both tables.

append(instance)

Append the given instance to the end of the table.

Parameters:	instance (`Orange.data.Instance` or a list) – instance to be appended

import random

# Append five random instances; the class value is derived from the features.
for i in range(5):
    inst = [random.randint(0, c - 1) for c in cards]
    inst.append(inst[0] == inst[1] or inst[4] == 0)
    data.append(inst)

extend(instances)

Append the given list of instances to the end of the table.

Parameters:	instances (list) – instances to be appended

select(folds[, select, negate=False])

Return a subset of instances as a new Table. The first argument should be a list of the same length as the table; its elements should be integers or bools. The resulting table contains instances corresponding to non-zero elements of the list.

If the second argument is given, it must be an integer; method select will then return the data instances for which the corresponding elements of folds match the value of the argument select.

The third argument, negate, inverts the selection. It can only be given as a keyword.

Note: This method should be used when the selected data instances are going to be modified later on. In all other cases, method select_ref is preferred.

Parameters:
    folds (list) – list of fold indices corresponding to data instances
    select (int) – select which instances to pick
    negate (bool) – inverts the selection
Return type:	`Orange.data.Table`
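The selection rule can be written out in a few lines of plain Python. This is only a sketch of the semantics described above, not Orange's code: rows are ordinary lists and the function `select_rows` is a hypothetical stand-in for the method.

```python
def select_rows(rows, folds, select=None, negate=False):
    """Return copies of the rows picked by the fold list."""
    if len(folds) != len(rows):
        raise ValueError("folds must be as long as the table")
    picked = []
    for row, fold in zip(rows, folds):
        # Without `select`, any non-zero element picks the row;
        # with it, the element must equal the given fold index.
        match = bool(fold) if select is None else fold == select
        if match != bool(negate):
            picked.append(list(row))  # copy, since select returns its own data
    return picked


rows = [[i] for i in range(10)]
folds = [1, 0, 2, 2, 0, 1, 0, 3, 1, 3]
test_rows = select_rows(rows, folds, 0)
train_rows = select_rows(rows, folds, 0, negate=True)
```

With these fold indices, `test_rows` holds the rows 1, 4 and 6, and `train_rows` holds the remaining seven rows.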

One common use of this method is to split the data into folds. A list for the first argument can be prepared using Orange.data.sample.SubsetIndicesCV. The following example prepares a simple data table and indices for four-fold cross validation, and then selects the training and testing sets for each fold.

import Orange

domain = Orange.data.Domain([Orange.feature.Continuous()])
data = Orange.data.Table(domain)
for i in range(10):
    data.append([i])

cv_indices = Orange.data.sample.SubsetIndicesCV(data, 4)
print "Indices: ", cv_indices, "\n"

for fold in range(4):
    train = data.select(cv_indices, fold, negate = 1)
    test  = data.select(cv_indices, fold)
    print "Fold %d: train " % fold
    for inst in train:
        print "    ", inst
    print
    print "      : test  "
    for inst in test:
        print "    ", inst
    print

The printout begins with:

Indices:  <1, 0, 2, 2, 0, 1, 0, 3, 1, 3>

Fold 0: train
     [0.000000]
     [2.000000]
     [3.000000]
     [5.000000]
     [7.000000]
     [8.000000]
     [9.000000]

      : test
     [1.000000]
     [4.000000]
     [6.000000]

Another way of calling the method is to pass a vector of zeros and ones.

t = data.select([1, 1, 0, 0, 0,  0, 0, 0, 0, 1])
for inst in t:
    print inst

This prints out:

[0.000000]
[1.000000]
[9.000000]

select_ref(folds[, select, negate=False])

Same as select, except that the resulting table contains references to data instances in the original table instead of its own copy of data.

In most cases, this function is preferred over the former since it consumes less memory.

Parameters:
    folds (list) – list of fold indices corresponding to data instances
    select (int) – select which instances to pick
    negate (bool) – inverts the selection
Return type:	`Orange.data.Table`

get_items(indices)

Return a table with data instances indicated by indices. For instance, data.get_items([0, 1, 9]) returns a table with instances with indices 0, 1 and 9.

This function is useful when data is going to be modified. If not, use get_items_ref.

Parameters:	indices (list of int) – indices of selected data instances
Return type:	`Orange.data.Table`

get_items_ref(indices)

Same as get_items, except that it returns a table with references to data instances. This method is usually preferred over get_items.

Parameters:	indices (list of int) – indices of selected data instances
Return type:	`Orange.data.Table`

filter(conditions)

Return a table with data instances matching the criteria. These can be given in the form of keyword arguments or a dictionary; with the latter, an additional keyword argument negate can be given to reverse the selection.

Note that method filter_ref is more memory efficient and should be preferred when data is not going to be modified.

Young patients from the lenses data set can be selected by

young = data.filter(age="young")

More than one value can be allowed and more than one attribute checked. This selects all patients with age “young” or “presbyopic” who are astigmatic:

young = data.filter(age=["young", "presbyopic"], astigm="y")

The following has the same effect:

young = data.filter({"age": ["young", "presbyopic"],
                    "astigm": "y"})

Selection can be reversed only in the latter form, by adding a keyword argument negate with value 1:

young = data.filter({"age": ["young", "presbyopic"],
                    "astigm": "y"},
                    negate=1)

Filters for continuous features are specified by pairs of values. In dataset “bridges”, bridges with lengths between 1000 and 2000 (inclusive) are selected by

mid = data.filter(LENGTH=(1000, 2000))

Bridges that are shorter or longer than that can be selected by inverting the range.

mid = data.filter(LENGTH=(2000, 1000))
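The meaning of ordinary and inverted ranges can be sketched as a small predicate. This is an illustration in plain Python, not Orange's filter code; the helper `in_range` is hypothetical, and treating the bounds as excluded in the inverted case is an assumption, since the text above only states that the ordinary range is inclusive.

```python
def in_range(value, bounds):
    """Return True if value passes a (low, high) range condition."""
    low, high = bounds
    if low <= high:
        # Ordinary range: accept values between the bounds, inclusive.
        return low <= value <= high
    # Inverted range: accept values outside the interval
    # (bounds excluded here, which is an assumption of this sketch).
    return value < high or value > low


lengths = [800, 1000, 1500, 2000, 2400]
inside = [x for x in lengths if in_range(x, (1000, 2000))]
outside = [x for x in lengths if in_range(x, (2000, 1000))]
```

For these lengths, `inside` is `[1000, 1500, 2000]` and `outside` is `[800, 2400]`.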

filter(filt): Similar to above, except that conditions are given as Orange.core.Filter.

filter_ref(conditions), filter_ref(filter): Same as the above two, except that they return a table with references to instances instead of their copies.

filter_bool(conditions), filter_bool(filter): Return a list of bools denoting which data instances are accepted by the conditions or the filter.

translate(domain)

Return a new data table in which data instances are translated into the given domain.

Parameters:	domain (`Orange.data.Domain`) – new domain
Return type:	`Orange.data.Table`

translate(variables[, keep_metas])

Similar to above, except that the domain is given by a list of features. If keep_metas is True, the new data instances will also have all the meta attributes from the original domain.

Parameters:	variables (list) – variables for the new data
Return type:	`Orange.data.Table`

to_numpy(content, weightID, multinominal)

Convert a data table to numpy array. Raises an exception if the data contains undefined values. to_numpyMA converts to a masked array where the mask denotes the defined values. (For conversion from numpy, see the constructor.)

The function returns a tuple with the array and, depending on arguments, some vectors. The argument content is a string split into two parts by a slash. The part to the left of the slash describes the content of the array; the part to the right lists the vectors. The content is described with the following characters:

a: features (without the class); can only appear on the left
A: like a, but raises exception if there are no features
c: class value represented as an index of the value (0, 1, 2...); if the data has no class, the column is omitted (if c is to the left of the slash) or the tuple will contain None instead of the vector.
C: like c, but raises exception if the data has no class
m: like c, but one column for each target variable in a multi-target domain.
M: synonymous to m.
w: instance weight; as with c, the column is omitted or None is returned instead of the vector if the argument weightID is missing.
W: instance weight; raise an exception if weightID is missing.
0: a vector of zeros
1: a vector of ones

The default content is a/cw: an array with feature values and separate vectors with classes and weights. Specifying an empty string has the same effect. If the elements to the right of the slash repeat, the function returns the same Python object, e.g. in acc000/cwww the three weight vectors are one and the same Python object, so modifying one will change all three of them.
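How a content string breaks down can be sketched with a small parser. This is not part of Orange; the helpers `split_content` and `result_length` are hypothetical and only mirror the rules stated above (empty string means a/cw, features may appear only left of the slash, one returned vector per character right of the slash).

```python
def split_content(content=""):
    """Split a content string into its array part and its vector part."""
    if not content:
        content = "a/cw"  # the documented default
    # Everything before the slash describes array columns, everything
    # after it lists separate vectors; without a slash there are no vectors.
    array_part, _, vector_part = content.partition("/")
    if "a" in vector_part.lower():
        raise ValueError("features can only appear left of the slash")
    return array_part, vector_part


def result_length(content=""):
    """Number of elements in the returned tuple: the array plus the vectors."""
    array_part, vector_part = split_content(content)
    return 1 + len(vector_part)
```

Under these rules the default call yields a tuple of three elements (array, classes, weights), while a content string without a slash, such as ca1cc0, yields a one-element tuple holding only the array.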

This is the default behaviour on data set iris with 150 data instances described by four features and a class value:

>>> data = Orange.data.Table("iris")
>>> a, c, w = data.to_numpy()
>>> a.shape
(150, 4)
>>> c.shape
(150,)
>>> print w
None
>>> a[0]
array([ 5.0999999 ,  3.5       ,  1.39999998,  0.2       ])
>>> c[0]
0.0

For a more complicated example, the array will contain a column with class, features, a vector of ones, two vectors with classes and another vector of zeroes:

>>> a, = data.to_numpy("ca1cc0")
>>> a[0]
array([ 0., 5.0999999, 3.5       , 1.39999998, 0.2       , 1., 0., 0., 0.])
>>> a[130]
array([ 2., 7.4000001, 2.79999995, 6.0999999 , 1.89999998, 1., 2., 2., 0.])
>>> c[120]
2.0

The third argument specifies the treatment of non-continuous non-binary values (binary values are always translated to 0.0 or 1.0). The argument’s value can be Orange.data.Table.Multinomial_Ignore (such features are omitted), Orange.data.Table.Multinomial_AsOrdinal (the values’ indices are treated as continuous numbers) or Orange.data.Table.Multinomial_Error (an exception is raised if such features are encountered). Default treatment is Orange.data.Table.Multinomial_AsOrdinal.

When the class attribute is discrete and has more than two values, an exception is raised unless multinomial attributes are treated as ordinal. More options for treating multinomial values are available in Orange.data.continuization.

to_numpyMA(content, weightID, multinominal): Similar to to_numpy except that it returns a masked array with mask representing the (un)defined values.

checksum()

Return a CRC32 computed over all discrete and continuous features and class attributes of all data instances.

Return type:	int
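The idea of a CRC32 over all values can be sketched with the standard library. This does not reproduce Orange's checksum: the byte layout (each value packed as a little-endian double) and the function name `table_checksum` are assumptions of this sketch, so the numbers it produces will not match those returned by checksum().

```python
import struct
import zlib


def table_checksum(rows):
    """CRC32 over all values of all rows, accumulated value by value."""
    crc = 0
    for row in rows:
        for value in row:
            # Pack every value as a little-endian double (an assumption),
            # then continue the running CRC with those bytes.
            crc = zlib.crc32(struct.pack("<d", float(value)), crc)
    # Mask to an unsigned 32-bit integer on every Python version.
    return crc & 0xFFFFFFFF
```

Two tables with identical values give the same result, so such a checksum is a cheap way to detect that data has changed, something the version attribute does not do.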

has_missing_values(): Return True if any data instance has a missing value. Meta attributes are not checked.

has_missing_classes(): Return True if any instance is missing the class value.

random_instance(): Return a random instance from the table. The data table’s random_generator is used; it is initially seeded with 0, so results are deterministic.
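The lazily created, seeded generator can be sketched as follows. This is an analogy in plain Python, not Orange's implementation: the class `SketchTable` is hypothetical, and Python's `random.Random` is a different generator from the one Orange uses, so the actual instances drawn will differ.

```python
import random


class SketchTable(list):
    """A list with a lazily created, seeded random generator."""

    random_generator = None

    def random_instance(self):
        if self.random_generator is None:
            # Created on first use with seed 0 and stored for later calls.
            self.random_generator = random.Random(0)
        return self.random_generator.choice(self)


first = SketchTable(range(100))
second = SketchTable(range(100))
draws_first = [first.random_instance() for _ in range(5)]
draws_second = [second.random_instance() for _ in range(5)]
```

Both tables produce the same sequence of draws because each starts from seed 0. Assigning a differently seeded generator to `random_generator` before the first call would give a different sequence.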

remove_duplicates([weightID])

Remove duplicates of data instances. If weightID is given, a meta attribute is added which contains the number of instances merged into each new instance.

Parameters:	weightID (int) – id for meta attribute with weight
Return type:	None
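The effect of merging duplicates while recording their counts can be sketched in plain Python. Unlike the method described above, which modifies the table in place and returns None, this hypothetical helper `unique_with_counts` returns a new list; keeping the first-seen order is also an assumption of the sketch.

```python
def unique_with_counts(rows):
    """Return unique rows in first-seen order, each with its multiplicity."""
    unique = []
    counts = {}
    for row in rows:
        key = tuple(row)
        if key not in counts:
            unique.append(list(row))
            counts[key] = 0
        # The count plays the role of the weight stored under weightID.
        counts[key] += 1
    return [(row, counts[tuple(row)]) for row in unique]


rows_with_duplicates = [[1, "x"], [2, "y"], [1, "x"], [1, "x"]]
deduplicated = unique_with_counts(rows_with_duplicates)
```

Here `deduplicated` is `[([1, "x"], 3), ([2, "y"], 1)]`: three identical rows collapse into one with weight 3.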

sort([variables])

Sort the data table. The argument lists the variables ordered by importance. If omitted, the order from the domain is used. Values of discrete features are not ordered alphabetically but according to the order in Orange.feature.Discrete.values.

This sorts the data from the bridges data set by the lengths and years of their construction:

data.sort(["LENGTH", "ERECTED"])
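The ordering rule for discrete values can be sketched in plain Python. This is not Orange's sort: rows are modelled as dictionaries, and the helper `sort_rows` and the feature names in the example are hypothetical.

```python
def sort_rows(rows, keys, value_orders):
    """Sort dict rows by the given keys, most important key first."""
    def sort_key(row):
        parts = []
        for name in keys:
            order = value_orders.get(name)
            value = row[name]
            # Discrete values sort by their position in the declared list
            # of values, not alphabetically; other values sort as they are.
            parts.append(order.index(value) if order else value)
        return tuple(parts)
    return sorted(rows, key=sort_key)


bridges = [
    {"size": "small", "year": 1900},
    {"size": "large", "year": 1850},
    {"size": "medium", "year": 1870},
    {"size": "small", "year": 1880},
]
declared = {"size": ["small", "medium", "large"]}
ordered = sort_rows(bridges, ["size", "year"], declared)
```

The result lists both small bridges first (1880, then 1900), followed by the medium and the large one. An alphabetical sort would have put "large" first.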

shuffle(): Randomly shuffle the data instances.

add_meta_attribute(attr[, value=1]): Add a meta value to all data instances. The first argument can be an integer id, or a string or a variable descriptor of a meta attribute registered in the domain.

remove_meta_attribute(attr): Remove a meta attribute from all data instances.
