Loading and saving data

Loading and saving data¶

Orange’s tab-delimited format¶

Orange comes with its native, tab-delimited data format. This is a standard data format with a non-standard 3-line header that specifies the type of features and optional flags along with the feature names. The first header line contains feature (column) names, the second line specifies their types, and the third line provides optional flags. Default file extension of Orange’s data files is ”.tab”.

Example of iris dataset in Orange’s format (iris.tab)

sepal length	sepal width	petal length	petal width	iris
c	c	c	c	d
				class
5.1	3.5	1.4	0.2	Iris-setosa
4.9	3.0	1.4	0.2	Iris-setosa
4.7	3.2	1.3	0.2	Iris-setosa
4.6	3.1	1.5	0.2	Iris-setosa

Three header lines thus include:

feature names (first header line). Feature names can include any combination of characters.
feature types (second header line) report on type of the feature. The type has to be provided for all features, and can be any of the following:
- discrete (or d) - imported as Orange.feature.Discrete
- continuous (or c) - imported as Orange.feature.Continuous
- text (or string, s) - imported as Orange.feature.String
- basket - used for storing sparse data. More on basket formats in a dedicated section.
optional flags (third header line) that are either empty (no flag) or of the following:
- ignore (or i) - feature will not be imported
- class (or c) - feature will be imported as class variable. Only one feature can be marked as class.
- multiclass - feature is one of multiple classes. Data can have both, multiple classes and an ordinary class.
- meta (or m) - feature will be imported as a meta attribute.

Simplified header¶

Instead of a three-line header a single-line header can be used with tags listed before feature names and separated from a feature name with a hash (“#”) sign. Supported tags are:

c for class feature

i for feature to be ignored

m for the meta attribute

C for continuous-typed feature

D for discrete feature

S for string

Files with simplified header should use extension ”.csv”.

An example data file in this format is shown below:

C#sepal length	iC#sepal width	mC#petal length	mC#petal width	cD#iris
5.1	3.5	1.4	0.2	Iris-setosa
4.9	3.0	1.4	0.2	Iris-setosa
4.7	3.2	1.3	0.2	Iris-setosa

Notice that we have ignored the second column (sepal width), and declared third and fourth column (features on petal leaves) as meta:

>>> import Orange.data
>>> data = Orange.data.Table("abridged-iris.csv")
>>> data[0]
[5.1, 'Iris-setosa'], {"petal length":1.4, "petal width":0.2}
>>> data.domain
[sepal length, iris], {-3:petal length, -4:petal width}

Baskets¶

Baskets can be used for storing sparse data in tab delimited files. They were specifically designed for text mining needs. If text mining and sparse data is not your business, you can skip this section.

Baskets are given as a list of space-separated <name>=<value> atoms. A continuous meta attribute named <name> will be created and added to the domain as optional if it is not already there. A meta value for that variable will be added to the example. If the value is 1, you can omit the =<value> part.

It is not possible to put meta attributes of other types than continuous in the basket.

A tab delimited file with a basket can look like this:

K       Ca      b_foo     Ba  y
c       c       basket    c   c
        meta              i   class
0.06    8.75    a b a c   0   1
0.48            b=2 d     0   1
0.39    7.78              0   1
0.57    8.22    c=13      0   1

These are the examples read from such a file:

[0.06, 1], {"Ca":8.75, "a":2.000, "b":1.000, "c":1.000}
[0.48, 1], {"Ca":?, "b":2.000, "d":1.000}
[0.39, 1], {"Ca":7.78}
[0.57, 1], {"Ca":8.22, "c":13.000}

It is recommended to have the basket as the last column, especially if it contains a lot of data.

Note a few things. The basket column’s name, b_foo, is not used. In the first example, the value of a is 2 since it appears twice. The ordinary meta attribute, Ca, appears in all examples, even in those where its value is undefined. Meta attributes from the basket appear only where they are defined. This is due to the different nature of these meta attributes: Ca is required while the others are optional.

>>> d.domain.getmetas()
{-6: FloatVariable 'd', -22: FloatVariable 'Ca', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}
>>> d.domain.getmetas(False)
{-22: FloatVariable 'Ca'}
>>> d.domain.getmetas(True)
{-6: FloatVariable 'd', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}

To fully understand all this, you should read the documentation on meta attributes in Domain and on the basket file format (a simple format that is limited to baskets only).

Basket Format¶

Basket files (.basket) are suitable for representing sparse data. Each example is represented by a line in the file. The line is written as a comma-separated list of name-value pairs. Here’s an example of such file.

nobody, expects, the, Spanish, Inquisition=5
our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
to, the, Pope, and, nice, red, uniforms, oh damn

The file contains four examples. The first examples has five attributes defined, “nobody”, “expects”, “the”, “Spanish” and “Inquisition”; the first four have (the default) value of 1.0 and the last has a value of 5.0.

The attributes that appear in the domain aren’t defined in any headers or even separate files, as with other formats supported by Orange.

If attribute appears more than once, its values are added. For instance, the value of attribute “surprise” in the second examples is 6.0 and the value of “fear” is 2.0; the former appears three times with values of 3.0, 2.0 and 1.0, and the latter appears twice with value of 1.0.

All attributes are loaded as optional meta-attributes, so zero values don’t take any memory (unless they are given, but initialized to zero). See also section on meta attributes in the reference for domain descriptors.

Notice that at the time of writing this reference only association rules can directly use examples presented in the basket format.

Other supported data formats¶

Orange can import data from csv or tab delimited files where the first line contains attribute names followed by lines containing data. For such files, orange tries to guess the type of features and treats the right-most column as the class variable. If feature types are known in advance, special orange tab format should be used.