Data format

From JCLEC wiki

JCLEC requires a file that specifies the dataset; this page describes the ARFF format. An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. The ARFF format was developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato.

The ARFF Header Section

This section of the file contains the relation name and attribute declarations.

The @relation declaration gives the name of the relation, where <relation-name> is a string without spaces:

@relation <relation-name>
  

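The @attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data set has its own @attribute statement, which uniquely defines the name of that attribute and its data type. The order of the statements determines the position of each attribute's value in the instance lines of the data section.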

@attribute <attribute-name> <datatype>
  

where the <attribute-name> is a string without spaces and the <datatype> can be any of the following types, which follow the ARFF format used by Weka:

  • numeric: Numeric attributes can be real or integer numbers.
  • string: String attributes allow us to create attributes containing arbitrary textual values.
  • nominal: Nominal attributes have no type keyword; they are declared by providing a <nominal-specification> listing the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
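
The declaration forms above can be illustrated with a small parser. This is only a sketch in Python, not part of JCLEC: the function name parse_attribute is hypothetical, it handles only the numeric, string and nominal forms listed above, and it assumes that keywords are matched case-insensitively and that names and nominal values are unquoted.

```python
import re

def parse_attribute(line):
    """Parse one '@attribute <attribute-name> <datatype>' line.

    Returns (name, datatype), where datatype is 'numeric', 'string',
    or a list of allowed values for a nominal attribute.
    Hypothetical helper for illustration only.
    """
    match = re.match(r"@attribute\s+(\S+)\s+(.+)$", line.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"not an @attribute declaration: {line!r}")
    name, spec = match.group(1), match.group(2).strip()
    if spec.startswith("{") and spec.endswith("}"):
        # Nominal specification: split the brace-delimited list of values.
        return name, [value.strip() for value in spec[1:-1].split(",")]
    kind = spec.lower()
    if kind not in ("numeric", "string"):
        raise ValueError(f"unsupported datatype: {spec!r}")
    return name, kind
```

For instance, parse_attribute("@attribute feature4 {red, green, black}") yields the attribute name together with the list of its three allowed values.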

The ARFF Data Section

This section contains the data declaration line and the actual instance lines.

The @data Declaration is a single line denoting the start of the data segment in the file. The format is:

@data
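
Because @data occupies a line of its own, a reader can use it to separate the header from the instance lines. The following Python sketch shows one way to do that; split_sections is a hypothetical name, and the code assumes the keyword is matched case-insensitively and that blank lines can be ignored.

```python
def split_sections(lines):
    """Return (header_lines, data_lines), split at the '@data' line.

    Hypothetical helper for illustration only; blank lines are dropped.
    """
    stripped = [line.strip() for line in lines]
    for index, line in enumerate(stripped):
        if line.lower() == "@data":
            header = [entry for entry in stripped[:index] if entry]
            data = [entry for entry in stripped[index + 1:] if entry]
            return header, data
    raise ValueError("no @data declaration found")
```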

ARFF Data example

Here is an example of an ARFF file with 3 numeric attributes and 1 nominal attribute:

@relation Example
@attribute feature1 numeric
@attribute feature2 numeric
@attribute feature3 numeric
@attribute feature4 {red, green, black}
@data
2.1,2.3,0.5,green
2.3,5.6,1.4,red
....
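
To make the relationship between the header and the instance lines concrete, the following Python sketch reads the two complete instances of the example and checks each value against its attribute declaration. It is not JCLEC's own reader: load_arff and EXAMPLE are hypothetical names, and the code assumes unquoted, comma-separated values with no comments or missing values.

```python
def load_arff(text):
    """Parse ARFF text into (relation, attributes, rows).

    Hypothetical helper for illustration only. Each entry of attributes
    is (name, kind), where kind is 'numeric', 'string', or a list of
    nominal values.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        lower = line.lower()
        if in_data:
            values = [value.strip() for value in line.split(",")]
            if len(values) != len(attributes):
                raise ValueError(f"expected {len(attributes)} values: {line!r}")
            row = []
            for (name, kind), value in zip(attributes, values):
                if kind == "numeric":
                    row.append(float(value))
                elif isinstance(kind, list) and value not in kind:
                    raise ValueError(f"{value!r} is not a valid value for {name}")
                else:
                    row.append(value)
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [value.strip() for value in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower == "@data":
            in_data = True
    return relation, attributes, rows

# The example file above, limited to its two complete instance lines.
EXAMPLE = """\
@relation Example
@attribute feature1 numeric
@attribute feature2 numeric
@attribute feature3 numeric
@attribute feature4 {red, green, black}
@data
2.1,2.3,0.5,green
2.3,5.6,1.4,red
"""

relation, attributes, rows = load_arff(EXAMPLE)
print(relation, rows)
```

Each instance line must supply one value per declared attribute, in declaration order, so a line with a missing value or a colour outside {red, green, black} is rejected by this sketch.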