Data Sets


Data sets in Classer format:

Three data sets are provided already encoded in the format expected by Classer. These data sets are used in all the examples and tutorials, and can be downloaded here:


The first two data sets are ready to use after unzipping. The Boston testbed must first be expanded by running the batch script in

d_BostonunWrap.bat

Data Set: Circle-in-the-Square (CIS)

This is a simple data set with just two classes and two dimensions, describing whether or not points fall within a circle set within a square.



Data Details:
  • 2 dimensions, x and y.
  • 2 classes, Out and In.
  • 1000 samples.

All points fall within the unit square; a centered circle with area exactly half that of the square partitions the two classes. When samples are drawn at random from the square, on average half are in the circle, and half are outside.


Data obeying these rules can be generated at will; the simulations described in this document use the 1000-point data set shown in the figure at right.
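As a minimal sketch of how such data could be generated, the Python below draws uniform points in the unit square and labels them against a centered circle whose radius follows from the half-area rule (pi * r^2 = 0.5). The function names and the seed are illustrative assumptions, and the sketch does not write the Classer file format.

```python
import math
import random

# Radius of a circle whose area is half that of the unit square: pi * r^2 = 0.5
RADIUS = math.sqrt(0.5 / math.pi)
CENTER = (0.5, 0.5)


def label(x, y):
    """Return 'In' if (x, y) lies inside the centered circle, else 'Out'."""
    dx, dy = x - CENTER[0], y - CENTER[1]
    return "In" if dx * dx + dy * dy <= RADIUS * RADIUS else "Out"


def generate_cis(n, seed=0):
    """Draw n points uniformly from the unit square and label each one."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        samples.append((x, y, label(x, y)))
    return samples


if __name__ == "__main__":
    data = generate_cis(1000)
    n_in = sum(1 for _, _, c in data if c == "In")
    print(f"radius = {RADIUS:.4f}, In = {n_in}, Out = {len(data) - n_in}")
```

The radius works out to roughly 0.399, so the circle fits inside the square, and the In and Out counts should each be close to half of the sample size.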


Data Set: Frey & Slate Letter data

This data set describes statistical attributes of 20,000 digitized pictures of letters, and was used to study machine learning using Holland-style adaptive classifiers (Frey & Slate, 1991). Our copy was obtained from the UCI repository (http://archive.ics.uci.edu/ml/).



Data Details:
  • 16 dimensions (listed below).
  • 26 classes representing letters of the alphabet (A-Z).
  • 20,000 samples, divided into a 16,000 sample training set and 4,000 sample test set.

Frey & Slate Letter data: list of input features
x-box - horizontal position of box
y-box - vertical position of box
width - width of box
high - height of box
onpix - total # on pixels
x-bar - mean x of on pixels in box
y-bar - mean y of on pixels in box
x2bar - mean x variance
y2bar - mean y variance
xybar - mean x y correlation
x2ybr - mean of x * x * y
xy2br - mean of x * y * y
x-ege - mean edge count left to right
xegvy - correlation of x-ege with y
y-ege - mean edge count bottom to top
yegvx - correlation of y-ege with x

The 20,000-point data set is split into a 16,000-point training set and a 4,000-point test set. Each data point is derived from a pixelated image of a letter. As stated in the description accompanying the data set:

"The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15. We typically train on the first 16000 items and then use the resulting model to predict the letter category for the remaining 4000."
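The train/test split described in the quotation can be sketched as follows. This is an illustration only: it assumes the comma-separated layout of the UCI file (the letter label first, followed by the 16 integer attributes), which may differ from the Classer-encoded files, and the function names are hypothetical.

```python
import csv
import io


def parse_letter_rows(lines):
    """Parse rows of the form 'T,2,8,3,...' into (label, [16 ints]) pairs."""
    rows = []
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        label, features = fields[0], [int(v) for v in fields[1:]]
        if len(features) != 16:
            raise ValueError(f"expected 16 features, got {len(features)}")
        rows.append((label, features))
    return rows


def split_train_test(rows, n_train=16000):
    """Keep file order: the first n_train rows train, the rest test."""
    return rows[:n_train], rows[n_train:]


if __name__ == "__main__":
    # Two made-up rows in the assumed layout, only to exercise the functions.
    sample = io.StringIO(
        "A,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0\n"
        "B,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15\n"
    )
    train, test = split_train_test(parse_letter_rows(sample), n_train=1)
    print(len(train), len(test))
```

Applied to the full 20,000-row file with the default n_train, the split yields the 16,000-sample training set and 4,000-sample test set without shuffling.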



Data Set: Boston Remote Sensing Testbed

The Boston remote sensing testbed describes a remotely sensed area (data from a Landsat 7 Thematic Mapper satellite) that is 360 pixels wide by 600 pixels high, covering 5.4 km x 9 km on the ground.


Data Details:
  • 41 dimensions (more details below)
  • 8 classes (Beach, Ocean, Ice, River, Road, Park, Residential, Industrial).
  • 216,000 samples total, of which 29,003 are labeled

41 layers of data are available for each pixel (lower resolution bands were upsampled to 15 m):


  • 6 Thematic Mapper (TM) bands at 30 m resolution.
  • 2 thermal bands at 60 m resolution.
  • 1 panchromatic band at 15 m resolution.
  • 32 derived bands representing local contrast, color and texture.
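The figures quoted for the testbed can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the text; the variable names are illustrative, and the upsampling factor is taken to be the plain ratio of resolutions, since the interpolation method is not stated.

```python
# Image size and common pixel size, as quoted in the text
WIDTH_PX, HEIGHT_PX = 360, 600
PIXEL_M = 15  # resolution after upsampling, in meters

width_km = WIDTH_PX * PIXEL_M / 1000
height_km = HEIGHT_PX * PIXEL_M / 1000
total_pixels = WIDTH_PX * HEIGHT_PX

# Number of layers per pixel, grouped as in the list above
layers = {"TM": 6, "thermal": 2, "panchromatic": 1, "derived": 32}
total_layers = sum(layers.values())

# Linear factor needed to bring each native resolution (meters) to 15 m
upsample = {native_m: native_m // PIXEL_M for native_m in (30, 60, 15)}

# Labeled (ground truth) pixels, as quoted in the text
labeled = 29003
labeled_fraction = labeled / total_pixels

if __name__ == "__main__":
    print(f"extent: {width_km} km x {height_km} km, {total_pixels} pixels")
    print(f"layers per pixel: {total_layers}")
    print(f"upsampling factors: {upsample}")
    print(f"labeled fraction: {labeled_fraction:.1%}")
```

The totals agree with the text: 216,000 pixels, 41 layers, and a 5.4 km x 9 km extent, with roughly 13% of the pixels carrying a ground truth label.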

Of the 216,000 points in the image, 29,003 have been assigned one of the eight labels (i.e., represent the ground truth information). As shown in the figure, the image is divided into four vertical strips.



The distribution of ground truth from strip to strip is far from uniform, as shown in the following table.



Boston Data - Ground truth class distributions
Class        Strip 1  Strip 2  Strip 3  Strip 4  Totals
Beach              0       67      313      118     498
Ocean              0      552     1280    19919   22101
Ice              144      589      146        0    1169
River            559      323      192      845    1919
Road              75      131       58        4     268
Park             182      152      418      117     555
Residential      419      249      400      243    1311
Industrial       294      431        0      107     832
Totals          1673     3170     2807    21353   29003
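To put a number on how uneven the ground truth is, the short sketch below computes each strip's share of the labeled pixels from the per-strip totals in the table, along with the share of Strip 4 taken up by Ocean. The variable names are illustrative; the counts are copied from the table.

```python
# Per-strip labeled-pixel totals, copied from the "Totals" row of the table
strip_totals = {"Strip 1": 1673, "Strip 2": 3170, "Strip 3": 2807, "Strip 4": 21353}
ocean_strip4 = 19919  # Ocean count in Strip 4, from the table

labeled = sum(strip_totals.values())
shares = {strip: n / labeled for strip, n in strip_totals.items()}
ocean_share_strip4 = ocean_strip4 / strip_totals["Strip 4"]

if __name__ == "__main__":
    for strip, share in shares.items():
        print(f"{strip}: {share:.1%} of labeled pixels")
    print(f"Ocean within Strip 4: {ocean_share_strip4:.1%}")
```

Strip 4 alone holds nearly three quarters of the labeled pixels, and more than nine tenths of those are Ocean, so any evaluation that splits the data by strip sees very different class mixes in each piece.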