Data Sets

Data sets in Classer format:

Three data sets are provided already encoded in the format expected by Classer. These data sets are used in all the examples and tutorials, and can be downloaded here:

d_CIS - Circle-in-the-Square data
d_Letters - Letter Recognition (Frey & Slate, from the UCI Machine Learning Repository)
d_Boston - Boston Testbed (must be preprocessed before use)

The first two data sets are ready to use after unzipping. The Boston testbed must first be expanded by running the batch script in

d_BostonunWrap.bat

Data Set: Circle-in-the-Square (CIS)

This is a simple data set with just two classes and two dimensions, describing whether or not points fall within a circle set within a square.

Data Details:

2 dimensions, x and y.
2 classes, Out and In.
1000 samples.

All points fall within the unit square; a centered circle with area exactly half that of the square partitions the two classes. When samples are drawn at random from the square, on average half are in the circle, and half are outside.

Data obeying these rules can be generated at will; simulations described in this document use the thousand point data set shown in the figure at right.
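As a minimal sketch of how such data could be generated (this is not the script used to produce the distributed data set), the circle radius follows from requiring pi * r^2 = 0.5, which gives r = sqrt(1 / (2 * pi)), or about 0.399. The function names, the seed, and the choice to count boundary points as In are assumptions made for illustration.

```python
import math
import random

# Radius of a centered circle whose area is half that of the unit square:
# pi * r**2 = 0.5  =>  r = sqrt(1 / (2 * pi)), roughly 0.399.
RADIUS = math.sqrt(0.5 / math.pi)


def label_point(x, y):
    """Return "In" if (x, y) lies inside the centered circle, else "Out".

    Points exactly on the boundary are counted as "In" (an assumption).
    """
    inside = (x - 0.5) ** 2 + (y - 0.5) ** 2 <= RADIUS ** 2
    return "In" if inside else "Out"


def generate_cis(n_samples, seed=0):
    """Draw n_samples points uniformly from the unit square and label them."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        samples.append((x, y, label_point(x, y)))
    return samples


if __name__ == "__main__":
    data = generate_cis(1000)
    n_in = sum(1 for _, _, label in data if label == "In")
    print(f"radius = {RADIUS:.4f}, In = {n_in}, Out = {len(data) - n_in}")
```

With 1000 uniformly drawn samples, the In and Out counts should each be close to 500, though any particular draw will deviate somewhat.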

Data Set: Frey & Slate Letter data

This data set describes statistical attributes of 20,000 digitized pictures of letters, and was used to study machine learning using Holland-style adaptive classifiers (Frey & Slate, 1991). Our copy was obtained from the UCI repository (http://archive.ics.uci.edu/ml/).

Data Details:

16 dimensions (listed below).
26 classes representing letters of the alphabet (A-Z).
20,000 samples, divided into a 16,000 sample training set and 4,000 sample test set.

Frey & Slate Letter data: List of input features
x-box - horizontal position of box
y-box - vertical position of box
width - width of box
high - height of box
onpix - total # on pixels
x-bar - mean x of on pixels in box
y-bar - mean y of on pixels in box
x2bar - mean x variance
y2bar - mean y variance
xybar - mean x y correlation
x2ybr - mean of x * x * y
xy2br - mean of x * y * y
x-ege - mean edge count left to right
xegvy - correlation of x-ege with y
y-ege - mean edge count bottom to top
yegvx - correlation of y-ege with x

Each data point is derived from a pixelated image of a letter. As stated in the description accompanying the data set: "The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15. We typically train on the first 16000 items and then use the resulting model to predict the letter category for the remaining 4000."
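The split described above can be sketched as follows. The sketch assumes the raw UCI text layout, in which each line holds the letter followed by the 16 integer attributes, separated by commas; the Classer-encoded files may be laid out differently. The names parse_record and split_train_test are hypothetical helpers, not part of Classer.

```python
FEATURE_NAMES = [
    "x-box", "y-box", "width", "high", "onpix", "x-bar", "y-bar", "x2bar",
    "y2bar", "xybar", "x2ybr", "xy2br", "x-ege", "xegvy", "y-ege", "yegvx",
]


def parse_record(line):
    """Parse one 'LETTER,v1,...,v16' line into (label, list of 16 ints).

    Assumes the raw UCI comma-separated layout, not the Classer encoding.
    """
    fields = line.strip().split(",")
    label, values = fields[0], [int(v) for v in fields[1:]]
    if len(values) != len(FEATURE_NAMES):
        raise ValueError(f"expected 16 features, got {len(values)}")
    if not all(0 <= v <= 15 for v in values):
        raise ValueError("feature values must lie in 0..15")
    return label, values


def split_train_test(records, n_train=16000):
    """Use the first n_train records for training and the rest for testing."""
    return records[:n_train], records[n_train:]


if __name__ == "__main__":
    label, values = parse_record("T,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8")
    print(label, dict(zip(FEATURE_NAMES, values)))
```

Taking the first 16,000 records in file order, rather than a random subset, follows the convention quoted from the data set description.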

Data Set: Boston Remote Sensing Testbed

The Boston remote sensing testbed describes a remotely sensed area (data from a Landsat 7 Thematic Mapper satellite), 360 pixels wide by 600 pixels high. At 15 m per pixel, this covers 5.4 km x 9 km.

Data Details:

41 dimensions (more details below).
8 classes (Beach, Ocean, Ice, River, Road, Park, Residential, Industrial).
216,000 samples total, of which 29,003 are labeled.
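The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check, assuming a ground size of 15 m per pixel (the resolution to which the coarser bands were upsampled); the variable names are illustrative.

```python
WIDTH_PX, HEIGHT_PX = 360, 600
PIXEL_SIZE_M = 15   # assumed ground size of one pixel, in meters
N_LABELED = 29003   # labeled samples, from the data details

n_pixels = WIDTH_PX * HEIGHT_PX                 # total samples in the image
width_km = WIDTH_PX * PIXEL_SIZE_M / 1000       # east-west extent
height_km = HEIGHT_PX * PIXEL_SIZE_M / 1000     # north-south extent
labeled_fraction = N_LABELED / n_pixels         # share of pixels with ground truth

print(n_pixels, width_km, height_km)            # 216000 5.4 9.0
print(f"{labeled_fraction:.1%} of pixels are labeled")
```

Only about 13% of the pixels carry a ground truth label, so most of the image is available only as unlabeled input.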

41 layers of data are available for each pixel (lower resolution bands were upsampled to 15 m):

6 Thematic Mapper (TM) bands at 30 m resolution.
2 thermal bands at 60 m resolution.
1 panchromatic band at 15 m resolution.
32 derived bands representing local contrast, color and texture.
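The source does not say which interpolation method was used to bring the coarser bands to 15 m. As a hedged illustration only, the sketch below uses nearest-neighbor replication, under which a 30 m pixel becomes a 2 x 2 block and a 60 m pixel a 4 x 4 block of 15 m pixels. The function upsample_nearest is a hypothetical helper, not part of Classer.

```python
TARGET_M = 15

# Replication factor needed to bring each native resolution (meters) to 15 m.
FACTORS = {native: native // TARGET_M for native in (15, 30, 60)}

# Layer count from the list above: TM + thermal + panchromatic + derived.
N_LAYERS = 6 + 2 + 1 + 32


def upsample_nearest(band, factor):
    """Repeat every pixel of a 2-D band `factor` times along both axes.

    Nearest-neighbor replication is an assumption made for illustration;
    the testbed may have been produced with a different method.
    """
    out = []
    for row in band:
        wide_row = [value for value in row for _ in range(factor)]
        out.extend(list(wide_row) for _ in range(factor))
    return out


if __name__ == "__main__":
    coarse = [[1, 2], [3, 4]]                 # a tiny 30 m band
    for row in upsample_nearest(coarse, FACTORS[30]):
        print(row)
```

Replication leaves the original pixel values unchanged, which keeps the upsampled band easy to compare with its source.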

Of the 216,000 points in the image, 29,003 have been assigned one of the eight labels (i.e., represent the ground truth information). As shown in the figure, the image is divided into four vertical strips.

The distribution of ground truth from strip to strip is far from uniform, as shown in the following table.

Boston Data - Ground Truth Class Distributions

	Strip 1	Strip 2	Strip 3	Strip 4	Totals
Beach	0	67	313	118	498
Ocean	0	552	1280	19919	22101
Ice	144	589	146	0	1169
River	559	323	192	845	1919
Road	75	131	58	4	268
Park	182	152	418	117	555
Residential	419	249	400	243	1311
Industrial	294	431	0	107	832
Totals	1673	3170	2807	21353	29003
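To make the imbalance concrete, the sketch below turns the Strip 1 and Strip 4 columns of the table into per-class fractions. Only those two columns are used, and the helper name class_fractions is an assumption made for illustration.

```python
CLASSES = ["Beach", "Ocean", "Ice", "River", "Road", "Park",
           "Residential", "Industrial"]

# Labeled pixel counts per class, copied from the Strip 1 and Strip 4 columns.
STRIP_1 = [0, 0, 144, 559, 75, 182, 419, 294]
STRIP_4 = [118, 19919, 0, 845, 4, 117, 243, 107]


def class_fractions(counts):
    """Return each class's share of the labeled pixels in one strip."""
    total = sum(counts)
    return {name: count / total for name, count in zip(CLASSES, counts)}


if __name__ == "__main__":
    for strip_name, counts in (("Strip 1", STRIP_1), ("Strip 4", STRIP_4)):
        fractions = class_fractions(counts)
        top = max(fractions, key=fractions.get)
        print(f"{strip_name}: largest class is {top} ({fractions[top]:.1%})")
```

Strip 4 is dominated by Ocean, while Strip 1 contains no Ocean or Beach pixels at all, so a classifier trained on a single strip may never see some of the classes it is later tested on.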
