Data sets in Classer format:
Three data sets are provided already encoded in the format expected by Classer. These data sets are used in all the examples and tutorials, and can be downloaded here:
The first two data sets are ready to use after unzipping. The Boston testbed must first be expanded by running the batch script in
d_BostonunWrap.bat
The first data set is a simple one with just two classes and two dimensions, describing whether or not points fall within a circle set within a square.
All points fall within the unit square; a centered circle with area exactly half that of the square partitions the two classes. When samples are drawn at random from the square, on average half are in the circle, and half are outside.
Data obeying these rules can be generated at will; simulations described in this document use the thousand point data set shown in the figure at right.
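As a minimal sketch of how such data could be generated, the following Python draws points uniformly from the unit square and labels each one by whether it falls inside a centered circle of area 0.5 (radius sqrt(0.5/pi), about 0.399). The function names and the 0/1 label encoding are assumptions made for illustration; they are not part of Classer, and the output is not written in the Classer file format.

```python
import math
import random

# Radius of a circle whose area is half that of the unit square: pi * r^2 = 0.5
RADIUS = math.sqrt(0.5 / math.pi)
CENTER = (0.5, 0.5)


def in_circle(x, y):
    """Return True if (x, y) lies inside the centered circle."""
    return (x - CENTER[0]) ** 2 + (y - CENTER[1]) ** 2 <= RADIUS ** 2


def generate_points(n, seed=None):
    """Draw n points uniformly from the unit square and label each one.

    The label encoding is hypothetical: 1 = inside the circle, 0 = outside.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        points.append((x, y, 1 if in_circle(x, y) else 0))
    return points


points = generate_points(1000, seed=0)
inside = sum(label for _, _, label in points)
print(f"{inside} of {len(points)} points fall inside the circle")
```

With a thousand samples the count inside the circle should land near 500, varying from seed to seed.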
The second data set describes statistical attributes of 20,000 digitized pictures of letters, and was used to study machine learning using Holland-style adaptive classifiers (Frey & Slate, 1991). Our copy was obtained from the UCI repository (http://archive.ics.uci.edu/ml/).
The 20,000 points are split into two pieces: one with 16,000 points, and one with 4,000 points. Each data point is derived from a pixellated image of a letter. As stated in the description accompanying the data set: "The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15. We typically train on the first 16000 items and then use the resulting model to predict the letter category for the remaining 4000."
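The train/test split described in the quotation can be sketched as follows. This assumes records in a comma-separated layout (a letter label followed by 16 integers), which is an assumption about the raw source file and not the Classer encoding; the sample records below are made up for illustration, and the function names are hypothetical.

```python
def parse_record(line):
    """Parse one comma-separated record: a letter label, then 16 integers in 0..15."""
    fields = line.strip().split(",")
    label = fields[0]
    attributes = [int(value) for value in fields[1:]]
    if len(attributes) != 16 or not all(0 <= v <= 15 for v in attributes):
        raise ValueError(f"unexpected record: {line!r}")
    return label, attributes


def split_records(records, n_train=16000):
    """Use the first n_train records for training and the rest for testing."""
    return records[:n_train], records[n_train:]


# Made-up records, for illustration only (not taken from the real data set).
sample_lines = [
    "A," + ",".join(str(v) for v in range(16)),
    "B," + ",".join(str(v) for v in range(15, -1, -1)),
    "C," + ",".join(["7"] * 16),
]

records = [parse_record(line) for line in sample_lines]
train, test = split_records(records, n_train=2)
print(len(train), "training records,", len(test), "test records")
```

On the full 20,000-record file, the default `n_train=16000` would leave the remaining 4,000 records for testing.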
The Boston remote sensing testbed describes a remotely sensed area (data from a Landsat 7 Thematic Mapper satellite), 360 pixels wide by 600 pixels in height, or 5.4 km x 9 km in area.
41 layers of data are available for each pixel (lower resolution bands were upsampled to 15 m):
Of the 216,000 points in the image, 29,003 have been assigned one of the eight labels (i.e., represent the ground truth information). As shown in the figure, the image is divided into four vertical strips.
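The figures quoted above can be checked with a few lines of arithmetic. This is only a sanity-check sketch; the variable names are illustrative, and the 15 m pixel size is taken from the upsampling note above.

```python
WIDTH_PX, HEIGHT_PX = 360, 600   # image size in pixels
PIXEL_SIZE_M = 15                # ground size of one pixel, in meters
LABELED_POINTS = 29003           # pixels with a ground truth label

width_km = WIDTH_PX * PIXEL_SIZE_M / 1000
height_km = HEIGHT_PX * PIXEL_SIZE_M / 1000
total_points = WIDTH_PX * HEIGHT_PX
labeled_fraction = LABELED_POINTS / total_points

print(f"area covered: {width_km} km x {height_km} km")
print(f"total pixels: {total_points}")
print(f"labeled share: {labeled_fraction:.1%}")
```

Only about 13% of the pixels carry ground truth, so most of the image is unlabeled.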
The distribution of ground truth from strip to strip is far from uniform, as shown in the following table.
Class | Strip 1 | Strip 2 | Strip 3 | Strip 4 | Totals |
Beach | 0 | 67 | 313 | 118 | 498 |
Ocean | 0 | 552 | 1280 | 19919 | 22101 |
Ice | 144 | 589 | 146 | 0 | 1169 |
River | 559 | 323 | 192 | 845 | 1919 |
Road | 75 | 131 | 58 | 4 | 268 |
Park | 182 | 152 | 418 | 117 | 555 |
Residential | 419 | 249 | 400 | 243 | 1311 |
Industrial | 294 | 431 | 0 | 107 | 832 |
Totals | 1673 | 3170 | 2807 | 21353 | 29003 |