Citation: Neural Networks, 8, 1053-1080
Abstract: The recognition of three-dimensional (3-D) objects from sequences of their two-dimensional (2-D) views is modeled by a family of self-organizing neural architectures, called VIEWNET, that use View Information Encoded With NETworks. VIEWNET incorporates a preprocessor that generates a compressed but 2-D invariant representation of an image, a supervised incremental learning system that classifies the preprocessed representations into 2-D view categories whose outputs are combined into 3-D invariant object categories, and a working memory that makes a 3-D object prediction by accumulating evidence from 3-D object category nodes as multiple 2-D views are experienced. The simplest VIEWNET achieves high recognition scores without the need to explicitly code the temporal order of 2-D views in working memory. Working memories are also discussed that save memory resources by implicitly coding temporal order in terms of the relative activity of 2-D view category nodes, rather than as explicit 2-D view transitions. Variants of the VIEWNET architecture may be used for scene understanding by using a preprocessor and classifier that can determine both what objects are in a scene and where they are located. The present VIEWNET preprocessor includes the CORT-X 2 filter, which discounts the illuminant, regularizes and completes figural boundaries, and suppresses image noise. This boundary segmentation is rendered invariant under 2-D translation, rotation, and dilation by use of a log-polar transform. The invariant spectra undergo Gaussian coarse coding to further reduce noise and 3-D foreshortening effects, and to increase generalization. These compressed codes are input into the classifier, a supervised learning system based on the fuzzy ARTMAP algorithm. Fuzzy ARTMAP learns 2-D view categories that are invariant under 2-D image translation, rotation, and dilation as well as 3-D image transformations that do not cause a predictive error. Evidence from sequences of 2-D view categories converges at 3-D object nodes that generate a response invariant under changes of 2-D view. These 3-D object nodes input to a working memory that accumulates evidence over time to improve object recognition. In the simplest working memory, each occurrence (nonoccurrence) of a 2-D view category increases (decreases) the corresponding node s activity in working memory. The maximally active node is used to predict the 3-D object. Recognition is studied with noisy and clean images using slow and fast learning. Slow learning at the fuzzy ARTMAP map field is adapted to learn the conditional probability of the 3-D object given the selected 2-D view category. VIEWNET is demonstrated on an MIT Lincoln Laboratory database of 128x128 2-D views of aircraft with and without additive noise. A recognition rate ofup to 90% is achieved with one 2-D view and of up to 98.5% correct with three 2-D views. The properties of2-D view and 3-D object category nodes are compared with those of cells in monkey inferotemporal cortex.