
General Intelligence :

Basic vision scheme

version 4.2

Our approach to solving the general vision problem is to combine vision with GI (general intelligence). See an introduction to GI-vision.

The basic tenets of our vision theory are:

  1. Break down the image via 3D-2D-1D-0D primitives
  2. Apply machine learning to represent the image symbolically

We have thought about the vision problem for 1-2 years and studied many real images, concluding that "everything under the sun" can be recognized by this method. A paper explaining the theory in detail will be published.

  1. 3-2-1-0D reduction
  2. Logical representation
  3. Machine learning
  4. Example: quadrilateral
  5. Approximate recognition and feedback
  6. Searchlight attention
  7. Relations between objects
  8. Architecture of the vision module

3-2-1-0D reduction

The premise is that any 3D object (or its "geon-like" components) can be defined by 2D surfaces, which are in turn defined by 1D lines. Please refer to General vision theory and background for the justification of this point.

The image is first broken down into 0D/1D primitives (such as edgels and lines). Then 2D elements are recognized (such as regions and surfaces). Then 3D objects are recognized (blocks, 3D geons, etc).

The decompositions are represented by graphical data structures (i.e., nodes and links). Nodes are primitive elements; links represent the spatial relations between them.
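As a minimal sketch of this data structure (the class and relation names below are illustrative, not part of the theory), nodes hold primitive elements and links hold labelled spatial relations:

```python
class Node:
    """A primitive element of the decomposition (line, junction, region, ...)."""
    def __init__(self, kind, label):
        self.kind = kind      # e.g. "line", "junction", "region"
        self.label = label

class DecompositionGraph:
    """Nodes are primitives; links are (node, relation, node) triples."""
    def __init__(self):
        self.nodes = {}
        self.links = []

    def add_node(self, kind, label):
        self.nodes[label] = Node(kind, label)

    def add_link(self, a, relation, b):
        self.links.append((a, relation, b))

# Example: two lines meeting at a junction.
g = DecompositionGraph()
g.add_node("line", "L1")
g.add_node("line", "L2")
g.add_node("junction", "J1")
g.add_link("L1", "terminates_at", "J1")
g.add_link("L2", "terminates_at", "J1")
```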

Details of 3-2-1-0D reduction.

Our 1D-2D levels may coincide with David Marr's primal sketch level. Here we reframe these levels under the primal sketch framework.


Logical representation

The first step is to transform the image to a logical representation:

On the left-hand side is the image, after Sobel edge detection; on the right-hand side is the logical representation. L1,L2,L3... = lines, J1,J2,J3... = junctions. The blue lines represent how the elements are connected. Other details in the background are not represented.

For simple, uncluttered scenes, this scheme will work fine. The idea is that every detail, no matter how irregular, would be represented using this logical representation (using elements such as blobs and shades in addition to lines, junctions, etc).

This approach requires a lot of patience, but it would ultimately allow us to analyse everything in the world. Other approaches, such as neural networks or SIFT, are not as general or comprehensive.

We may use neural networks at the lowest level for recognizing "edgels". Then the next stage is to join the edgels to recognize longer lines.
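A naive sketch of the edgel-joining stage might look like the following; the gap and angle tolerances, and the greedy chaining strategy, are invented for illustration:

```python
import math

def chain_edgels(edgels, max_gap=1.5, max_angle=0.2):
    """Greedily join nearby edgels of similar orientation into chains.

    edgels: list of (x, y, theta) tuples.
    Returns a list of chains, each a list of edgels forming a longer line.
    """
    chains = []
    remaining = list(edgels)
    while remaining:
        chain = [remaining.pop(0)]
        grew = True
        while grew:
            grew = False
            for e in remaining:
                x, y, t = e
                cx, cy, ct = chain[-1]
                # Join if the edgel is close to the chain's tip and
                # roughly collinear with it.
                if math.hypot(x - cx, y - cy) <= max_gap and abs(t - ct) <= max_angle:
                    chain.append(e)
                    remaining.remove(e)
                    grew = True
                    break
        chains.append(chain)
    return chains

# Three collinear edgels plus one isolated edgel -> two chains.
chains = chain_edgels([(0, 0, 0.0), (1, 0, 0.0), (2, 0, 0.0), (10, 10, 1.5)])
```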


Machine learning

What we need is a special kind of machine learning known as inductive learning (as opposed to deductive learning). In inductive learning a system tries to induce general rules from a set of observed instances ("learning by examples").

There should be an underlying knowledge representation (KR) or "calculus" that encompasses what kinds of rules are possible. The most common KR scheme is first-order predicate logic, often abbreviated as FOL (first order logic). Other options include: neural networks, semantic networks, conceptual graphs, Bayesian networks, etc.

It is not easy to determine what kind of KR is adequate for the task at hand (visual recognition). Therefore we start with something simple and similar to FOL, and see whether it needs to be expanded or modified.

Details of inductive learning.


Example: quadrilateral

A quadrilateral is any figure with 4 sides (straight lines).

Define the predicate Terminates(edge,vertex) to indicate when an edge terminates with a vertex.

This results in the set of logical statements:

Terminates(edge1,vertex1) = true
Terminates(edge1,vertex2) = true
Terminates(edge2,vertex2) = true
Terminates(edge2,vertex3) = true
Terminates(edge3,vertex3) = true
Terminates(edge3,vertex4) = true
Terminates(edge4,vertex4) = true
Terminates(edge4,vertex1) = true

Perhaps we can introduce a new predicate Connects(edge,vertex1,vertex2) to simplify the above to:

Connects(edge1,vertex1,vertex2) = true
Connects(edge2,vertex2,vertex3) = true
Connects(edge3,vertex3,vertex4) = true
Connects(edge4,vertex4,vertex1) = true

Assuming that the "universe" is a connected graph that the system is currently paying attention to, now we can easily define the 0-ary predicate Quadrilateral() using typed logic:

Quadrilateral() IF

∃e1:edge
∃e2:edge
∃e3:edge
∃e4:edge
∃v1:vertex
∃v2:vertex
∃v3:vertex
∃v4:vertex
Connects(e1,v1,v2) ^
Connects(e2,v2,v3) ^
Connects(e3,v3,v4) ^
Connects(e4,v4,v1)
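The definition above can be sketched as a brute-force search for a satisfying assignment over the Connects facts (the helper names are ours, not part of the theory; edges are treated as undirected):

```python
from itertools import permutations

def quadrilateral(connects):
    """Check the Quadrilateral() predicate over a set of Connects facts.

    connects: set of (edge, vertex_a, vertex_b) triples.
    Returns True if four distinct edges chain four distinct vertices
    into a closed loop.
    """
    def linked(e, a, b):
        # Connects is treated as undirected for this check.
        return (e, a, b) in connects or (e, b, a) in connects

    edges = {e for (e, _, _) in connects}
    verts = {v for (_, a, b) in connects for v in (a, b)}
    for e1, e2, e3, e4 in permutations(edges, 4):
        for v1, v2, v3, v4 in permutations(verts, 4):
            if (linked(e1, v1, v2) and linked(e2, v2, v3) and
                    linked(e3, v3, v4) and linked(e4, v4, v1)):
                return True
    return False

# The four Connects facts from the example above.
facts = {("edge1", "vertex1", "vertex2"),
         ("edge2", "vertex2", "vertex3"),
         ("edge3", "vertex3", "vertex4"),
         ("edge4", "vertex4", "vertex1")}
```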

This is just an example. Please refer to this page concerning various issues of inductive learning. Some other issues specific to vision are discussed as follows.


Approximate recognition and feedback

One problem is that primitive features are often fuzzy and should be approximately recognized.

For example, in the image below, 2 edges and 1 vertex are almost invisible, yet given the current context they should be interpreted as edges and a vertex. The context is important because otherwise such weak features would be regarded as noise:

The feedback mechanism should work this way: When the Recognizer finds that a concept is "almost" recognized (eg with the majority of conjuncts being true), it will select the remaining features that are not yet matched and send them to the lower-level Recognizer, which would then lower its threshold for recognizing those features.

This requires 2 things:

  1. The Recognizer should measure a degree of certainty associated with each feature being recognized.
  2. The Recognizer at the lower level should be able to use a feedback cue to look for certain features. This may require performing an "inversion" of the cue.

A detailed explanation of the feedback mechanism will be presented soon.


Searchlight attention

Another problem is that real world images are often composed of many cluttered elements, so we need to use a "searchlight" to look for individual objects in a cluttered scene.

Searchlight attention is closely related to the feedback mechanism outlined above. Due to limited computational resources we can only extract features within a "fovea" of attention. If a concept is detected to be "almost" complete, the searchlight will direct the fovea to focus on areas that are likely to complete the concept, with a concomitant decrease of attention to other areas.
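One simple way to sketch this redirection (the region names and the boost factor are illustrative assumptions) is as a reweighting of attention over image regions, where boosting the predicted region concomitantly scales down the others:

```python
def redirect_fovea(attention, predicted_region, boost=0.5):
    """Shift attention toward the region predicted to complete a concept.

    attention: region -> weight (weights sum to 1). The predicted region
    gains `boost` of the total attention; all other regions shrink
    proportionally, keeping the weights normalized.
    """
    new = {r: w * (1 - boost) for r, w in attention.items()}
    new[predicted_region] = new.get(predicted_region, 0.0) + boost
    return new

# Uniform attention, then a missing feature is predicted at lower right.
attention = {"upper_left": 0.25, "upper_right": 0.25,
             "lower_left": 0.25, "lower_right": 0.25}
attention = redirect_fovea(attention, "lower_right")
```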

This requires the searchlight to know where to search for the feedback cue. In a sense this also requires inversion of the cue.


Relations between objects

The vision system not only has to recognize individual objects, but also relations among them (represented as links).

The way to achieve this is to pay attention to individual objects sequentially. Relations are then recognized by the identities of the objects in the sequence and by how the searchlight moved.

In our logical formulation, an object is recognized by a 0-ary predicate such as Cube1(). Then we have to bring this object to the next level of recognition, where it is represented by a variable such as cube1. Only then would we be able to denote a relation like Above(cube1,cube2) at this level.

Recognition at each level is independent of recognition at other levels, except for the feedback mechanism.

The "main loop" of the Recognizer uses the searchlight to scan around the image, recognizing individual objects. When the searchlight moves, its movement is recorded and later used to form the link between the current object and the next.
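This main loop can be sketched as follows; the coordinate convention (image y grows downward) and the mapping from displacements to relations are our assumptions for illustration:

```python
def movement_to_relation(dx, dy):
    """Map a searchlight displacement to a coarse spatial relation."""
    if abs(dy) >= abs(dx):
        return "Above" if dy < 0 else "Below"
    return "LeftOf" if dx < 0 else "RightOf"

def scan(objects):
    """Scan objects in attention order and form relation links.

    objects: list of (name, x, y) in the order the searchlight visits
    them. Each recorded movement between consecutive objects becomes a
    link (object_a, relation, object_b).
    """
    links = []
    for (a, ax, ay), (b, bx, by) in zip(objects, objects[1:]):
        # The relation of a to b is read off from a's position
        # relative to b, i.e. the reverse of the recorded movement.
        links.append((a, movement_to_relation(ax - bx, ay - by), b))
    return links

# cube1 sits above cube2 (image y grows downward).
links = scan([("cube1", 5, 2), ("cube2", 5, 8)])
```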


Architecture of the vision module

The operation of the above module is typical of a rule-based system, except that there is a "loop" in the lower right corner that repeats the pattern matching process at multiple levels. What this means is that the raw sensory experience goes through multiple stages of memory consolidation via pattern matching. For details please refer to Memory systems.

This architecture has to be integrated with the larger GI (general intelligence) framework, to form a complete intelligent agent. Please refer to GI architecture.





17/Nov, 21/Jun/2006 (C) General intelligence corporation limited [ Notice about intellectual property on this page ]