Extracting Rules from Backpropagation Units

by Jonathan Henckel and Abolfazl Sirjani



In 1986, Rumelhart, Hinton, and Williams published the backpropagation (BP) neural network algorithm. The BP neural network consists of many independent computing "units" arranged in layers. BP is used to inductively learn functional mappings between multiple input and output variables. It is the most common neural network algorithm used in applications.

The BP algorithm is a "black box". That is, once it has learned a functional mapping, there is no way to (1) dissect the function into simple components, or (2) explain how BP arrived at a particular conclusion. Also, (3) BP requires that all inputs be specified, rather than requesting inputs on an as-needed basis.

Suppose that doctors at a clinic can request numerous lab tests in order to diagnose patients. Each time a patient is diagnosed, the patient's lab test results and diagnosis are recorded in a database. The BP algorithm is run against this database to learn the mapping between lab test results and diagnosis. This mapping can then be used to automatically diagnose patients based on lab test results. However, since the mapping is a black box, there is no way for the doctors to approve the knowledge that has been acquired. There is no way for BP to explain its diagnosis. If partial input is given to BP, it does not indicate whether more information is needed. Therefore, every lab test would need to be performed beforehand in order to guarantee correctness.

We have not completely solved all of the problems mentioned above. We have, however, developed an algorithm that converts the function performed by a single BP unit into a minimal set of production rules. Production rules are well suited to representing complex functions as simpler statements that people can read and understand. Furthermore, the rules can be entered into an expert system shell and executed. Most ES shells request input variables as needed and can generate reasonable explanations of how a conclusion was reached.


Table of Contents

Background
Rule Extraction
Success Pruning
Failure Pruning
Combined Pruning
Summary

Background

Fig.1 shows an example of a BP unit with four input variables. For our purposes, we will assume input variables are binary, 0 or 1. The weight, wi, associated with each input variable and the threshold, t, are real numbers.

The unit's operation is to first find its net input by

net = t + SUM_k wk ik

and then to compute its output

out = ( 1 + e^(-net) )^(-1)

This formula is called the activation function. It is a squashing function, so the output is a real value in the range (0, 1). An output above 0.9 is generally interpreted as a 1 or TRUE, and an output below 0.1 as a 0 or FALSE. Other output values are interpreted as ambiguous.
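The unit's computation can be sketched in a few lines of Python (an illustrative sketch, not the authors' code; the function name is ours):

```python
import math

def bp_unit_output(inputs, weights, threshold):
    """Output of a single BP unit with binary inputs: compute the net
    input, then squash it through the logistic activation function."""
    net = threshold + sum(w * i for w, i in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-net))   # squashing function, range (0, 1)
```

With net well above zero the output approaches 1 (TRUE); well below zero it approaches 0 (FALSE); near zero it is ambiguous.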

Rule Extraction

Our rule extraction algorithm efficiently converts the function performed by a BP unit to a minimal set of production rules. A "minimal" set means that the total size of the left-hand sides of the productions is as small as possible.

Before describing the algorithm, we will make some observations.

1. The function performed by a BP unit depends entirely on its threshold and weights. Therefore, this is all the information needed to extract rules.

2. The inputs whose weights have the greatest absolute value have the greatest effect on the output of the unit.

3. The activation function is monotonic; it tends to 1 for large positive net and to 0 for large negative net.

Each input variable has a corresponding weight. When BP is in learning mode, these weights and the threshold are adjusted. Afterwards, the weights and threshold are not changed. Inputs of 0 have no effect on the net. Inputs of 1 with negative corresponding weights decrease the net and those with positive corresponding weights increase the net.

The production rules to be generated will take two forms:

if <expression1> then output is high

if <expression2> then output is low

The <expression> is a conjunction of statements about the input variables. (Allowing disjunction would not make the expressions more powerful, since a disjunctive rule can always be split into several conjunctive rules.)

In fig.1, suppose W = (5, 4, -3, -2) and t=2. Conveniently, these weights are in sorted order by absolute value. If they were not, we would use a table sort to index them in sorted order. If i1=1 and i4=0 then the net would be at least 4, so the rule

(1) if i1=1 and i4=0 then output is high

is valid. Notice that

(2) if i1=1 then output is high

is also valid. Rule (2) is a minimal rule, but (1) is not because (1) is subsumed by (2).

Some notation needs to be defined to make these expressions more readable. Let px represent a statement that can cause the output to be high

px means ix=1 when wx >= 0

px means ix=0 when wx < 0

Expressions are compound statements of the form

px1px2 ... pxm, 0 <= m <= n

where n is the number of input variables. To avoid redundancy, x1 < x2 < ... < xm.

Thus our previous rule (1) can be expressed

if p1p4 then output is high.

Let totalmin equal the threshold plus the sum of negative weights. Define the function f so that

f(Ø) = totalmin

f(px) = totalmin + |wx|

f(pxpy) = totalmin + |wx| + |wy|

...

Notice that f is our "at least" function. In the previous example, f(p1p4) = 4. When f is near zero, the output could be ambiguous. When f is above about 2, the output will be above 0.9.
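The definitions above translate directly into code (a sketch; indices are 0-based and the function name is ours):

```python
def make_f(weights, threshold):
    """Build the "at least" function f for a unit, assuming the weights
    are sorted by descending absolute value."""
    totalmin = threshold + sum(w for w in weights if w < 0)

    def f(expr):
        # expr is a tuple of input indices x whose px is asserted;
        # f(expr) is the smallest net the unit can then attain.
        return totalmin + sum(abs(weights[x]) for x in expr)

    return f
```

For the example weights W = (5, 4, -3, -2) with t = 2, totalmin = -3 and f(p1p4) = -3 + 5 + 2 = 4, as in the text.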

To extract rules we search the possible expressions from shortest to longest. The shortest expression is the empty expression Ø, where m=0. This search can be thought of as traversing the tree of expressions as in fig.2. As the tree is traversed, each expression is tested: if f(e) > 2 then the expression has succeeded; otherwise it has failed. The size of this tree is 2^n, which is too large to search exhaustively. We will discuss two schemes to reduce the size of this tree: success pruning, for pruning the height of the tree, and failure pruning, for pruning its breadth.

Success Pruning

Notice in the tree that every expression is subsumed by its parent. If an expression succeeds, there is no reason to search its children, because we want to find the minimal set of rules. This is success pruning.
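Success pruning alone is enough for a simplified, breadth-first sketch of the extraction (ours, not the authors' implementation; it omits the fail-graph ordering described in the following sections, and uses f(e) >= 2 as the success test so that rule (2), with f(p1) = 2, counts as valid as in the text):

```python
def extract_high_rules(weights, threshold, margin=2.0):
    """Breadth-first search for minimal "output is high" rules.

    weights are assumed already sorted by descending absolute value;
    an expression is a tuple of 0-based input indices x whose px is
    asserted. Success pruning: a succeeding expression is emitted as a
    rule, and its children (longer, subsumed expressions) are never
    generated."""
    n = len(weights)
    totalmin = threshold + sum(w for w in weights if w < 0)

    def f(expr):
        return totalmin + sum(abs(weights[x]) for x in expr)

    rules = []
    frontier = [()]                      # expressions of the current length
    while frontier:
        longer = []
        for expr in frontier:
            if f(expr) >= margin:        # success: minimal rule found,
                rules.append(expr)       # do not extend it
            else:                        # failure: extend with larger indices
                start = expr[-1] + 1 if expr else 0
                longer.extend(expr + (x,) for x in range(start, n))
        frontier = longer
    return rules
```

Because the weights are sorted, a failed expression's prefix has the largest f among its same-length subsets, so everything emitted is minimal. For W = (5, 4, -3, -2), t = 2, this yields p1 as a rule and correctly omits the subsumed p1p4.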

Failure Pruning

Failure pruning takes advantage of the fact that the weights are sorted by absolute value. Because |w1| >= |w2|, we know f(p1) >= f(p2). In general, this is true for any two expressions that differ in only one place.

If i < j, then f(E1piE2) >= f(E1pjE2)

where Ek is some legal sequence of p's.

Suppose, for example, that the current expression is p1. If this fails (because f(p1) < 2), then we will not look at p2 since we know f(p2) <= f(p1).

Fig.3 shows the fail graphs for expressions of size 1, 2, and 3. If any expression in the graph fails, then all expressions it "points to" will also fail, so they can be pruned from the search. The shaded regions show the quadrants pruned when each e fails.

We use these fail graphs to control the order in which expressions of the same length are searched. One problem in traversing these graphs is that some expressions have multiple paths leading into them. The expression p3p5, for example, has five paths leading to it from p1p2. We only want to search p3p5 once. To fix this we could either eliminate the arrows pointing right or those pointing left. Because of the shape of the pruning, we eliminate those pointing left, except when they are necessary. The left arrows are necessary from each expression pxpy where x=y-2. After eliminating these arrows we get the graphs in fig.4.

In these graphs, each expression will be looked at only once, and only if the expression pointing to it has not failed. There are graphs for higher m, but the graph for m=4 is four-dimensional and difficult to draw. We can, however, use these small cases to generalize to larger m. Most expressions point to only one other expression, which has the last p incremented by 1.

E1 pi points to E1 p(i+1), i < n

Some expressions point to a second expression, (no expression points to more than two). We already noted that

E1 pi pj points to E1 p(i+1) pj, i = j-2

In general,

E1 pi pj1 ... pjr points to E1 p(i+1) pj1 ... pjr, (r >= 0), where i = j1 - 2 and j(s) = j(s+1) - 1 for s < r

An example expression p1p4p6p8p9p10 points to both p1p4p6p8p9p11 and p1p4p7p8p9p10.
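These pointer rules can be sketched as a small function (ours; expressions are tuples of 1-based indices). It increments the last index when possible, then finds the maximal run of consecutive indices at the tail and increments the preceding index when it sits exactly 2 below the run's start:

```python
def successors(expr, n):
    """The one or two fail-graph successors of an expression.

    expr is a tuple of 1-based indices (x1 < x2 < ... < xm); n is the
    number of input variables."""
    succ = []
    if expr[-1] < n:                               # E1 pi -> E1 p(i+1)
        succ.append(expr[:-1] + (expr[-1] + 1,))
    # Find where the maximal run of consecutive indices at the tail begins.
    start = len(expr) - 1
    while start > 0 and expr[start - 1] == expr[start] - 1:
        start -= 1
    # Second successor only when the preceding index is 2 below the run.
    if start > 0 and expr[start - 1] == expr[start] - 2:
        succ.append(expr[:start - 1] + (expr[start - 1] + 1,) + expr[start:])
    return succ
```

For the example in the text, p1p4p6p8p9p10 yields exactly p1p4p6p8p9p11 and p1p4p7p8p9p10.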

Combined Pruning

To combine these pruning strategies we will combine the tree from fig.2 with the graphs in fig.4 to get the search graph in fig.5.

The success paths are only followed when an expression succeeds, the failure paths only when it fails. Only one failure pointer is needed from each expression, because the children are linked by success pointers.

Some expressions have two success pointers. If one of these expressions succeeds, both pointers must be followed. A recursive procedure is the most obvious way to do this. All other cases may be handled iteratively.

In the implementation, this search graph is never actually constructed in memory. The only parts in memory at any one time are the current expression and a stack of expressions for which only one of the two success paths has been explored.

Summary

The only information used to guide the search in this algorithm is the fact that the weights have been sorted beforehand by absolute value and the results of the expressions already searched. The algorithm is optimal in the sense that, with this information, no expression is looked at if it can be proven to succeed (it is subsumed by a previous successful expression) or it can be proven to fail.

The time complexity of this algorithm is relatively good. The only alternative algorithms we know of begin by listing all the minterms of the function. This takes O(2^n) time and O(2^n) space. We ran our rule extraction algorithm for n up to 25, and it appears to take O(1.6^n) time on average. We have not taken measurements of space complexity, but it appears small.

So far we have discussed how to extract rules of the form "if E then output is high." To extract the other kind of rule, "if E then output is low," all we need to do is invert the definition of px and redefine f such that

f(px) = -totalmax + |wx|

where totalmax equals the threshold plus the sum of the positive weights.
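The redefined f can be sketched as follows (ours; totalmax is taken to be the threshold plus the sum of the positive weights, by symmetry with totalmin):

```python
def make_f_low(weights, threshold):
    """Margin function for "output is low" rules. Here px is inverted:
    it asserts ix=0 when wx >= 0 and ix=1 when wx < 0, so each asserted
    px lowers the maximum attainable net by |wx|."""
    totalmax = threshold + sum(w for w in weights if w > 0)

    def f(expr):
        # -(maximum net still attainable once the px in expr hold);
        # f(expr) > 2 guarantees net < -2, hence a low output.
        return -totalmax + sum(abs(weights[x]) for x in expr)

    return f
```

For W = (5, 4, -3, -2) and t = 2, totalmax = 11, and asserting all four inverted px gives f = 3 > 2: the net is then at most -3, so the output is below 0.1.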

We have not discussed combining the rules from multiple layers of BP units. The naïve substitution approach seems to work in some cases. The difficulty is that the inputs to units in other than the first layer are often not binary.

