CA2035338C - Operational speed improvement for neural networks - Google Patents

Operational speed improvement for neural networks

Info

Publication number
CA2035338C
CA2035338C CA002035338A CA2035338A
Authority
CA
Canada
Prior art keywords
nonlinear function
function
nonlinear
neural network
computational device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CA002035338A
Other languages
French (fr)
Inventor
Bernhard Boser
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
American Telephone and Telegraph Co Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by American Telephone and Telegraph Co Inc filed Critical American Telephone and Telegraph Co Inc
Application granted granted Critical
Publication of CA2035338C publication Critical patent/CA2035338C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

Higher operational speed is obtained without sacrificing computational accuracy and reliability in a neural network by interchanging a computationally complex nonlinear function with a similar but less complex nonlinear function in each neuron or computational element after each neuron of the network has been trained by an appropriate training algorithm for the classifying problem addressed by the neural network. In one exemplary embodiment, a hyperbolic tangent function is replaced by a piecewise linear threshold logic function.

Description


OPERATIONAL SPEED IMPROVEMENT FOR
NEURAL NETWORKS

Technical Field
This invention relates to the field of pattern recognition and, more particularly, to computational elements in neural networks.

Background of the Invention
Computer-based information gathering, handling, manipulation, storage, and transmission have fostered the growth and acceptance of computation systems based upon adaptive learning with various neural network architectures. These computation systems are commonly called automatic learning networks, neural networks, hierarchically layered networks, and massively parallel computation networks. Applications of the computational systems represent potentially efficient approaches to solving problems such as providing automatic recognition, analysis and classification of character patterns in a particular image. In measuring the value of such applied systems, it is necessary to focus on two key operational parameters relative to conventional approaches. The parameters are speed and accuracy.
Accuracy of the systems has been improved steadily by employing more complex architectures, more extensive training routines, and multi-valued intermediate decision levels (e.g., gray scale encoding). Unfortunately, improvements to obtain greater system accuracy tend to adversely impact the speed of the system.
Since most systems are based on implementations using a general or special-purpose processor, complexity and rigorousness of such an implementation is generally translatable into additional program steps for the processor. In turn, the response of the processor is slowed by an amount commensurate with the additional number of program steps to be performed. As a result, faster, more effective neural network computation systems appear realizable only by replacement of the processor with a faster and equally, or more, accurate processor.

Summary of the Invention
Higher operational speed is obtained with reduced computational complexity, without sacrificing quality, accuracy and reliability of the result, in a neural network by training a plurality of neurons (computational elements) in the neural network while a computationally complex nonlinear squashing function is employed in each neuron or computational element to determine accurate weights for each neuron, replacing the computationally complex nonlinear squashing function with a less complex nonlinear squashing function, and classifying data presented to the neural network with the less complex nonlinear squashing function employed in each neuron of the neural network. Replacement of the computationally complex nonlinear squashing function with the less complex nonlinear function permits a savings in computation to be realized for the neural network at each neuron.
In one embodiment, a differentiable nonlinear function such as a hyperbolic tangent is replaced by a piecewise linear threshold logic function. Differentiable nonlinear functions are required for certain well known training criteria such as back propagation.
In another embodiment, each neuron or computational element of the neural network includes several nonlinear function elements representing nonlinear squashing functions of varying degrees of complexity. The elements are controllably switched into, and out of, operation during training and classifying operational phases in accordance with the principles of the invention.
In accordance with one aspect of the invention there is provided a computational device for use in a neural network comprising means jointly responsive to a data input vector and a weight vector for performing a dot product thereof, means responsive to an output from said dot product means for squashing said output according to a predetermined nonlinear function to generate an output value for the computational device, means for substituting a second nonlinear function for a first nonlinear function as the predetermined function in said squashing means, said weight vector being related to said first nonlinear function, said second nonlinear function being less computationally complex than said first nonlinear function.
In accordance with another aspect of the invention there is provided a neural network comprising a plurality of computational devices interconnected in a predetermined hierarchical structure to form said network, each computational device comprising means jointly responsive to a data input vector and a weight vector for performing a dot product thereof, means responsive to an output from said dot product means for squashing said output according to a predetermined nonlinear function to generate an output value for the computational device, means for substituting a second nonlinear function for a first nonlinear function as the predetermined function in said squashing means, said weight vector being related to said first nonlinear function, said second nonlinear function being less computationally complex than said first nonlinear function.

In accordance with yet another aspect of the invention there is provided a method for improving a neural network, said network comprising a plurality of computational devices, each computational device comprising means responsive to a data input value for squashing said data input value according to a predetermined nonlinear function to generate an output value for the computational device, said method comprising the step of: substituting a second nonlinear function for a first nonlinear function as the predetermined function in each said squashing means, said data input value being related to said first nonlinear function, said second nonlinear function being less computationally complex than said first nonlinear function.

Brief Description of the Drawing
A more complete understanding of the invention may be obtained by reading the following description of specific illustrative embodiments of the invention in conjunction with the appended drawing in which:
FIGs. 1 and 2 are simplified block diagrams for alternative individual computational elements in a learning or neural network in accordance with the principles of the invention;
FIGs. 3 through 7 show exemplary graphical representations of different nonlinear functions employed by the computational elements in FIGs. 1 and 2; and
FIG. 8 shows a simplified block diagram of an exemplary hierarchical automatic learning network in the prior art.

Detailed Description
In the description below, pattern recognition and, particularly, optical character recognition are presented as applicable areas for the present invention. Such an example is meant for purposes of explication and not for purposes of limitation. It is contemplated that the present invention be applied also to areas such as speech processing and recognition, automatic control systems, and other artificial intelligence or automatic learning networks which fall within the realm commonly called neural networks. Hereinafter, the term "neural network" is understood to refer generically to all such networks.
Computational elements as shown in FIGs. 1 and 2 form the fundamental functional and connectionist blocks for most neural or learning networks. In general, a computational element forms a weighted sum of input values for n+1 inputs and passes the result through a nonlinearity f(a), where a is the input to the nonlinear function, to arrive at a single value. The nonlinearity is often termed a nonlinear squashing function and includes functions such as hard limiter functions, threshold logic element functions, and sigmoidal functions, for example, as shown in FIGs. 3 through 7. Input and output values for the computational element may be analog, quasi-analog such as multi-level and gray scale, or binary in nature.
In operation, the computational element shown in FIGs. 1 and 2 scans n different inputs which, during an exemplary operation such as optical character recognition, may be neighboring input pixels, pixel values or unit values from an image or feature map. These inputs have values represented as a1, a2, ..., an. An input bias is supplied to an n+1 input of a computational element. For simplicity, the bias is generally set to a constant value such as 1. The inputs and the bias are supplied to multipliers 1-1 through 1-(n+1). The multipliers accept another input from a weight vector having weights w1 through wn+1. Outputs from all multipliers are supplied to adder 2, which generates the weighted sum of the input values. As such, the output from adder 2 is simply the dot product of a vector of input values (including a bias value) with a vector representing the weights. The output value from adder 2 is passed through the nonlinear function to generate a single unit output value xi. As will be understood more clearly below, unit output value xi is related to the value of the ith unit in the feature map under consideration.
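As an illustrative sketch only, the element just described can be written in a few lines of Python; the function and variable names and the use of NumPy are assumptions, not taken from the patent:

```python
import numpy as np

def computational_element(inputs, weights, nonlinearity):
    """Weighted sum of the n inputs plus the bias input, squashed by f(a).

    inputs  -- the n data values a1 .. an
    weights -- the n+1 weights w1 .. wn+1 (the last entry multiplies the bias)
    """
    a = np.dot(np.append(inputs, 1.0), weights)  # dot product including the bias input of 1
    return nonlinearity(a)                       # single unit output value xi
```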
A first nonlinear function is used during the training phase of neural network operation to establish values of the weight vector by a standard training algorithm such as backward propagation or the like. During a non-training phase such as a classification or recognition phase, a second nonlinear function is used by the neural network. In accordance with principles of the present invention, the second nonlinear function is less computationally complex than the first nonlinear function. Computational complexity is understood to mean that the computations necessary to achieve a result for the nonlinear function f(a) are simpler and, most probably, fewer in number for the second nonlinear function as compared with the first nonlinear function. This will become more clear with the description of FIGs. 3 through 7 below. By requiring lower computational complexity for the nonlinearity during a non-training phase, one can readily see that each computational element operates faster during a non-training operational phase than each corresponding computational element using the first nonlinear function during the training operational phase.
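A hedged sketch of this phase-dependent selection follows, taking the hyperbolic tangent as the first (training) function and a clipping function as the second; the amplitude parameter c and the phase labels are assumptions for illustration:

```python
import numpy as np

def squash(a, phase, c=1.0):
    """Select the nonlinearity by operational phase.

    Training uses the computationally complex f1; classification uses the
    less complex piecewise linear f2 with the weights learned under f1.
    """
    if phase == "training":
        return c * np.tanh(a)            # f1: complex, differentiable
    return float(np.clip(a, -c, c))      # f2: piecewise linear threshold logic
```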
Computational complexity is perhaps best understood with reference to FIGs. 3 through 7. FIGs. 5 and 7 are continuous nonlinear functions having a high degree of computational complexity relative to the piecewise linear or quasi-continuous functions shown in FIGs. 3, 4, and 6. In accordance with the principles of this invention, a computationally complex nonlinear function (e.g., FIGs. 5 and 7) is employed during the training phase to establish values for each weight vector entry. After the weight vector values are set, a computationally simpler nonlinear function (e.g., FIGs. 3, 4, and 6) replaces the computationally complex function.
Continuous nonlinear functions such as those shown in FIGs. 5 and 7 require a significant amount of computing to determine f(a) for a given a. Hence, they are computationally complex. It should be noted that these functions are asymptotic along both horizontal axes.
Less complex functions shown in FIGs. 3, 4, and 6 require much less computing to determine the value of f(a). A hard limiter function shown in FIG. 3 minimally approximates the more complex nonlinear function in FIG. 7. A much closer approximation for the function in FIG. 7 is shown in FIG. 6, which is a piecewise linear threshold logic function. The latter function is known as a piecewise linear function because it comprises a number of linear pieces to complete the entire function. While the breakpoints are shown to occur at a point of equality for the ordinate and abscissa, other relationships are contemplated so that the slope of the line may be changed or the curve may be shifted left or right and the like.
While a number of different nonlinear functions of varying computational complexity have been shown in FIGs. 3 through 7, many other nonlinear functions are contemplated for use in the present invention. For example, a Taylor series approximation of a particular nonlinear squashing function may be employed as the computationally complex nonlinear function. An accurate Taylor series approximation of the hyperbolic tangent function is given below as:

tanh(x) ≈ Σ_{i=0}^{n} a_i x^i,

where n is a large integer and the a_i are constant coefficients. A nonlinear function having reduced computational complexity, which may be substituted for the accurate Taylor series expansion described above, is given as:

tanh(x) ≈ Σ_{i=0}^{m} a_i x^i,

where m is a small integer such that m < n.
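For illustration, both truncations can be evaluated with the same Horner-scheme routine; the coefficients shown are the leading Taylor coefficients of tanh, and the split between an "accurate" and a "reduced" truncation is only an example, not a prescription from the patent:

```python
def poly_approx(x, coeffs):
    """Evaluate sum_{i=0}^{len(coeffs)-1} a_i * x**i by Horner's rule."""
    result = 0.0
    for a_i in reversed(coeffs):
        result = result * x + a_i
    return result

# Leading terms of the Taylor series tanh(x) = x - x^3/3 + 2x^5/15 - 17x^7/315 + ...
coeffs_accurate = [0.0, 1.0, 0.0, -1.0/3, 0.0, 2.0/15, 0.0, -17.0/315]  # n terms (large)
coeffs_reduced = coeffs_accurate[:4]                                     # m terms, m < n
```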
It may be desirable to choose a less computationally complex function, f2(a), to replace the more computationally complex nonlinear function, f1(a), where f2(a) is a fairly good approximation of f1(a). However, it is contemplated that any less complex function may be used to replace any more complex function where there is little or no relationship between the two functions.
As shown in FIG. 1, the computationally complex nonlinear function 3, f1(a), which is used during the training phase, is controllably switched out of operation for the computational element by switches 5 and 6. When the network is in a non-training phase, the less complex nonlinear function 4, f2(a), is connected into operation for the computational element.
Variations of the switch configuration shown in FIG. 1 have been contemplated. Such variations include connecting the output xi to each of nonlinearities 3 and 4 and using switch 5 to direct the data to the correct nonlinear function for the particular operational phase of the neural network. Another variation includes connecting the input to each of nonlinearities 3 and 4 and operating switch 6 to select the proper nonlinear function for the particular operational phase.
In FIG. 2, a single nonlinearity 7 is shown in which the nonlinear function, fi(a), is evaluated for a given a, where the function subscript i is 1 for a computationally complex nonlinear function and 2 for a nonlinear function of lesser computational complexity. This arrangement requires fewer parts than the computational element in FIG. 1, but it is isomorphic with the arrangement shown in FIG. 1.
In an example from experimental practice, an exemplary sigmoidal function for computationally complex nonlinearity 3 in FIG. 1 is chosen for the training phase as a scaled hyperbolic tangent function, f1(a) = c tanh(Sa), where a is the weighted sum input to the nonlinearity, c is the amplitude of the function, and S determines the slope of the function at the origin. As described above and shown in FIG. 7, the nonlinearity is an odd function with horizontal asymptotes at +c and -c. It is understood that nonlinear functions exhibiting an odd symmetry are believed to yield faster convergence of the weight vector elements, w1 through wn+1, during learning or training.
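A minimal sketch of this training-phase nonlinearity, with c and S left as free parameters since no particular values are fixed in this passage:

```python
import math

def f1(a, c=1.0, S=1.0):
    """Scaled hyperbolic tangent: an odd function with horizontal asymptotes
    at +c and -c and slope c*S at the origin."""
    return c * math.tanh(S * a)
```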
Weights for each computational element in a learning network are obtained using a standard training algorithm. One such training algorithm is a trial-and-error learning technique known as back propagation. See Rumelhart et al., "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pp. 319-364 (Cambridge, MA: Bradford Books), or see R. P. Lippmann, "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, Vol. 4, No. 2, pp. 4-22 (1987). Prior to training, each weight is initialized to a random value using a uniform distribution between, for example, -2.4/Fi and 2.4/Fi, where Fi is the number of inputs (fan-in) of the unit to which the connection belongs. For the example shown in FIG. 1, the fan-in Fi is equal to n+1. By using this initialization technique, it is possible to maintain values within the operating range of the sigmoid nonlinearity. During training, image patterns are presented in a constant order. Weights are updated according to the stochastic gradient or "on-line" procedure after each presentation of a single image pattern for recognition. A true gradient procedure may be employed for updating so that averaging takes place over the entire training set before weights are updated. It is understood that the stochastic gradient is found to cause weights to converge faster than the true gradient, especially for large, redundant image data bases.
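The initialization and on-line update rules described above might be sketched as follows; the learning-rate value and the gradient argument are assumptions (the gradient itself would come from back propagation, which is not shown here):

```python
import numpy as np

rng = np.random.default_rng()

def init_weights(fan_in):
    """Uniform initialization in [-2.4/Fi, 2.4/Fi], where Fi is the unit's fan-in."""
    bound = 2.4 / fan_in
    return rng.uniform(-bound, bound, size=fan_in)

def online_update(weights, gradient, learning_rate=0.01):
    """Stochastic ('on-line') update applied after each single pattern presentation."""
    return weights - learning_rate * gradient
```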
Realization of the computational elements and, for that matter, entire networks may be in hardware or software or some convenient combination of hardware and software. Much of the network presented herein has been implemented using an AT&T DSP 32C digital signal processor with simple programs performing the rudimentary mathematical operations of addition, subtraction, multiplication, and comparison. Pipeline devices, microprocessors, and special purpose digital signal processors also provide convenient architectures for realizing the network in accordance with the principles of the invention. MOS VLSI technology has also been employed to implement particular weighted interconnection networks of the type shown in FIG. 8. Local memory is desirable to store pixel and unit values and other temporary computation results.


Each pixel has a value associated therewith which corresponds to the light intensity or color or the like emanating from that small area on the visual character image.
The values of the pixels are then stored in the memory devices. When reference is made to a particular map, it is understood that the terms "pixel" and "unit value(s)" are used interchangeably and include pixels, pixel values and unit values output from each computation element combining to form the map array. It may be more convenient to think in terms of planes or 2-dimensional arrays (maps) of pixels, rather than pixel values or unit values, for visualizing and developing an understanding of network operation.
Standard techniques are employed to convert a handwritten character to the pixel array which forms the supplied character image. The character image may be obtained through electronic transmission from a remote location, or it may be obtained locally with a scanning camera or other scanning device. Regardless of its source and in accordance with conventional practice, the character image is represented by an ordered collection of pixels.
The ordered collection is typically an array. Once represented, the character image is generally captured and stored in an optical memory device or an electronic memory device such as a frame buffer.
Various other preprocessing techniques used to prepare a character image as a pixel array for character recognition may include various transformations such as scaling, size normalization, deskewing, centering, and translation or shifting, all of which are well known to those skilled in the art. In addition, transformation from the handwritten character to a gray scale pixel array may be desirable to preserve information which would otherwise be irretrievably lost during preprocessing. The latter transformation is understood to be well known to those skilled in the art.
In addition to the operations listed above for preparation of the image for character recognition, it is generally desirable to provide a uniform, substantially constant level border around the original image.
FIG. 8 shows a simplified block diagram of an exemplary embodiment for a hierarchical constrained automatic learning network which can include computational elements realized in accordance with the principles of the invention. This network has been described in Canadian Patent Application Serial No. 2,015,748, which was filed on April 30, 1990 in the name of J. S. Denker et al. The network performs character recognition from a supplied image via massively parallel computations. Each array (box) shown in levels 20 through 60 is understood to comprise a plurality of computational elements, one per array unit.
The network shown in FIG. 8 comprises first and second feature detection layers and a character classification layer. Each layer comprises one or more feature maps or arrays of varying size. In most conventional applications, the maps are square. However, rectangular and other symmetric and non-symmetric or irregular map patterns are contemplated. The arrangement of detected features is referred to as a map because an array is constructed in the memory device where the pixels (unit values) are stored and feature detections from one lower level map are placed in the appropriate locations in the array for that map. As such, the presence or substantial presence (using gray scale levels) of a feature and its relative location are thus recorded.
The type of feature detected in a map is determined by the weight vector being used. In constrained feature maps, the same weight vector is used for each unit of the same map. That is, a constrained feature map scans a pixel array for the occurrence of the particular feature defined by one particular weight vector. As such, the term "constrained" is understood to convey the condition that computation elements comprising a particular map are forced to share the same set of weight vector values. This results in the same feature being detected at different locations in an input image. It is understood that this technique is also known as weight sharing.
For those skilled in the art, it will be understood that the weight vector defines a receptive field (5 pixels x 5 pixels or 2 pixels x 2 pixels) on the plane of the image pixels or map units being detected for occurrence of the feature defined by the weight vector. By placement of the weight vector on a pixel array, it is possible to show which pixels are being input to the computation element in the feature map and which unit on the feature map is being activated. The unit being activated corresponds generally to an approximate location of the feature occurrence in the map under detection.
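Weight sharing over such a receptive field amounts to scanning one kernel across the image; the following simplified sketch (unit stride, no border handling, NumPy assumed) is illustrative only:

```python
import numpy as np

def constrained_feature_map(image, kernel, nonlinearity):
    """Every unit of the map shares the same weight vector (the kernel),
    so the same feature is detected at every location of the input."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    fmap = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            receptive_field = image[i:i + kh, j:j + kw]
            fmap[i, j] = nonlinearity(np.sum(receptive_field * kernel))
    return fmap
```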
The first feature detection layer includes a plurality of constrained feature maps 20 and a corresponding plurality of feature reduction maps 30. As shown in the figure, the particular embodiment of the network includes four each of the constrained feature maps and the corresponding feature reduction maps in the first layer. The second feature detection layer includes a plurality of constrained feature maps 40 and a corresponding plurality of feature reduction maps 50. As shown in the figure, the particular embodiment of the network includes twelve each of the constrained feature maps and the corresponding feature reduction maps in the second layer.
The final layer of the network comprises a character classification layer 60 which is fully connected to all feature reduction maps of the second feature detection layer. The character classification layer generates an indication of the character recognized by the network from the supplied original image. The term "fully connected" is understood to mean that the computation element associated with a pixel in character classification layer 60 receives its input from every pixel or unit included in the preceding layer of maps, that is, layer 50.
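A sketch of such a fully connected classification layer, assuming the maps of layer 50 are simply flattened and concatenated (the flattening order and argument shapes are illustrative assumptions):

```python
import numpy as np

def classification_layer(reduction_maps, weights, bias, nonlinearity):
    """Each output unit receives input from every unit of every map in layer 50."""
    x = np.concatenate([m.ravel() for m in reduction_maps])  # all units of the preceding layer
    return nonlinearity(weights @ x + bias)                  # one value per character class
```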
Interconnection lines from layer to layer in the network have been drawn to show which maps in a preceding layer provide inputs to each and every computation element whose units form the maps in the layer of interest. For example, constrained feature maps 201 through 204 detect different features from image 10 in the process of generating the constrained feature maps. Proceeding to the next level of maps, it is seen that feature reduction map 301 derives its input solely from the units in constrained feature map 201. Similarly, feature reduction maps 302 through 304 derive their input solely from the units in constrained feature maps 202 through 204, respectively. For the network embodiment shown in FIG. 8, interconnection from the first feature detection layer to the second feature detection layer is somewhat more complicated. Constrained feature maps 401, 404, 407, and 410 derive their inputs solely from units in feature reduction maps 301 through 304, respectively. Constrained feature maps 402, 403, 405, and 406 derive their inputs from combinations of units from feature reduction maps 301 and 302; constrained feature maps 408, 409, 411 and 412 derive their inputs from combinations of units from feature reduction maps 303 and 304. Finally, individual feature reduction maps 501 through 512 derive their inputs solely from units in individual ones of corresponding constrained feature maps 401 through 412, respectively.
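For bookkeeping purposes only, the fan-in pattern just described can be written down as a table keyed by the map numbers of FIG. 8 (the dictionary itself is an illustrative construct, not part of the specification):

```python
# Feature reduction maps feeding each constrained feature map of the
# second feature detection layer, following the text above.
SECOND_LAYER_INPUTS = {
    401: [301], 404: [302], 407: [303], 410: [304],
    402: [301, 302], 403: [301, 302], 405: [301, 302], 406: [301, 302],
    408: [303, 304], 409: [303, 304], 411: [303, 304], 412: [303, 304],
}
```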
It should be noted that the character classification layer 60 includes a sufficient number of elements for the particular character recognition problem being solved by the network. That is, for the recognition of either upper case or lower case Latin alphabetic characters, one exemplary embodiment of layer 60 would include 26 units signifying the letters A through Z or a through z. On the other hand, for the recognition of numeric characters, one embodiment of layer 60 would include only 10 units signifying the numbers 0 through 9.
For convenience and ease of understanding, the bias input to the computational element and its associated weight in the weight vector element shown in FIGs. 1 and 2 have been omitted in the description of the neural network herein.

In experimental practice, the bias is set to 1 and its corresponding weight is learned through back propagation.
When computational elements realized in accordance with the principles of the invention are employed in the network shown in FIG. 8 and programmed on an AT&T digital signal processor DSP-32C, the neural network achieved a 100% increase in operational speed for character recognition. The network was trained using a hyperbolic tangent function with back propagation training. The hyperbolic tangent function was replaced by a piecewise linear function during the classification (character recognition) phase. This piecewise linear function is shown in FIG. 6 and expressed as follows:
f(a) =  c,  if a > c
       -c,  if a < -c
        a,  otherwise

where a is the input to the nonlinearity in the computational element.
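This replacement nonlinearity reduces to a clipping operation; a minimal sketch, with c standing for the same amplitude used by the training-phase hyperbolic tangent (its value is not fixed in this passage):

```python
def f2(a, c=1.0):
    """Piecewise linear threshold logic function of FIG. 6: clip a to [-c, c]."""
    if a > c:
        return c
    if a < -c:
        return -c
    return a
```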

Claims (18)

1. A computational device for use in a neural network comprising means jointly responsive to a data input vector and a weight vector for performing a dot product thereof, means responsive to an output from said dot product means for squashing said output according to a predetermined nonlinear function to generate an output value for the computational device, means for substituting a second nonlinear function for a first nonlinear function as the predetermined function in said squashing means, said weight vector being related to said first nonlinear function, said second nonlinear function being less computationally complex than said first nonlinear function.
2. The computational device as defined in claim 1 wherein said second nonlinear function is a piecewise linear approximation of said first nonlinear function.
3. The computational device as defined in claim 2 wherein said first nonlinear function is a hyperbolic tangent function.
4. The computational device as defined in claim 3 wherein said second nonlinear function is a piecewise linear function defined as, where x is said output.
5. A neural network comprising a plurality of computational devices interconnected in a predetermined hierarchical structure to form said network, each computational device comprising means jointly responsive to a data input vector and a weight vector for performing a dot product thereof, means responsive to an output from said dot product means for squashing said output according to a predetermined nonlinear function to generate an output value for the computational device, means for substituting a second nonlinear function for a first nonlinear function as the predetermined function in said squashing means, said weight vector being related to said first nonlinear function, said second nonlinear function being less computationally complex than said first nonlinear function.
6. The neural network as defined in claim 5 wherein said second nonlinear function is a piecewise linear approximation of said first nonlinear function.
7. The neural network as defined in claim 6 wherein said first nonlinear function is a hyperbolic tangent function.
8. The neural network as defined in claim 7 wherein said second nonlinear function is a piecewise linear function defined as, where x is said output.
9. A method for improving a neural network, said network comprising a plurality of computational devices, each computational device comprising means responsive to a data input value for squashing said data input value according to a predetermined nonlinear function to generate an output value for the computational device, said method comprising the step of: substituting a second nonlinear function for a first nonlinear function as the predetermined function in each said squashing means, said data input value being related to said first nonlinear function, said second nonlinear function being less computationally complex than said first nonlinear function.
10. The method as defined in claim 9 wherein said second nonlinear function is a piecewise linear approximation of said first nonlinear function.
11. The method as defined in claim 10 wherein said first nonlinear function is a hyperbolic tangent function.
12. The method as defined in claim 11 wherein said second nonlinear function is a piecewise linear function defined as, where x is said data input value.
13. The method as defined in claim 9 further including a step of:
training said neural network according to a predetermined training algorithm and having said first nonlinear function in each said computational device.
14. The method as defined in claim 13 wherein said first nonlinear function is differentiable.
15. The method as defined in claim 14 wherein said predetermined training algorithm is back propagation.
16. The method as defined in claim 15 wherein said second nonlinear function is a piecewise linear approximation of said first nonlinear function.
17. The method as defined in claim 16 wherein said first nonlinear function is a hyperbolic tangent function.
18. The method as defined in claim 17 wherein said second nonlinear function is a piecewise linear function defined as, where x is said data input value.
CA002035338A 1990-03-21 1991-01-31 Operational speed improvement for neural networks Expired - Fee Related CA2035338C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US49673690A 1990-03-21 1990-03-21
US496,736 1990-03-21

Publications (1)

Publication Number Publication Date
CA2035338C true CA2035338C (en) 1995-07-25

Family

ID=23973909

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002035338A Expired - Fee Related CA2035338C (en) 1990-03-21 1991-01-31 Operational speed improvement for neural networks

Country Status (2)

Country Link
CA (1) CA2035338C (en)
FR (1) FR2660091B1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066483A (en) * 2019-12-31 2021-07-02 南昌航空大学 A Generative Adversarial Network Speech Enhancement Method Based on Sparse Continuity Constraints

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0349820B1 (en) * 1988-07-05 1995-04-19 Siemens Aktiengesellschaft Network module and architecture for the programmable emulation of digital artificial neural networks

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066483A (en) * 2019-12-31 2021-07-02 南昌航空大学 A Generative Adversarial Network Speech Enhancement Method Based on Sparse Continuity Constraints
CN113066483B (en) * 2019-12-31 2024-01-30 广州航海学院 Sparse continuous constraint-based method for generating countermeasure network voice enhancement

Also Published As

Publication number Publication date
FR2660091A1 (en) 1991-09-27
FR2660091B1 (en) 1996-10-11

Similar Documents

Publication Publication Date Title
US5058179A (en) Hierarchical constrained automatic learning network for character recognition
CA2015740C (en) Hierarchical constrained automatic learning network for character recognition
CN114220154B (en) A method for micro-expression feature extraction and recognition based on deep learning
Spirkovska et al. Coarse-coded higher-order neural networks for PSRI object recognition
Lee Neural network models and their application to handwritten digit recognition
US5271090A (en) Operational speed improvement for neural network
Ewees et al. Image segmentation via multilevel thresholding using hybrid optimization algorithms
CN113221852A (en) Target identification method and device
Yang et al. Underwater image enhancement with image colorfulness measure
US5274744A (en) Neural network for performing a relaxation process
Khotanzad et al. Application of multi-layer perceptron neural networks to vision problems
CA2035338C (en) Operational speed improvement for neural networks
Won Nonlinear correlation filter and morphology neural networks for image pattern and automatic target recognition
Viswanathan et al. Sign language translator for dumb and deaf
US5467427A (en) Memory capacity neural network
Li et al. Cross-domain urban land use classification via scenewise unsupervised multisource domain adaptation with transformer
Sovia et al. Backpropagation Algorithm on Implementation of Signature Recognition
Subramanian et al. Fast convergence GRU model for sign language recognition
Chatterjee et al. ExoSpikeNet: A Light Curve Analysis Based Spiking Neural Network for Exoplanet Detection
Chen et al. Complex scene classification of PoLSAR imagery based on a self-paced learning approach
JPH04311254A (en) Neural network, control method thereof and arithmetic unit for neural network
Austin et al. ADAM neural networks for parallel vision
Striuk et al. Cross-Domain Reconfigurable GAN with Fuzzy Components for Anomaly Detection
He et al. Dishes recognition system based on deep learning
Kiselyov et al. Optical associative memory for high-order correlation patterns

Legal Events

Date Code Title Description
EEER Examination request
MKLA Lapsed