CN107316078B - Apparatus and method for performing artificial neural network self-learning operation - Google Patents

Apparatus and method for performing artificial neural network self-learning operation

Info

Publication number
CN107316078B
CN107316078B (application CN201610267211.0A)
Authority
CN
China
Prior art keywords
neural network
instruction
unit
artificial neural
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610267211.0A
Other languages
Chinese (zh)
Other versions
CN107316078A (en)
Inventor
李震
郭崎
陈云霁
陈天石
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN201910402047.3A priority Critical patent/CN110188870B/en
Priority to CN201610267211.0A priority patent/CN107316078B/en
Publication of CN107316078A publication Critical patent/CN107316078A/en
Application granted granted Critical
Publication of CN107316078B publication Critical patent/CN107316078B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

An apparatus and method for performing artificial neural network self-learning operations includes a controller unit, an interconnection module, a master operation module, and a plurality of slave operation modules. Following a layer-by-layer training scheme, the self-learning pre-training of a multi-layer neural network is completed by pre-training each layer in turn: a layer is iterated over multiple operation passes until its weight update is smaller than a certain threshold, after which pre-training proceeds to the next layer. Within each iteration, the first three stages respectively compute the first-order visible layer intermediate value and the second-order hidden layer intermediate value, and the last stage updates the weights using the intermediate values from the first three stages.

Description

Apparatus and method for performing artificial neural network self-learning operation
Technical Field
The present disclosure relates to artificial neural network technology, and in particular, to an apparatus and method for performing artificial neural network self-learning operations.
Background
Multilayer artificial neural networks are widely used in pattern recognition, image processing, function approximation, optimization computation, and related fields. In recent years they have attracted increasing attention from academia and industry owing to their high recognition accuracy and good parallelizability.
A typical training method for multi-layer artificial neural networks is the back-propagation (BP) algorithm. As a representative supervised-learning method it requires a large number of labeled training samples, yet collecting such samples is expensive. Moreover, during training the error-correction signal shrinks as the number of layers it propagates through grows, training tends to converge to local minima, and convergence is slow. Therefore, pre-training the network parameters with a self-learning algorithm, which converges quickly and needs no labeled training samples, and then fine-tuning the multi-layer network with back-propagation training has become a new research focus. The self-learning operation used for this pre-training is thus particularly important.
One known approach to supporting multi-layer artificial neural network self-learning operations is to use a general-purpose processor, which supports the above algorithm by executing general instructions through a general-purpose register file and general-purpose functional units. One disadvantage of this approach is that the operation performance of a single general-purpose processor is low and cannot meet the performance requirements of typical multi-layer artificial neural network operations; when multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck. In addition, a general-purpose processor must decode the multi-layer artificial neural network pre-training operation into a long sequence of arithmetic and memory-access instructions, so the processor front-end decoding incurs a large power overhead.
Another known approach to supporting multi-layer artificial neural network pre-training is to use a graphics processing unit (GPU), which supports the above algorithm by executing general-purpose SIMD instructions through a general-purpose register file and general-purpose stream processors. Because the GPU is a device dedicated to graphics, image, and scientific computation, it has no special support for multi-layer artificial neural network operations, and a large amount of front-end decoding work is still needed to perform them, bringing substantial extra overhead. In addition, the GPU has only a small on-chip cache, so the model data (weights) of the multi-layer artificial neural network must be carried from off-chip repeatedly; off-chip bandwidth becomes the main performance bottleneck and causes a huge power consumption overhead.
Disclosure of Invention
The present disclosure aims to solve the problems of the prior art described above: pre-training a multi-layer neural network on general-purpose processors (CPU, GPU) requires a long sequence of simple arithmetic and memory-access operations, the front-end decoding power overhead is high, the data-access overhead of a conventional general-purpose processor is large, and the operation performance of a single general-purpose processor is low.
The present disclosure proposes a device for performing an artificial neural network self-learning operation, comprising an instruction storage unit, a controller unit, a data access unit, an interconnection module, a master operation module, and a plurality of slave operation modules, wherein: the instruction storage unit is used for reading in instructions through the data access unit and caching the read instructions; the controller unit is used for reading an instruction from the instruction storage unit, decoding it into control signals that control the behavior of the interconnection module, the master operation module, and the slave operation modules, and then distributing the respective control signals to those modules; the data access unit is used for accessing the external address space and completing the loading and storing of data; the interconnection module, which may be implemented with different topologies, is used for distributing the input vector of the master operation module to the plurality of slave operation modules, combining the calculation results of the slave operation modules, and returning the combined result to the master operation module; the master operation module is used for applying the activation function and Gibbs sampling to the intermediate value returned by the interconnection module and for updating the bias of the activation function; and each slave operation module is used for performing the dot-product operation of the input vector with the corresponding weight matrix, performing the product operation of the corresponding component scalar of the input vector with the corresponding weight matrix, and updating the weight matrix.
According to a specific embodiment of the present disclosure, the master operation module includes an operation unit, a data dependency relationship determination unit, and a storage unit, where the storage unit is configured to cache the input data and output data used by the master operation module in the calculation process, the operation unit is configured to complete the operations of the master operation module, and the data dependency relationship determination unit serves as the port through which the operation unit reads and writes the storage unit and is configured to ensure read-write consistency of the data in the storage unit.
According to a specific embodiment of the present disclosure, the data dependency relationship determination unit is configured to determine whether a dependency exists between the data of a control signal that has not yet been executed and a control signal currently being executed; if not, the set of control signals is allowed to issue immediately, otherwise it must wait until all control signals on which it depends have completed before it is allowed to issue.
According to a specific embodiment of the present disclosure, the data dependency relationship determination unit is further configured to send the read data to the slave computing module through the interconnection module.
According to a specific embodiment of the present disclosure, each slave operation module includes an operation unit, a data dependency relationship determination unit, a first storage unit, a second storage unit, and a third storage unit, wherein the operation unit is configured to receive a control signal sent by the controller unit and perform an arithmetic logic operation; the data dependency relationship judging unit is used for monitoring the read-write operation of the cache unit so as to ensure that consistency conflict does not exist in the read-write operation of the cache unit; the first storage unit is used for caching input vectors and calculation results of the neurons; the second storage unit is used for caching weight data required by the slave operation module in the calculation process; the third storage unit is used for caching weight gradient data required by the corresponding slave operation module in the process of updating the weight.
The present disclosure also provides a method for performing a layer-by-layer self-learning operation of an artificial neural network, the artificial neural network comprising a plurality of neurons arranged in two or more layers, the self-learning pre-training of the artificial neural network employing layer-by-layer training, the pre-training being divided into four stages for each layer:
In the first stage, the input neuron vector v^(0) and the weight vector matrix W are subjected to a dot-product operation to obtain a local induced field; the local induced field undergoes a nonlinear transformation by an activation function, and Gibbs sampling is then applied to obtain the first-order hidden layer intermediate value h^(0);
In the second stage, the transpose W^T of the weight vector matrix and the transpose of the first-order hidden layer intermediate value h^(0) are first subjected to a dot-product operation; the resulting local induced field undergoes the nonlinear transformation of the activation function, and Gibbs sampling is then applied to obtain the first-order visible layer intermediate value v^(1);
In the third stage, the first-order visible layer intermediate value v^(1) and the weight vector matrix W are subjected to a dot-product operation to obtain a local induced field, which undergoes the nonlinear transformation of the activation function to obtain the second-order hidden layer intermediate value h^(1);
In the fourth stage, the weights are updated according to the following formulas:
W ← W + ε ( h^(0) × v^(0) - h^(1) × v^(1) )    (1)
b_h ← b_h + ε ( h^(0) - h^(1) )    (2)
b_v ← b_v + ε ( v^(0) - v^(1) )    (3)
where the vector b_h is the bias added to the dot product of the vector and the weight matrix before the activation function is applied in the first and third stages, the vector b_v is the bias used in the second stage, "×" denotes the cross (outer) product of the vectors, and ε is the learning rate.
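By way of illustration only, the four stages can be written as the following minimal Python/numpy sketch of one pre-training iteration for a single layer. The sigmoid activation and the Bernoulli-style Gibbs sampling are assumptions (the disclosure only specifies an activation function and Gibbs sampling in general); the symbols follow the notation used above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample(p, rng):
    # Assumption: binary units sampled with probability p (the disclosure
    # only states that Gibbs sampling follows the activation function).
    return (rng.random(p.shape) < p).astype(p.dtype)

def pretrain_step(v0, W, b_h, b_v, lr, rng):
    """One four-stage self-learning iteration for a single layer.

    v0  : input neuron vector
    W   : weight matrix, one weight vector per hidden/output component
    b_h : bias added before the activation in stages 1 and 3
    b_v : bias added before the activation in stage 2
    lr  : learning rate epsilon
    """
    # Stage 1: dot product, activation, Gibbs sampling -> h0
    h0 = gibbs_sample(sigmoid(W @ v0 + b_h), rng)
    # Stage 2: transpose of W with h0, activation, Gibbs sampling -> v1
    v1 = gibbs_sample(sigmoid(W.T @ h0 + b_v), rng)
    # Stage 3: like stage 1 with v1 as input, but no Gibbs sampling -> h1
    h1 = sigmoid(W @ v1 + b_h)
    # Stage 4: formulas (1)-(3); "x" is the outer product of the two vectors
    W = W + lr * (np.outer(h0, v0) - np.outer(h1, v1))
    b_h = b_h + lr * (h0 - h1)
    b_v = b_v + lr * (v0 - v1)
    return W, b_h, b_v
```

Here W is stored with one weight vector per output (hidden) component, which matches the per-slave partition of the weight matrix described later.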
Compared with the prior art, the present disclosure optimizes the multi-layer neural network pre-training instructions: the processor can complete pre-training learning of one layer of the neural network with only one instruction, which reduces the front-end instruction decoding overhead of a general-purpose processor. At the same time, with a master operation module, multiple slave operation modules, and a large amount of distributed on-chip storage to relieve memory-access overhead, the device can execute the neural network pre-training operation in parallel without frequent off-chip data access. In summary, the performance-to-power ratio of the present disclosure is much higher than that of a general-purpose processor.
The present disclosure may be applied in the following (including but not limited to) scenarios: the system comprises various electronic products such as a data processing device, a robot, a computer, a printer, a scanner, a telephone, a tablet computer, an intelligent terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device and a wearable device; various vehicles such as airplanes, ships, vehicles, and the like; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, range hoods and the like; and various medical devices including nuclear magnetic resonance apparatuses, B-ultrasonic apparatuses, electrocardiographs and the like.
Drawings
For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 illustrates an example block diagram of the overall structure of an apparatus for performing artificial neural network self-learning pre-training in accordance with an embodiment of the present disclosure.
FIG. 2 schematically illustrates an H-tree structured implementation of interconnect modules in an apparatus for performing artificial neural network self-learning pre-training in accordance with an embodiment of the present disclosure.
FIG. 3 illustrates an example block diagram of a structure of a main operation module in an apparatus for performing artificial neural network self-learning pre-training in accordance with an embodiment of the present disclosure.
FIG. 4 illustrates an example block diagram of a slave operational module structure in an apparatus for performing artificial neural network self-learning pre-training in accordance with an embodiment of the present disclosure.
FIG. 5 illustrates an example block diagram of the first and third stages of a neural network self-learning pre-training process in accordance with an embodiment of this disclosure.
FIG. 6 illustrates an example block diagram of a second stage of a neural network self-learning pre-training process in accordance with an embodiment of this disclosure.
FIG. 7 illustrates an example flow diagram of a fourth stage of a neural network self-learning pre-training process in accordance with an embodiment of the present disclosure.
FIG. 8 illustrates an example flow diagram of a single-layer neural network self-learning pre-training iteration in accordance with an embodiment of the present disclosure.
Like devices, components, units, etc. are designated with like reference numerals throughout the drawings.
Detailed Description
Other aspects, advantages, and salient features of the disclosure will become apparent to those skilled in the art from the following detailed description, which, taken in conjunction with the annexed drawings, discloses exemplary embodiments of the disclosure.
In the present disclosure, the terms "include" and "comprise," as well as derivatives thereof, mean inclusion without limitation; the term "or" is inclusive, meaning and/or.
In this specification, the various embodiments described below which are used to describe the principles of the present disclosure are by way of illustration only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present disclosure as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for similar functions and operations.
The self-learning pre-training of a multilayer artificial neural network according to an embodiment of the present disclosure applies to an artificial neural network comprising a plurality of neurons arranged in two or more layers. The self-learning pre-training of the artificial neural network adopts layer-by-layer training, proceeding from the first layer to the last layer. For each layer, the pre-training is divided into four stages:
In the first stage, the input neuron vector v^(0) and the weight vector matrix W are first subjected to a dot-product operation to obtain a local induced field; the local induced field undergoes the nonlinear transformation of an activation function, and Gibbs sampling is then applied to obtain the first-order hidden layer intermediate value h^(0);
In the second stage, the transpose W^T of the weight vector matrix and the transpose of the first-order hidden layer intermediate value h^(0) are first subjected to a dot-product operation; the resulting local induced field undergoes the nonlinear transformation of the activation function, and Gibbs sampling is then applied to obtain the first-order visible layer intermediate value v^(1);
The third stage is similar to the first stage, except that its input is the first-order visible layer intermediate value v^(1) and no Gibbs sampling is applied when computing the second-order hidden layer intermediate value h^(1);
In the fourth stage, the weights are updated according to the following formulas:
W ← W + ε ( h^(0) × v^(0) - h^(1) × v^(1) )    (1)
b_h ← b_h + ε ( h^(0) - h^(1) )    (2)
b_v ← b_v + ε ( v^(0) - v^(1) )    (3)
where the vector b_h is the bias added to the dot product of the vector and the weight matrix before the activation function is applied in the first and third stages, the vector b_v is the bias used in the second stage, "×" denotes the cross (outer) product of the vectors, and ε is the learning rate.
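A minimal sketch of the layer-by-layer scheme described above, reusing pretrain_step and sigmoid from the earlier sketch; the random sample selection and the convergence test on the maximum absolute weight update are assumptions, since the disclosure only requires that the weight update fall below a certain threshold.

```python
import numpy as np

def pretrain_layers(samples, layer_sizes, lr=0.01, tol=1e-4, max_iters=100000, seed=0):
    """Greedy layer-by-layer pre-training: each layer is iterated until its
    weight update falls below the threshold, then its hidden activations
    become the next layer's input."""
    rng = np.random.default_rng(seed)
    params = []
    data = np.asarray(samples, dtype=float)
    for n_visible, n_hidden in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = 0.01 * rng.standard_normal((n_hidden, n_visible))
        b_h, b_v = np.zeros(n_hidden), np.zeros(n_visible)
        for _ in range(max_iters):
            v0 = data[rng.integers(len(data))]
            W_new, b_h, b_v = pretrain_step(v0, W, b_h, b_v, lr, rng)
            done = np.abs(W_new - W).max() < tol   # weight update below threshold
            W = W_new
            if done:
                break
        params.append((W, b_h, b_v))
        data = sigmoid(data @ W.T + b_h)           # input for the next layer
    return params
```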
FIG. 1 illustrates an example block diagram of the overall structure of an apparatus for performing artificial neural network self-learning pre-training in accordance with this disclosure. As shown in fig. 1, the apparatus includes an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a master operation module 5, and a plurality of slave operation modules 6. The instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnect module 4, the master operation module 5 and the slave operation module 6 may all be implemented by hardware circuits (e.g., application specific integrated circuits ASIC).
The instruction storage unit 1 reads in instructions through the data access unit 3 and buffers the read instructions.
The controller unit 2 reads the instruction from the instruction storage unit 1, translates the instruction into a control signal for controlling the behavior of other modules, and sends the control signal to other modules such as the data access unit 3, the master operation module 5, the slave operation module 6, and the like.
The data access unit 3 can access and store an external address space, and directly read and write data to each cache unit in the device to finish the loading and storage of the data.
FIG. 2 schematically shows the structure of the interconnection module 4. The interconnection module 4 constitutes the data path between the master operation module 5 and the plurality of slave operation modules 6 and may take different structures. In this embodiment the interconnection is a binary tree path formed by a plurality of nodes: each node sends upstream data identically to its two downstream nodes, and merges the data returned by the two downstream nodes before returning it to its upstream node. For example, in the first and third stages of the neural network self-learning operation, the input vector in the master operation module 5 is sent to every slave operation module 6 through the interconnection module 4; after each slave operation module 6 finishes its calculation, the neuron values output by the slave operation modules are spliced, stage by stage in the interconnection module, into a complete vector of local induced fields, which is returned to the master operation module 5 as the intermediate result vector for the activation function and, as required, Gibbs sampling. In the second stage, the first-order hidden layer intermediate value vector h^(0) in the master operation module 5 is sent to each slave operation module 6 through the interconnection module 4; after each slave operation module 6 finishes its calculation, the vectors returned by the two downstream nodes are added into one vector at the current node and returned to the upstream node, and the resulting vector is returned to the master operation module 5 as the intermediate result vector for the activation function and Gibbs sampling.
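A minimal sketch of the two combine modes of such a binary-tree interconnection; the function name and list-based interface are illustrative assumptions, while the pairwise, level-by-level behaviour follows the description above.

```python
import numpy as np

def htree_combine(partials, mode):
    """Combine per-slave results level by level, as the binary-tree nodes do:
    'concat' splices local induced field components (stages 1 and 3),
    'sum' adds partial sums pairwise (stage 2)."""
    level = [np.atleast_1d(p) for p in partials]
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 1:
                merged.append(pair[0])               # odd node passes through
            elif mode == "concat":
                merged.append(np.concatenate(pair))  # splice left + right
            else:
                merged.append(pair[0] + pair[1])     # add the two partial sums
        level = merged
    return level[0]
```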
Fig. 3 shows an example block diagram of the structure of the main operation module 5 in an apparatus for performing an artificial neural network pre-training operation according to the present disclosure. As shown in fig. 3, the main operation block 5 includes an operation unit 51, a data dependency relationship judgment unit 52, and a storage unit 53.
The storage unit 53 is used for caching the input data and output data used by the master operation module 5 in the calculation process; the operation unit 51 performs the various operation functions of the master operation module 5; and the data dependency relationship determination unit 52 is the port through which the operation unit 51 reads and writes the storage unit 53, ensuring read-write consistency of the data in the storage unit. Specifically, the data dependency relationship determination unit 52 determines whether a dependency exists between the data of a control signal that has not yet been executed and a control signal currently being executed; if not, the control signal is allowed to issue immediately, otherwise it must wait until all control signals on which it depends have completed before it is allowed to issue. For example, all control signals sent to the data dependency unit 52 are stored in an instruction queue inside the data dependency unit 52; in this queue, if the read data range of a read instruction conflicts with the write data range of a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has been executed. Meanwhile, the data dependency relationship determination unit 52 is also responsible for sending read data to the slave operation modules through the interconnection module 4, and the output data of the slave operation modules 6 is sent directly to the operation unit 51 through the interconnection module 4. The instruction output by the controller unit 2 is sent to the operation unit 51 and the data dependency relationship determination unit 52 to control their behavior.
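The read-after-write check described above can be modelled with the following illustrative sketch; the class name, the queue interface, and the address-range representation are assumptions and merely stand in for the hardware's control-signal queue.

```python
from collections import deque

class DependencyQueue:
    """Control signals enter in order; a read may issue only after every
    earlier write whose data range overlaps the read range has retired."""

    def __init__(self):
        self.pending = deque()               # (op, start, end), oldest first

    def can_issue(self, op, start, end):
        if op == "read":
            for prev_op, s, e in self.pending:
                if prev_op == "write" and start <= e and s <= end:
                    return False             # overlaps an earlier write: wait
        return True

    def issue(self, op, start, end):
        if self.can_issue(op, start, end):
            self.pending.append((op, start, end))
            return True
        return False                         # caller retries after retire()

    def retire(self):
        if self.pending:
            self.pending.popleft()           # oldest control signal completes
```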
Fig. 4 shows an example block diagram of the structure of the slave operational module 6 in an apparatus for performing artificial neural network pre-training according to the present disclosure. As shown in fig. 4, each slave operation module 6 includes an operation unit 61, a data dependency relationship judgment unit 62, a first storage unit 63, a second storage unit 64, and a third storage unit 65.
The arithmetic unit 61 receives the control signal from the controller unit 2 and performs arithmetic logic operation.
The data dependency relationship determination unit 62 is responsible for reading and writing operations on the cache unit in the calculation process. The data dependency judgment unit 62 ensures that there is no consistency conflict for the reading and writing of the cache unit. For example, all control signals to the data dependency unit 62 are stored in an instruction queue within the data dependency unit 62, in which queue a read data range of a read instruction must wait until the dependent write instruction is executed if it conflicts with a write data range of a write instruction located earlier in the queue.
The first storage unit 63 buffers, during the various stages, the input neuron vector v^(0), the first-order hidden layer intermediate value h^(0), the first-order visible layer intermediate value v^(1), and the second-order hidden layer intermediate value h^(1), as well as the dot-product results of the input vector and the weight matrix computed in each stage.
The second storage unit 64 buffers the weight data required by the slave operation module 6 in the calculation process. For each slave, only the column of the weight matrix corresponding to the scalar data stored by the slave 6 is stored.
The third storage unit 65 buffers weight gradient data required by the corresponding slave operation module in the process of updating the weights. Each weight gradient data stored in the slave operation module 6 corresponds to the weight data stored therein.
In the artificial neural network self-learning pre-training process, the slave operation modules 6 implement the parallelizable first half of each of the first three stages, as well as the weight update of formula (1) in the last stage.
Taking the pre-training of a Deep Belief Network (DBN), an artificial neural network, as an example, in the first three stages the operation between the weight matrix W (or its transpose W^T) and the input neuron vector can be divided into uncorrelated, parallel computing subtasks. In the first and third stages, the slave operation modules 6 each perform dot-product operations between the same input vector and the weights corresponding to a different component of the output vector, obtaining the partial sums of their respective output components; after several accumulations each slave operation module obtains the local induced field of its own output component, so each slave operation module 6 only needs to compute the local induced field of the output neuron value assigned to it. The different local induced field components are spliced step by step in the interconnection module 4 into a complete local induced field vector, which is transmitted to the master operation module for the activation function and subsequent sampling. In the second stage, each slave operation module 6 computes only the product of the corresponding partial scalar of the input first-order hidden layer intermediate value vector h^(0) with the corresponding column of the weight matrix W; each resulting output vector is a partial sum, to be accumulated, of the final result, and these partial sums are added pairwise, step by step, in the interconnection module 4 to obtain the final result. That is, each slave operation module 6 computes partial sums of the local induced field of the output first-order visible layer vector, and all the partial sums are summed in the interconnection module 4 to obtain the final local induced field. The intermediate values computed in the first three stages are used for updating the weights, and the master operation module 5 performs subsequent operations on the outputs of the first three stages to obtain the weight update values. In the last stage, each slave operation module 6 updates the weights according to formula (1), which can likewise be divided into three small steps:
1. each slave operation module 6 computes the product of the corresponding partial scalar of the input first-order hidden layer intermediate value vector h^(0) with the input neuron vector v^(0), as an intermediate value;
2. each slave operation module 6 computes the product of the corresponding partial scalar of the second-order hidden layer intermediate value vector h^(1) with the first-order visible layer intermediate value vector v^(1), and computes the vector difference between this product and the intermediate value of the first small step;
3. each slave operation module 6 multiplies the difference of the second small step by the learning rate to obtain the weight update value, and then performs vector subtraction between the weight W it stores and the weight update value to obtain the updated weight.
Note that the three small steps described above are only one example of how the slave operation modules 6 update the weights, and the details may be fine-tuned: for example, the product computed in the first small step and the product computed in the second small step may be interchanged, or the multiplication by the learning rate in the third small step may be moved forward into the second small step or even split across the first two small steps. A minimal sketch of this per-slave update is given below.
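The sketch below is illustrative only: W_i stands for the weight slice held by one slave operation module 6 (paired with output component i), and the scalar/vector argument names are assumptions.

```python
def slave_update_slice(W_i, v0, v1, h0_i, h1_i, lr):
    """Three small steps executed inside one slave operation module on the
    weight slice W_i it stores (paired with output component i)."""
    step1 = h0_i * v0                 # step 1: product, cached as intermediate
    diff = h1_i * v1 - step1          # step 2: product and vector difference
    # step 3: scale by the learning rate and subtract from the stored weights;
    # this equals W_i + lr*(h0_i*v0 - h1_i*v1), i.e. row i of formula (1).
    return W_i - lr * diff
```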
According to an embodiment of the present disclosure, there is also provided an instruction set for performing an artificial neural network forward operation on the aforementioned apparatus. The instruction set comprises a CONFIG instruction, a COMPUTE instruction, an IO instruction, a NOP instruction, a JUMP instruction and a MOVE instruction, wherein:
configuring various constants required by calculation of a current layer by the CONFIG instruction before calculation of each layer of artificial neural network is started;
the COMPUTE instruction completes the arithmetic logic calculation of each layer of artificial neural network;
the IO instruction reads input data required by calculation from an external address space and stores the data back to the external space after the calculation is finished;
the NOP instruction is responsible for flushing the control signals currently loaded in all control-signal cache queues inside the device, ensuring that all instructions before the NOP instruction have completed; the NOP instruction itself does not contain any operation;
the JUMP instruction is responsible for the JUMP of the next instruction address to be read from the instruction storage unit by the controller and is used for realizing the JUMP of a control flow;
the MOVE instruction is responsible for carrying data at one address in the internal address space of the device to another address in the internal address space of the device, and the process is independent of the arithmetic unit and does not occupy the resources of the arithmetic unit in the execution process.
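By way of orientation only, a hypothetical control stream for pre-training one layer might look as follows; the mnemonics are the six instruction types listed above, while the operand descriptions are illustrative placeholders that mirror steps S1-S11 described later.

```python
# Hypothetical single-layer pre-training program; operands are placeholders.
LAYER_PROGRAM = [
    ("IO",      "load the layer's instructions from the external address space"),
    ("IO",      "load input neuron vector, activation table, learning rate, biases"),
    ("IO",      "load the weight matrix into the slave operation modules"),
    ("CONFIG",  "constants for stage 1"), ("COMPUTE", "stage 1 -> h(0)"),
    ("CONFIG",  "constants for stage 2"), ("COMPUTE", "stage 2 -> v(1)"),
    ("CONFIG",  "constants for stage 3"), ("COMPUTE", "stage 3 -> h(1)"),
    ("COMPUTE", "stage 4: update weights and biases"),
    ("IO",      "store the updated weights back to the external address space"),
]
```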
FIG. 5 illustrates an example block diagram of the first and third stages of a neural network self-learning pre-training process in accordance with an embodiment of this disclosure. In the different slave operation modules 6, the input vector broadcast by the interconnection module 4 undergoes a dot-product operation with the weight vector of each slave operation module 6 to obtain the partial sum of the local induced field of the corresponding output neuron value; all the output local induced field values form an intermediate result vector, which after the addition of the bias vector and the activation operation yields the final output neuron vector of this layer of the neural network. The formula is described as out = f(w*in + b), where out is the output vector, in is the input vector, b is the bias vector, w is the weight matrix, and f is the activation function. The weight vector of each slave operation module 6 is the column vector of the weight matrix corresponding to that slave operation module 6. The interconnection module 4 sends the input vector [I0, …, In] to all slave operation units, where it is temporarily stored in the first storage unit. The i-th slave operation unit computes the dot product of its corresponding weight vector [Wi0, …, Win] with the input vector. The results output by the slave operation units are spliced through the interconnection module 4 into a complete local induced field vector and returned to the master operation module 5, where the activation function operation and, if required, Gibbs sampling are performed to obtain the final output vector [O0, O1, …, On].
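A sketch of this first/third-stage decomposition across the slave operation modules, reusing sigmoid, gibbs_sample, and htree_combine from the earlier sketches; the per-slave Python loop merely stands in for the parallel hardware.

```python
import numpy as np

def stage_1_or_3(W, x, b, rng=None):
    """out = f(w*in + b): each slave i computes the dot product of its weight
    vector W[i, :] with the broadcast input; the interconnection splices the
    components; the master adds the bias, activates, and samples if required."""
    partials = [np.dot(W[i, :], x) for i in range(W.shape[0])]   # one per slave
    field = htree_combine(partials, mode="concat") + b           # splice + bias
    out = sigmoid(field)
    return gibbs_sample(out, rng) if rng is not None else out    # stage 3: no sampling
```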
FIG. 6 illustrates an example block diagram of the second stage of a neural network self-learning pre-training process in accordance with an embodiment of this disclosure. To compute the output first-order visible layer vector v^(1), the interconnection module 4 broadcasts the first-order hidden layer intermediate value vector h^(0); each slave operation module 6 multiplies its corresponding partial scalar h0_i of h^(0) with the corresponding column [Wi0, …, Win] of the weight matrix W, and each resulting output vector is a partial sum, to be accumulated, of the local induced field of the first-order visible layer vector. These partial sums are added pairwise, step by step, in the interconnection module 4 to obtain the final local induced field, which is returned to the master operation module 5; there the activation function operation and, if required, Gibbs sampling are performed to obtain the final output first-order visible layer vector v^(1).
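The corresponding second-stage sketch under the same assumptions and helpers as above: each slave operation module contributes a scalar-times-weight-vector partial sum, and the interconnection adds the partials pairwise.

```python
import numpy as np

def stage_2(W, h0, b_v, rng):
    """Each slave i multiplies its scalar h0[i] by its stored weight vector
    W[i, :]; the interconnection sums the partials pairwise; the master adds
    the bias, activates, and Gibbs-samples. Equivalent to f(W.T @ h0 + b_v)."""
    partials = [h0[i] * W[i, :] for i in range(W.shape[0])]   # one per slave
    field = htree_combine(partials, mode="sum") + b_v
    return gibbs_sample(sigmoid(field), rng)
```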
FIG. 7 shows a flowchart of the fourth stage of a neural network self-learning pre-training process in accordance with an embodiment of the present disclosure. In this last stage, the slave operation modules 6 update the weights according to formula (1) in three small steps:
1. each slave operation module 6 computes the product of the corresponding partial scalar of the input first-order hidden layer intermediate value vector h^(0) with the input neuron vector v^(0) and caches the intermediate value in the third storage unit shown in FIG. 4; this small step is similar to the second-stage block diagram shown in FIG. 6, except that its inputs are the first-order hidden layer intermediate value vector h^(0) and the input neuron vector v^(0);
2. each slave operation module 6 computes the product of the corresponding partial scalar of the second-order hidden layer intermediate value vector h^(1) with the first-order visible layer intermediate value vector v^(1), computes the vector difference with the intermediate value of the first small step, and caches the result in the third storage unit shown in FIG. 4;
3. each slave operation module 6 multiplies the difference of the second small step by the learning rate to obtain the weight update value, and then performs vector subtraction between the weight W it stores and the weight update value to obtain the updated weight.
Note that the three small steps described above are only one example of how the slave operation modules 6 update the weights, and the details may be fine-tuned: for example, the product computed in the first small step and the product computed in the second small step may be interchanged, or the multiplication by the learning rate in the third small step may be moved forward into the second small step or even split across the first two small steps.
FIG. 8 illustrates a flow diagram of a single-layer artificial neural network self-learning pre-training operation according to an embodiment. Since the multi-layer artificial neural network self-learning pre-training is implemented layer by layer, this flow may be invoked multiple times for multi-layer pre-training. The flowchart describes the process of implementing a single-layer neural network self-learning pre-training operation of the type shown in FIG. 4 using the apparatus and instruction set of the present disclosure.
In step S1, an IO instruction is pre-stored at the first address of instruction cache unit 1.
In step S2, the operation starts, the controller unit 2 reads the IO instruction from the first address of the instruction cache unit 1, and according to the translated control signal, the data access unit 3 reads all corresponding artificial neural network operation instructions from the external address space and caches them in the instruction storage unit 1.
In step S3, the controller unit 2 reads in the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit 3 reads all the data required by the master operation module 5 (e.g., the input neuron vector v^(0), the activation function interpolation table, the learning rate, the biases, and so on) from the external address space into the storage unit 53 of the master operation module 5.
In step S4, the controller unit 2 then reads in the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit 3 reads the weight matrix data required by the slave operation modules 6 from the external address space.
At step S5, the controller unit 2 then reads in the next CONFIG instruction from the instruction storage unit, and based on the translated control signal, the device configures the various constants required for the first stage calculation of the layer neural network. For example, the arithmetic units 51, 61 configure the values of the unit internal registers according to parameters in the control signals, such as the precision setting of the calculation of the layer, the data of the activation function.
In step S6, the controller unit 2 then reads in the next COMPUTE instruction from the instruction storage unit and, based on the decoded control signal, starts the first-stage calculation. The master operation module 5 first sends the input neuron vector v^(0) through the interconnection module 4 to each slave operation module 6, where it is stored in the first storage unit 63 of the slave operation module 6. The operation unit 61 of the slave operation module 6 reads its weight vector (the column vector of the weight matrix corresponding to that slave operation module 6) from the second storage unit 64 and the input neuron vector v^(0) from the first storage unit, completes the dot-product operation of the weight vector and the input neuron vector v^(0), and returns the intermediate result through the interconnection module. In the interconnection module 4, the intermediate results returned by the slave operation modules 6 are spliced step by step into a complete local induced field vector. The master operation module 5 obtains the value returned by the interconnection module 4, reads the bias vector from the storage unit 53 according to the control signal decoded from the COMPUTE instruction, adds it to the vector returned by the interconnection module 4, activates the sum, performs Gibbs sampling, and writes the final first-order hidden layer intermediate value vector h^(0) back to the storage unit 53.
The controller unit 2 then reads in the next CONFIG instruction from the instruction storage unit at step S7, and based on the translated control signal, the device configures the various constants required for the second stage calculation of the layer neural network.
In step S8, the controller unit 2 then reads in the next COMPUTE instruction from the instruction storage unit and, based on the decoded control signal, starts the second-stage calculation. The master operation module 5 first sends the first-order hidden layer intermediate value vector h^(0) through the interconnection module 4 to each slave operation module 6, where it is stored in the first storage unit 63 of the slave operation module 6. The operation unit 61 of the slave operation module 6 reads its weight vector (the column vector of the weight matrix corresponding to that slave operation module 6) from the second storage unit 64, selects the corresponding scalar of the first-order hidden layer vector h^(0) from the first storage unit, completes the product operation of the weight vector and the corresponding scalar of h^(0), and returns the intermediate result through the interconnection module. In the interconnection module 4, the intermediate results returned by the slave operation modules 6 are added step by step into a complete local induced field vector. The master operation module 5 obtains the value returned by the interconnection module 4, reads the bias vector from the storage unit 53 according to the control signal decoded from the COMPUTE instruction, adds it to the vector returned by the interconnection module 4, activates the sum, performs Gibbs sampling, and writes the final first-order visible layer intermediate value vector v^(1) back to the storage unit 53.
At step S9, the controller unit 2 then reads in the next CONFIG instruction from the instruction storage unit, and based on the translated control signal, the device configures the various constants required for the third stage calculation of the layer of neural network. The configuration of the layer is basically the same as that of the first stage, but one more learning rate parameter is required to be configured.
In step S10, the controller unit 2 then reads in the next COMPUTE instruction from the instruction storage unit and, based on the decoded control signal, starts the third-stage calculation. The master operation module 5 first sends the first-order visible layer intermediate value vector v^(1) through the interconnection module 4 to each slave operation module 6, where it is stored in the first storage unit 63 of the slave operation module 6. The operation unit 61 of the slave operation module 6 reads the first-order visible layer vector v^(1) from the first storage unit, completes the dot-product operation of its weight vector and the first-order visible layer vector v^(1), and returns the intermediate result through the interconnection module. In the interconnection module 4, the intermediate results returned by the slave operation modules 6 are spliced step by step into a complete local induced field vector. The master operation module 5 obtains the value returned by the interconnection module 4, reads the bias vector from the storage unit 53 according to the control signal decoded from the COMPUTE instruction, adds it to the vector returned by the interconnection module 4, activates the sum, and writes the final second-order hidden layer intermediate value vector h^(1) back to the storage unit 53.
In step S11, the controller unit 2 then reads in the next COMPUTE instruction from the instruction storage unit and, based on the decoded control signal, starts the fourth-stage calculation. In the first small step, the master operation module 5 first sends the input neuron vector v^(0) and the first-order hidden layer vector h^(0) through the interconnection module 4 to each slave operation module 6, which computes the product of its corresponding scalar of h^(0) with v^(0) and stores the intermediate value in the weight gradient cache unit 65 of the slave operation module 6. In the second small step, the operation unit 61 of the slave operation module 6 reads the second-order hidden layer vector h^(1) from the first storage unit and selects the corresponding component of the first-order visible layer vector v^(1), completes the product of the corresponding scalar of h^(1) and v^(1), performs vector subtraction between this intermediate result and the intermediate value of the previous small step read from the weight gradient cache unit 65, and caches the computed intermediate result in the weight gradient cache unit 65. In the last small step, the operation unit 61 of the slave operation module 6 reads from the weight gradient cache unit 65 the weight update value obtained by multiplying the intermediate value of the previous small step by the learning rate, reads the corresponding weight from the weight cache unit 64, performs vector subtraction to obtain the updated weight, and caches the updated weight back into the weight cache unit 64. This completes one self-learning pre-training iteration of the single-layer neural network; after multiple iterations, once the weights reach a given convergence criterion (the weight update value is smaller than a certain threshold), pre-training of this layer is finished and pre-training of the next layer of the neural network can begin.
By adopting the device and the instruction set for executing the artificial neural network self-learning pre-training operation, the problems of insufficient operation performance of a CPU and a GPU and high front-end decoding overhead are solved. The support for the forward operation of the multilayer artificial neural network is effectively improved.
By adopting the special on-chip cache for the forward operation of the multilayer artificial neural network, the reusability of input neurons and weight data is fully mined, the data are prevented from being read to the memory repeatedly, the memory access bandwidth is reduced, and the problem that the memory bandwidth becomes the bottleneck of the forward operation performance of the multilayer artificial neural network is avoided.
Each function/unit/module/submodule in the present disclosure may be hardware, for example, the hardware may be a circuit including a digital circuit, an analog circuit, and the like. Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like. The computing module in the computing device may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like. The memory unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, etc.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions.
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer readable medium), or a combination of both. Although the processes or methods are described above in terms of some sequential operations, it should be understood that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the present disclosure have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (20)

1. An apparatus for performing artificial neural network self-learning operations, comprising a controller unit, an interconnection module, a master operation module, and a plurality of slave operation modules, wherein:
the controller unit is used for reading an instruction, decoding the instruction into control signals for controlling the behaviors of the interconnection module, the main operation module and the slave operation module, and then distributing the respective control signals to the modules;
the interconnection module has different topology realization and is used for distributing the input vector of the master operation module to the plurality of slave operation modules and combining the calculation results of the slave operation modules and returning the combined calculation results to the master operation module;
the main operation module comprises: the activation function arithmetic unit is used for carrying out activation function arithmetic on the intermediate value returned by the interconnection module; the sampling arithmetic unit is used for carrying out Gibbs sampling on the operation result of the activation function; the adder is used for updating the offset of the sampling result;
the slave operation module is used for performing dot product operation on the input vector and the corresponding weight matrix, performing product operation on a corresponding component scalar in the input vector and the corresponding weight matrix, and updating the weight matrix;
the artificial neural network comprises a plurality of neurons with two or more layers, the self-learning pre-training of the artificial neural network adopts layer-by-layer training, and for each layer of neurons, the pre-training of the artificial neural network comprises the following steps:
in the first stage, in the slave operation modules, the input neuron vector v^(0) broadcast by the interconnection module and the weight vector matrix W are subjected to a dot-product operation to obtain a local induced field, and after the local induced field undergoes the nonlinear transformation of an activation function, Gibbs sampling is applied to obtain the first-order hidden layer intermediate value h^(0).
2. The apparatus for performing artificial neural network self-learning operations of claim 1, further comprising:
the instruction storage unit is used for reading in the instructions through the data access unit and caching the read instructions;
and the data access unit is used for accessing the external address space and finishing the loading and the storing of the data.
3. The apparatus for performing artificial neural network self-learning operations of claim 1, wherein the instruction comprises a COMPUTE instruction.
4. The apparatus for performing artificial neural network self-learning operations of claim 1, wherein the instructions further comprise:
the CONFIG instruction is used for configuring various constants required by calculation of a current layer before calculation of each layer of artificial neural network starts;
a COMPUTE instruction for completing arithmetic logic calculation of each layer of artificial neural network;
the IO instruction is used for reading input data required by calculation from the external address space and storing the data back to the external space after the calculation is finished;
the NOP instruction is used for flushing the control signals currently loaded in all control-signal cache queues inside the device, ensuring that all instructions before the NOP instruction have completed; the NOP instruction itself does not contain any operation;
a JUMP instruction for the controller to JUMP to a next instruction address to be read from the instruction storage unit to realize a JUMP of a control flow;
the MOVE instruction is used for transporting data of a certain address in the internal address space of the device to another address in the internal address space of the device, is independent of the arithmetic unit, and does not occupy the resources of the arithmetic unit in the execution process.
5. The apparatus for performing artificial neural network self-learning operation according to claim 1, wherein the main operation module includes an operation unit, a data dependency judgment unit, and a storage unit, wherein,
the storage unit is used for caching input data and output data used by the main operation module in the calculation process,
the operation unit is used for completing the operation of the main operation module;
the data dependency relationship judging unit is the port through which the operation unit reads and writes the storage unit, and is used for ensuring read-write consistency of the data in the storage unit.
6. The apparatus of claim 5, wherein the data dependency relationship determining unit is configured to determine whether a dependency exists between the data of a control signal that has not yet been executed and a control signal that is currently being executed; if no dependency exists, the control signal is allowed to issue immediately; otherwise, the control signal is allowed to issue only after all control signals it depends on have completed.
7. The apparatus for performing artificial neural network self-learning operation according to claim 6, wherein the data dependency judgment unit is further configured to send the read data to the slave operation modules through the interconnection module.
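Claims 6 and 7 describe what amounts to a scoreboard-style hazard check before a control signal is allowed to issue. A minimal software analogue is sketched below; modelling each control signal as a pair of read/write address sets is an assumption made for illustration only.

    def has_dependency(pending, in_flight):
        """True if the pending signal's data overlaps with any signal still executing."""
        reads, writes = pending
        for r, w in in_flight:
            # read-after-write, write-after-read and write-after-write hazards
            if (reads & w) or (writes & r) or (writes & w):
                return True
        return False

    def try_issue(pending, in_flight):
        """Issue immediately when no dependency exists; otherwise the caller must retry later."""
        if has_dependency(pending, in_flight):
            return False
        in_flight.append(pending)
        return True

For example, try_issue((set(), {0x40}), [({0x40}, set())]) returns False, because the pending write to address 0x40 must wait for the in-flight read of the same address to finish.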
8. The apparatus for performing artificial neural network self-learning operation according to claim 1, wherein each of the slave operation modules includes an operation unit, a data dependency judgment unit, a first storage unit, a second storage unit, and a third storage unit, wherein,
the arithmetic unit is used for receiving the control signal sent by the controller unit and carrying out arithmetic logic operation;
the data dependency relationship judging unit is used for monitoring the read-write operation of the storage unit so as to ensure that consistency conflict does not exist in the read-write operation of the storage unit;
the first storage unit is used for caching input vectors and calculation results of the neurons;
the second storage unit is used for caching weight data required by the slave operation module in the calculation process;
the third storage unit is used for caching weight gradient data required by the corresponding slave operation module in the process of updating the weight.
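The three storage units of each slave operation module can be pictured as three independent scratchpad buffers. The small model below only illustrates that partitioning; the buffer sizes and the NumPy representation are assumptions, not part of the claim.

    from dataclasses import dataclass, field
    import numpy as np

    @dataclass
    class SlaveScratchpad:
        # first storage unit: input vectors and this slave's neuron results
        neuron_buf: np.ndarray = field(default_factory=lambda: np.zeros(256))
        # second storage unit: the weight data this slave needs during computation
        weight_buf: np.ndarray = field(default_factory=lambda: np.zeros((64, 256)))
        # third storage unit: weight-gradient data used when this slave updates its weights
        wgrad_buf: np.ndarray = field(default_factory=lambda: np.zeros((64, 256)))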
9. The apparatus for performing artificial neural network self-learning operations of claim 1, wherein the pre-training of the artificial neural network for each layer of neurons further comprises:
in the second stage, the slave operation modules first carry out a dot product operation between the transpose w^T of the weight vector matrix and the transpose h_1^T of the first-order hidden layer intermediate value; after the local induced field in the main operation module undergoes the nonlinear transformation of the activation function, Gibbs sampling is applied to obtain the first-order visible layer intermediate value v_1;
in the third stage, the slave operation modules carry out a dot product operation between the received first-order visible layer intermediate value v_1 and the weight vector matrix w to obtain the local induced field, output the local induced field to the main operation module, and after the nonlinear transformation of the activation function the second hidden layer intermediate value h_2 is obtained;
in the fourth stage, the slave operation modules update the weights according to the following formulas:
w = w + ε(h_1 × v_0 - h_2 × v_1)
b = b + ε(h_1 - h_2)
c = c + ε(v_0 - v_1)
where the vector b is the bias added to the dot product of the vector and the weight matrix before the activation function in the first and third stages, the vector c is the bias in the second stage, "×" denotes the cross multiplication (outer product) of the vectors, and ε is the learning rate.
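Taken together, the four stages above amount to one step of contrastive-divergence (CD-1) pre-training of a restricted Boltzmann machine. The NumPy sketch below strings the stages together; the sigmoid activation, the Bernoulli sampler, and the use of outer products for the cross multiplication in the update formulas are assumptions of the sketch, not requirements of the claim.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gibbs_sample(p, rng):
        return (rng.random(p.shape) < p).astype(float)

    def cd1_step(v0, w, b, c, lr, rng):
        """One four-stage pre-training step on a single input vector v0."""
        h1 = gibbs_sample(sigmoid(w @ v0 + b), rng)      # stage 1: first-order hidden intermediate value
        v1 = gibbs_sample(sigmoid(w.T @ h1 + c), rng)    # stage 2: first-order visible intermediate value
        h2 = sigmoid(w @ v1 + b)                         # stage 3: second hidden intermediate value
        w  = w + lr * (np.outer(h1, v0) - np.outer(h2, v1))   # stage 4: weight update
        b  = b + lr * (h1 - h2)                               # update of the bias added in stages 1 and 3
        c  = c + lr * (v0 - v1)                               # update of the bias added in stage 2
        return w, b, c

Under these assumptions, w has shape (hidden, visible), b is the hidden-side bias and c is the visible-side bias, which matches the roles the claim assigns to the two bias vectors.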
10. A method for performing artificial neural network self-learning operation, applied to the apparatus for performing artificial neural network self-learning operation of any one of claims 1 to 9, comprising:
the controller unit reads the instruction, decodes the instruction into control signals for controlling the behaviors of the interconnection module, the main operation module and the slave operation module, and then distributes the respective control signals to the modules;
the interconnection module, which may be realized with different topologies, distributes the input vector of the master operation module to the plurality of slave operation modules, combines the calculation results of the slave operation modules, and returns the combined result to the master operation module (see the sketch following this claim);
the main operation module applies the activation function and Gibbs sampling to the intermediate value returned by the interconnection module and updates the bias of the activation function;
the slave operation modules perform a dot product operation between the input vector and the corresponding weight matrix, perform a product operation between the corresponding component scalar of the input vector and the corresponding weight matrix, and update the weight matrix;
the artificial neural network comprises two or more layers each containing a plurality of neurons, the self-learning pre-training of the artificial neural network is performed layer by layer, and for each layer of neurons the pre-training of the artificial neural network comprises:
in the first stage, in the slave operation modules, a dot product operation is performed between the input neuron vector v_0 broadcast by the interconnection module and the weight vector matrix w to obtain the local induced field; the local induced field undergoes the nonlinear transformation of the activation function, and Gibbs sampling is then applied to obtain the first-order hidden layer intermediate value h_1.
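The split of work recited in claim 10 (slaves computing dot products against their own slice of the weight matrix, the interconnection module combining the partial results, the master applying the activation function and Gibbs sampling) can be sketched as follows. Splitting the weight matrix by rows, adding the bias in the master, and the sigmoid activation are assumptions of this sketch.

    import numpy as np

    def first_stage_parallel(v0, w, b, n_slaves, rng):
        # each slave operation module holds a block of rows of the weight matrix
        blocks = np.array_split(w, n_slaves, axis=0)
        # slave modules: dot products between the broadcast input vector and their weight slice
        partials = [blk @ v0 for blk in blocks]
        # interconnection module: combine (splice) the slave results and return them to the master
        field = np.concatenate(partials) + b
        # master module: activation function followed by Gibbs sampling
        p_h1 = 1.0 / (1.0 + np.exp(-field))
        return (rng.random(p_h1.shape) < p_h1).astype(float)

For instance, with w of shape (8, 4) and n_slaves set to 4, each slave computes two of the eight dot products and the spliced result is an 8-element local induced field.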
11. The method for performing artificial neural network self-learning operations of claim 10, further comprising:
the instruction storage unit reads in the instruction through the data access unit and caches the read instruction;
and the data access unit accesses the external address space to complete the loading and the storing of the data.
12. The method for performing artificial neural network self-learning operations of claim 10, wherein the instruction comprises a COMPUTE instruction.
13. The method for performing artificial neural network self-learning operations of claim 10, wherein the instructions further comprise:
the CONFIG instruction is used for configuring various constants required by calculation of a current layer before calculation of each layer of artificial neural network starts;
a COMPUTE instruction for completing arithmetic logic calculation of each layer of artificial neural network;
the IO instruction is used for reading input data required by calculation from the external address space and storing the data back to the external space after the calculation is finished;
the NOP instruction is used for emptying the control signals currently loaded in all internal control signal cache queues, ensuring that all instructions preceding the NOP instruction have completed; the NOP instruction itself does not contain any operation;
the JUMP instruction is used by the controller to jump to the address of the next instruction to be read from the instruction storage unit, so as to implement a jump in the control flow;
the MOVE instruction is used for moving data from one address in the device's internal address space to another address in the internal address space; it is independent of the operation unit and does not occupy the operation unit's resources during execution.
14. The method for performing artificial neural network self-learning operation according to claim 10, wherein the main operation module includes an operation unit, a data dependency judgment unit, and a storage unit, wherein,
the memory unit caches input data and output data used by the main operation module in the calculation process,
the operation unit completes the operation of the main operation module;
the data dependency relationship judging unit is the port through which the operation unit reads and writes the storage unit, and is used for ensuring read-write consistency of the data in the storage unit.
15. The method for performing artificial neural network self-learning operation as claimed in claim 14, wherein the data dependency relationship determining unit is configured to determine whether a dependency exists between the data of a control signal that has not yet been executed and a control signal that is currently being executed; if no dependency exists, the control signal is allowed to issue immediately; otherwise, the control signal is allowed to issue only after all control signals it depends on have completed.
16. The method for performing artificial neural network self-learning operations of claim 15, wherein the data dependency determination unit is further configured to send the read data to the slave operation modules via the interconnection module.
17. The method for performing artificial neural network self-learning operation according to claim 10, wherein each of the slave operation modules includes an operation unit, a data dependency judgment unit, a first storage unit, a second storage unit, and a third storage unit, wherein,
the arithmetic unit is used for receiving the control signal sent by the controller unit and carrying out arithmetic logic operation;
the data dependency relationship judging unit is used for monitoring the read-write operation of the storage unit so as to ensure that consistency conflict does not exist in the read-write operation of the storage unit;
the first storage unit is used for caching input vectors and calculation results of the neurons;
the second storage unit is used for caching weight data required by the slave operation module in the calculation process;
the third storage unit is used for caching weight gradient data required by the corresponding slave operation module in the process of updating the weight.
18. The method of claim 10, wherein the artificial neural network comprises two or more layers each containing a plurality of neurons, and wherein the self-learning pre-training of the artificial neural network is performed layer by layer.
19. The method for performing artificial neural network self-learning operations of claim 18, wherein the pre-training is divided into four phases for each layer of neurons:
in the first stage, in the slave operation modules, a dot product operation is performed between the input neuron vector v_0 broadcast by the interconnection module and the weight vector matrix w to obtain the local induced field; the local induced field undergoes the nonlinear transformation of the activation function, and Gibbs sampling is then applied to obtain the first-order hidden layer intermediate value h_1;
in the second stage, the slave operation modules first carry out a dot product operation between the transpose w^T of the weight vector matrix and the transpose h_1^T of the first-order hidden layer intermediate value; after the local induced field in the main operation module undergoes the nonlinear transformation of the activation function, Gibbs sampling is applied to obtain the first-order visible layer intermediate value v_1;
in the third stage, the slave operation modules carry out a dot product operation between the received first-order visible layer intermediate value v_1 and the weight vector matrix w to obtain the local induced field, output the local induced field to the main operation module, and after the nonlinear transformation of the activation function the second hidden layer intermediate value h_2 is obtained;
in the fourth stage, the slave operation modules update the weights according to the following formulas:
w = w + ε(h_1 × v_0 - h_2 × v_1)
b = b + ε(h_1 - h_2)
c = c + ε(v_0 - v_1)
where the vector b is the bias added to the dot product of the vector and the weight matrix before the activation function in the first and third stages, the vector c is the bias in the second stage, "×" denotes the cross multiplication (outer product) of the vectors, and ε is the learning rate.
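Claims 18 and 19 describe greedy layer-by-layer pre-training: each layer is trained with the four-stage procedure, and the trained layer's sampled hidden activations become the input of the next layer. The driver below is a compact sketch of that loop under the same assumptions as the CD-1 sketch given after claim 9 (it reuses that cd1_step helper); the layer sizes, learning rate and epoch count are illustrative.

    import numpy as np

    def pretrain_layerwise(layer_sizes, data, lr=0.1, epochs=5, seed=0):
        """Greedy layer-by-layer pre-training; `data` holds one input vector per row."""
        rng = np.random.default_rng(seed)
        inputs, params = data.astype(float), []
        for v_dim, h_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
            w = rng.normal(0.0, 0.01, size=(h_dim, v_dim))
            b, c = np.zeros(h_dim), np.zeros(v_dim)
            for _ in range(epochs):
                for v0 in inputs:                       # one four-stage step per input vector
                    w, b, c = cd1_step(v0, w, b, c, lr, rng)
            params.append((w, b, c))
            # sampled hidden activations of this layer become the next layer's input
            probs = 1.0 / (1.0 + np.exp(-(inputs @ w.T + b)))
            inputs = (rng.random(probs.shape) < probs).astype(float)
        return params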
20. An electronic device comprising the apparatus for performing artificial neural network self-learning operations of any one of claims 1-9.
CN201610267211.0A 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation Active CN107316078B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910402047.3A CN110188870B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation
CN201610267211.0A CN107316078B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610267211.0A CN107316078B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201910402047.3A Division CN110188870B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation

Publications (2)

Publication Number Publication Date
CN107316078A CN107316078A (en) 2017-11-03
CN107316078B true CN107316078B (en) 2021-05-07

Family

ID=60185046

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910402047.3A Active CN110188870B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation
CN201610267211.0A Active CN107316078B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201910402047.3A Active CN110188870B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation

Country Status (1)

Country Link
CN (2) CN110188870B (en)

Families Citing this family (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754062B (en) * 2017-11-07 2024-05-14 上海寒武纪信息科技有限公司 Execution method of convolution expansion instruction and related product
CN109784125A (en) * 2017-11-10 2019-05-21 福州瑞芯微电子股份有限公司 Deep learning network processing device, method and image processing unit
CN109902814B (en) 2017-12-11 2020-01-17 中科寒武纪科技股份有限公司 Neural network operation module and method
CN110826712B (en) * 2017-12-14 2024-01-09 中科寒武纪科技股份有限公司 Neural network processor board card and related products
CN108108189B (en) * 2017-12-15 2020-10-30 安徽寒武纪信息科技有限公司 Calculation method and related product
EP3624019A4 (en) 2017-12-30 2021-03-24 Cambricon Technologies Corporation Limited Integrated circuit chip device and related product
CN109993290B (en) 2017-12-30 2021-08-06 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN109993292B (en) 2017-12-30 2020-08-04 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN109993289B (en) 2017-12-30 2021-09-21 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN108364065B (en) * 2018-01-19 2020-09-11 上海兆芯集成电路有限公司 Microprocessor for booth multiplication
CN110147249B (en) * 2018-02-12 2021-02-09 上海寒武纪信息科技有限公司 Network model calculation method and device
CN110163349B (en) * 2018-02-12 2021-03-23 上海寒武纪信息科技有限公司 Network model calculation method and device
US11704125B2 (en) 2018-02-13 2023-07-18 Cambricon (Xi'an) Semiconductor Co., Ltd. Computing device and method
US11630666B2 (en) 2018-02-13 2023-04-18 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
CN110163350B (en) * 2018-02-13 2021-06-08 上海寒武纪信息科技有限公司 Computing device and method
EP3651078B1 (en) * 2018-02-13 2021-10-27 Shanghai Cambricon Information Technology Co., Ltd Computation device and method
CN110163361B (en) * 2018-02-13 2021-06-25 上海寒武纪信息科技有限公司 Computing device and method
CN110197273B (en) * 2018-02-27 2020-08-25 上海寒武纪信息科技有限公司 Integrated circuit chip device and related product
CN111767996B (en) * 2018-02-27 2024-03-05 上海寒武纪信息科技有限公司 Integrated circuit chip device and related products
CN110197269B (en) * 2018-02-27 2020-12-29 安徽寒武纪信息科技有限公司 Integrated circuit chip device and related product
CN110196734A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 A kind of computing device and Related product
CN110197270B (en) * 2018-02-27 2020-10-30 上海寒武纪信息科技有限公司 Integrated circuit chip device and related product
CN111767998B (en) * 2018-02-27 2024-05-14 上海寒武纪信息科技有限公司 Integrated circuit chip device and related products
CN110197271B (en) * 2018-02-27 2020-10-27 上海寒武纪信息科技有限公司 Integrated circuit chip device and related product
CN111626413A (en) * 2018-03-14 2020-09-04 上海寒武纪信息科技有限公司 Computing device and method
CN110472734B (en) * 2018-05-11 2024-03-29 上海寒武纪信息科技有限公司 Computing device and related product
CN108763360A (en) * 2018-05-16 2018-11-06 北京旋极信息技术股份有限公司 A kind of sorting technique and device, computer readable storage medium
CN108710958B (en) * 2018-05-16 2022-04-15 北京旋极信息技术股份有限公司 Predictive health management method and device and computer readable storage medium
CN108859477A (en) * 2018-07-05 2018-11-23 吉林工程技术师范学院 A kind of children's literature book binder and its control method
CN110806903A (en) * 2018-08-01 2020-02-18 珠海格力电器股份有限公司 Configuration parameter determining method and device of electric cooker
US20220004854A1 (en) * 2018-10-08 2022-01-06 Deeper-I Co., Inc. Artificial neural network computation acceleration apparatus for distributed processing, artificial neural network acceleration system using same, and artificial neural network acceleration method therefor
CN110059809B (en) * 2018-10-10 2020-01-17 中科寒武纪科技股份有限公司 Computing device and related product
CN111047045B (en) * 2018-10-12 2021-03-19 中科寒武纪科技股份有限公司 Distribution system and method for machine learning operation
EP4009186A1 (en) 2018-10-18 2022-06-08 Shanghai Cambricon Information Technology Co., Ltd Network-on-chip data processing method and device
CN111079908B (en) * 2018-10-18 2024-02-13 上海寒武纪信息科技有限公司 Network-on-chip data processing method, storage medium, computer device and apparatus
CN111178492B (en) * 2018-11-09 2020-12-11 安徽寒武纪信息科技有限公司 Computing device, related product and computing method for executing artificial neural network model
CN111258641B (en) * 2018-11-30 2022-12-09 上海寒武纪信息科技有限公司 Operation method, device and related product
CN109542837B (en) * 2018-11-30 2023-03-24 上海寒武纪信息科技有限公司 Operation method, device and related product
CN111260046B (en) * 2018-11-30 2022-12-02 上海寒武纪信息科技有限公司 Operation method, device and related product
CN109919313B (en) * 2019-01-31 2021-06-08 华为技术有限公司 Gradient transmission method and distributed training system
CN109978160B (en) * 2019-03-25 2021-03-02 中科寒武纪科技股份有限公司 Configuration device and method of artificial intelligence processor and related products
US11934940B2 (en) 2019-04-18 2024-03-19 Cambricon Technologies Corporation Limited AI processor simulation
CN111080400B (en) * 2019-11-25 2023-04-18 中山大学 Commodity recommendation method and system based on gate control graph convolution network and storage medium
CN111461340B (en) * 2020-03-10 2023-03-31 北京百度网讯科技有限公司 Weight matrix updating method and device and electronic equipment
CN112329619B (en) * 2020-11-04 2022-06-14 济南博观智能科技有限公司 Face recognition method and device, electronic equipment and readable storage medium
CN114071781B (en) * 2021-11-16 2024-04-12 杭州电子科技大学 Wireless local area network medium access control method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732274A (en) * 2015-03-10 2015-06-24 华南理工大学 Intelligent computer
CN104757992A (en) * 2015-03-16 2015-07-08 广东工业大学 Cardiac sound diagnostic system based on depth confidence network and diagnostic method
CN105117706A (en) * 2015-08-28 2015-12-02 小米科技有限责任公司 Image processing method and apparatus and character recognition method and apparatus
CN105447569A (en) * 2015-12-18 2016-03-30 北京柏惠维康科技有限公司 Breast cancer cell characteristic analysis system based on deep learning
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729678B (en) * 2013-12-12 2016-10-05 中国科学院信息工程研究所 A kind of based on navy detection method and the system of improving DBN model
CN104157290B (en) * 2014-08-19 2017-10-24 大连理工大学 A kind of method for distinguishing speek person based on deep learning
CN104182772B (en) * 2014-08-19 2017-10-24 大连理工大学 A kind of gesture identification method based on deep learning
CN105184366B (en) * 2015-09-15 2018-01-09 中国科学院计算技术研究所 A kind of time-multiplexed general neural network processor

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732274A (en) * 2015-03-10 2015-06-24 华南理工大学 Intelligent computer
CN104757992A (en) * 2015-03-16 2015-07-08 广东工业大学 Cardiac sound diagnostic system based on depth confidence network and diagnostic method
CN105117706A (en) * 2015-08-28 2015-12-02 小米科技有限责任公司 Image processing method and apparatus and character recognition method and apparatus
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm
CN105447569A (en) * 2015-12-18 2016-03-30 北京柏惠维康科技有限公司 Breast cancer cell characteristic analysis system based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"DaDianNao: A Machine-Learning Supercomputer";Yunji Chen 等;《International Symposium on Microarchitecture》;20141231;参见第609-622页 *

Also Published As

Publication number Publication date
CN110188870B (en) 2021-10-12
CN110188870A (en) 2019-08-30
CN107316078A (en) 2017-11-03

Similar Documents

Publication Publication Date Title
CN107316078B (en) Apparatus and method for performing artificial neural network self-learning operation
CN107341547B (en) Apparatus and method for performing convolutional neural network training
CN107341542B (en) Apparatus and method for performing recurrent neural networks and LSTM operations
CN107832843B (en) Information processing method and related product
CN107315571B (en) Device and method for executing forward operation of full-connection layer neural network
CN109376861B (en) Apparatus and method for performing full connectivity layer neural network training
CN107329734B (en) Apparatus and method for performing convolutional neural network forward operation
CN107301454B (en) Artificial neural network reverse training device and method supporting discrete data representation
CN109358900B (en) Artificial neural network forward operation device and method supporting discrete data representation
CN109242094B (en) Apparatus and method for performing artificial neural network forward operations
CN107886166B (en) Device and method for executing artificial neural network operation
WO2017185347A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
CN111353588A (en) Apparatus and method for performing artificial neural network reverse training
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
EP3444758B1 (en) Discrete data representation-supporting apparatus and method for back-training of artificial neural network
WO2018058452A1 (en) Apparatus and method for performing artificial neural network operation
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation
CN107341546B (en) Device and method for executing batch normalization operation
CN111178492B (en) Computing device, related product and computing method for executing artificial neural network model
CN109993276B (en) Apparatus and method for performing artificial neural network reverse training

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant