WO2017185248A1 - Apparatus and method for performing artificial neural network self-learning operation - Google Patents

Apparatus and method for performing artificial neural network self-learning operation

Info

Publication number
WO2017185248A1
Authority
WO
WIPO (PCT)
Prior art keywords
module
unit
vector
storage unit
neural network
Prior art date
Application number
PCT/CN2016/080320
Other languages
English (en)
French (fr)
Inventor
李震
郭崎
陈云霁
陈天石
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司
Priority to PCT/CN2016/080320 priority Critical patent/WO2017185248A1/zh
Priority to EP16899762.5A priority patent/EP3451240A4/en
Publication of WO2017185248A1 publication Critical patent/WO2017185248A1/zh
Priority to US16/174,108 priority patent/US20190065953A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • the present invention relates to artificial neural network technology, and in particular to an apparatus and method for performing an artificial neural network self-learning operation.
  • Multi-layer artificial neural networks are widely used in the fields of pattern recognition, image processing, function approximation and optimization calculation.
  • In recent years, owing to their high recognition accuracy and good parallelizability, multi-layer artificial neural networks have received increasingly broad attention from both academia and industry.
  • a typical multi-layer artificial neural network training method is the back propagation (BP) algorithm.
  • This method is a representative type of supervised learning: it requires a large number of labeled training samples during training, yet collecting those samples is very costly.
  • Moreover, during training the error-correction signal weakens as the number of propagation layers increases, so training tends to converge to local minima and converges slowly. It has therefore become a new research hotspot to first pre-train the network parameters with a self-learning algorithm that converges quickly and needs no labeled training samples, and then fine-tune the multi-layer neural network with back-propagation training.
  • The self-learning operation used as pre-training is thus particularly important.
  • One known method of supporting multi-layer artificial neural network self-learning operations is to use a general purpose processor.
  • That method supports the above algorithm by executing general-purpose instructions using a general-purpose register file and general-purpose functional units.
  • One disadvantage of this approach is that the arithmetic performance of a single general-purpose processor is low and cannot meet the performance requirements of typical multi-layer artificial neural network operations; when multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck.
  • In addition, a general-purpose processor must decode the multi-layer artificial neural network pre-training operation into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding imposes a large power overhead.
  • Another known method of supporting multi-layer artificial neural network pre-training is to use a graphics processing unit (GPU).
  • That method supports the above algorithm by executing generic SIMD instructions using a general-purpose register file and generic stream processing units.
  • Since the GPU is a device dedicated to graphics and image operations and scientific computing, with no special support for multi-layer artificial neural network operations, a large amount of front-end decoding work is still required to perform those operations, which brings a large amount of additional overhead.
  • Moreover, the GPU has only a small on-chip cache, and the model data (weights) of a multi-layer artificial neural network must be repeatedly transferred from off-chip; off-chip bandwidth therefore becomes the main performance bottleneck while also bringing huge power overhead.
  • The present invention addresses the problems that, in the prior art, pre-training a multi-layer neural network on a general-purpose processor (GPU, CPU) requires a long series of simple arithmetic and memory-access operations, so front-end decoding consumes considerable power; that the data memory-access overhead of existing general-purpose processors is large; and that the arithmetic performance of a single general-purpose processor is low.
  • The present invention provides an apparatus for performing an artificial neural network self-learning operation, comprising an instruction storage unit, a controller unit, a data access unit, an interconnection module, a main operation module, and a plurality of slave operation modules, wherein: the instruction storage unit is configured to read in instructions through the data access unit and cache them; the controller unit is configured to read an instruction from the instruction storage unit, decode it into control signals that control the behavior of the interconnection module, the main operation module, and the slave operation modules, and distribute the respective control signals to the respective modules; the data access unit is used to access the external address space to complete the loading and storing of data; the interconnection module, which has different topology implementations, is used to distribute the input vector of the main operation module to the plurality of slave operation modules and to merge the computation results of the slave operation modules and return them to the main operation module; the main operation module is configured to apply the activation function and Gibbs sampling to the intermediate values returned by the interconnection module and to update the bias of the activation function; and the slave operation modules are used for the dot-product operation between the input vector and the corresponding weight matrix, the product operation between the corresponding component scalars of the input vector and the corresponding columns of the weight matrix, and the update of the weight matrix.
  • the main operation module includes an operation unit, a data dependency determination unit, and a storage unit, wherein the storage unit is configured to cache input data and output data used by the main operation module in the calculation process.
  • the operation unit is configured to complete an operation of the main operation module;
  • The data dependency determination unit is the port through which the operation unit reads and writes the storage unit, and it ensures read-write consistency of the data in the storage unit.
  • Specifically, the data dependency determination unit determines whether a dependency exists between the data of a control signal that has not yet been executed and a control signal currently being executed; if not, the group of control signals is allowed to issue immediately; otherwise, the control signal may issue only after all control signals on which it depends have completely executed.
  • The data dependency determination unit is further configured to send the read data to the slave operation modules through the interconnection module.
  • Each slave operation module includes an operation unit, a data dependency determination unit, a first storage unit, a second storage unit, and a third storage unit, wherein the operation unit is configured to receive the control signal sent by the controller unit and perform arithmetic and logic operations; the data dependency determination unit is configured to monitor read and write operations on the storage units to ensure that there are no consistency conflicts between reads and writes; the first storage unit is configured to cache the input vectors and computation results of the neurons; the second storage unit is configured to cache the weight data that the slave operation module needs during computation; and the third storage unit is configured to cache the weight gradient data that the corresponding slave operation module needs when updating the weights.
  • The invention also provides a method for performing a layer-by-layer artificial neural network self-learning operation. The artificial neural network comprises multiple neurons in two or more layers, and self-learning pre-training of the artificial neural network adopts layer-by-layer training. For each layer, the pre-training is divided into four stages:
  • In the first stage, the input neuron vector $v_0$ and the weight matrix $W$ undergo a dot-product operation to obtain the local induced field; the local induced field is nonlinearly transformed by the activation function, and Gibbs sampling is then applied to obtain the first-order hidden-layer intermediate value $h_0$.
  • In the second stage, the transpose $W^T$ of the weight matrix and the transpose $h_0^T$ of the first-order hidden-layer intermediate value undergo a dot-product operation; the resulting local induced field is nonlinearly transformed by the activation function, and Gibbs sampling is then applied to obtain the first-order visible-layer intermediate value $v_1$.
  • In the third stage, the first-order visible-layer intermediate value $v_1$ and the weight matrix $W$ undergo a dot-product operation to obtain the local induced field, which is nonlinearly transformed by the activation function to obtain the second-order hidden-layer intermediate value $h_1$, without Gibbs sampling.
  • In the fourth stage, the weights are updated according to the following formulas:
    $W \leftarrow W + \epsilon (h_0 \times v_0 - h_1 \times v_1)$   (1)
    $b \leftarrow b + \epsilon (h_0 - h_1)$   (2)
    $c \leftarrow c + \epsilon (v_0 - v_1)$   (3)
    where the vector $b$ is the bias added to the partial sums of the vector-weight-matrix dot products before the activation function in the first and third stages, the vector $c$ is the bias in the second stage, "×" denotes the vector outer (cross) product, and $\epsilon$ is the learning rate.
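The four stages above amount to one contrastive-divergence-style pre-training step for a restricted-Boltzmann-machine layer (the deep belief network case named later in the description). As a dataflow reference only, here is a minimal NumPy sketch of one such iteration; the sigmoid activation, the Bernoulli Gibbs sampling, and the exact sign conventions are assumptions, since the patent leaves the activation function and sampling details to CONFIG parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_layer_iteration(v0, W, b, c, lr, rng):
    """One four-stage self-learning iteration for a single layer.

    v0: input neuron vector, W: weight matrix, b: stage-1/3 bias,
    c: stage-2 bias, lr: learning rate epsilon.
    """
    # Stage 1: dot product -> local induced field -> activation -> Gibbs sampling
    field_h0 = W @ v0 + b
    h0 = (rng.random(field_h0.shape) < sigmoid(field_h0)).astype(float)

    # Stage 2: transposed dot product -> activation -> Gibbs sampling
    field_v1 = W.T @ h0 + c
    v1 = (rng.random(field_v1.shape) < sigmoid(field_v1)).astype(float)

    # Stage 3: like stage 1, but the input is v1 and no Gibbs sampling is applied
    h1 = sigmoid(W @ v1 + b)

    # Stage 4: weight/bias updates from the three intermediate values
    W += lr * (np.outer(h0, v0) - np.outer(h1, v1))
    b += lr * (h0 - h1)
    c += lr * (v0 - v1)
    return W, b, c

rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (8, 16))          # 8 hidden units, 16 visible units
b, c = np.zeros(8), np.zeros(16)
v0 = rng.integers(0, 2, 16).astype(float)  # one binary input sample
W, b, c = pretrain_layer_iteration(v0, W, b, c, 0.1, rng)
```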
  • The present invention optimizes the multi-layer neural network pre-training instructions: the processor can complete the pre-training of one neural network layer with only a single instruction, which streamlines the front-end decoding overhead of general-purpose processor instructions;
  • The invention comprises a main operation module, multiple slave operation modules, and a large amount of distributed on-chip storage that alleviates memory-access overhead, so the neural network pre-training operation can be executed in parallel without frequent off-chip data accesses.
  • the performance-to-power ratio of the present invention is much higher than that of a general purpose processor.
  • The invention can be applied to (but is not limited to) the following scenarios: data processing, robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, cloud servers, still cameras, camcorders, projectors, watches, earphones, mobile storage, wearable devices, and other electronic products; aircraft, ships, vehicles, and other types of transportation; televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, range hoods, and other household appliances; as well as nuclear magnetic resonance instruments, B-ultrasound scanners, electrocardiographs, and other medical equipment.
  • FIG. 1 shows an example block diagram of the overall structure of an apparatus for performing artificial neural network self-learning pre-training in accordance with an embodiment of the present invention.
  • FIG. 2 is a schematic diagram showing an H-tree structure implementation of an interconnect module in an apparatus for performing artificial neural network self-learning pre-training in accordance with an embodiment of the present invention.
  • FIG. 3 illustrates an example block diagram of a main operational module structure in an apparatus for performing artificial neural network self-learning pre-training in accordance with an embodiment of the present invention.
  • FIG. 4 illustrates an example block diagram of a slave module structure in an apparatus for performing artificial neural network self-learning pre-training in accordance with an embodiment of the present invention.
  • FIG. 5 illustrates an example block diagram of first and third stages of a neural network self-learning pre-training process in accordance with an embodiment of the present invention.
  • FIG. 6 shows an example block diagram of a second phase of a neural network self-learning pre-training process in accordance with an embodiment of the present invention.
  • FIG. 7 shows an example flow diagram of a fourth phase of a neural network self-learning pre-training process in accordance with an embodiment of the present invention.
  • FIG. 8 is a flow chart showing an example of one iteration of single-layer neural network self-learning pre-training according to an embodiment of the present invention.
  • the artificial neural network includes a plurality of neurons of two or more layers.
  • Self-learning pre-training of the artificial neural network adopts layer-by-layer training, training from the first layer through the last. For each layer, pre-training is divided into four stages:
  • In the first stage, the input neuron vector $v_0$ is first dot-multiplied with the weight matrix $W$ to obtain the local induced field; the local induced field is nonlinearly transformed by the activation function, and Gibbs sampling is then applied to obtain the first-order hidden-layer intermediate value $h_0$.
  • In the second stage, the transpose $W^T$ of the weight matrix and the transpose $h_0^T$ of the first-order hidden-layer intermediate value undergo a dot-product operation; the resulting local induced field is nonlinearly transformed by the activation function, and Gibbs sampling is then applied to obtain the first-order visible-layer intermediate value $v_1$.
  • The third stage is similar to the first, except that its input is the first-order visible-layer intermediate value $v_1$, and no Gibbs sampling is required before computing the second-order hidden-layer intermediate value $h_1$;
  • In the fourth stage, the weights are updated according to formulas (1)-(3) above.
  • the apparatus includes an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a main operation module 5, and a plurality of slave operation modules 6.
  • the instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the slave operation module 6 can all be implemented by a hardware circuit such as an application specific integrated circuit (ASIC).
  • the instruction storage unit 1 reads in an instruction through the data access unit 3 and caches the read instruction.
  • the controller unit 2 reads an instruction from the instruction storage unit 1, translates the instruction into a control signal that controls the behavior of other modules, and transmits it to other modules such as the data access unit 3, the main operation module 5, and the slave operation module 6.
  • the data access unit 3 can access the external address space, directly read and write data to each cache unit inside the device, and complete data loading and storage.
  • FIG. 2 schematically shows the structure of the interconnection module 4.
  • The interconnection module 4 constitutes the data path between the main operation module 5 and the multiple slave operation modules 6 and may take different structures. In one example, the interconnection is a binary-tree path composed of multiple nodes: each node sends the upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the result to its upstream node.
  • For example, in the first and third stages of the neural network self-learning operation, the input vector in the main operation module 5 is sent to each slave operation module 6 through the interconnection module 4; after each slave operation module 6 completes its computation, the neuron values output by the slave operation modules are assembled, stage by stage in the interconnection module, into a complete vector composed of local induced fields, which is returned to the main operation module 5 as the intermediate result vector for the activation function and, as required, Gibbs sampling.
  • In the second stage, the first-order hidden-layer intermediate value vector $h_0$ in the main operation module 5 is sent to each slave operation module 6 through the interconnection module 4; when the computation of the slave operation modules 6 is complete, the vectors returned by each pair of downstream nodes are added into a single vector at the current node and returned to the upstream node, and the result is returned to the main operation module 5 as the intermediate result vector for the activation function and Gibbs sampling.
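As an illustration of the two merge behaviors described above, the following toy Python model assumes a power-of-two number of slave modules and merges leaf results pairwise up a binary tree; the function names are illustrative and not from the patent:

```python
import numpy as np

def tree_combine(results, merge):
    """Merge leaf results pairwise, level by level, up to the root node."""
    level = list(results)
    while len(level) > 1:
        level = [merge(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Stages 1/3: each slave returns one output component; merging by
# concatenation assembles the complete local-induced-field vector.
partials = [[1.0], [0.5], [2.0], [0.25]]           # from 4 slave modules
print(tree_combine(partials, lambda a, b: a + b))  # list '+' concatenates

# Stage 2: each slave returns a full-length partial-sum vector;
# merging by element-wise addition sums the partial sums.
vec_partials = [np.ones(4), np.ones(4), np.ones(4), np.ones(4)]
print(tree_combine(vec_partials, np.add))          # -> [4. 4. 4. 4.]
```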
  • Fig. 3 shows an example block diagram of the structure of the main arithmetic module 5 in the apparatus for performing an artificial neural network pre-training operation according to the present invention.
  • the main operation module 5 includes an operation unit 51, a data dependency determination unit 52, and a storage unit 53.
  • The storage unit 53 is used to cache the input data and output data used by the main operation module 5 during computation; the operation unit 51 performs the various operation functions of the main operation module 5; and the data dependency determination unit 52 is the port through which the operation unit 51 reads and writes the storage unit 53, ensuring read-write consistency of the data in the storage unit.
  • Specifically, the data dependency determination unit 52 determines whether a dependency exists between the data of a control signal that has not yet been executed and a control signal currently being executed; if not, the group of control signals is allowed to issue immediately; otherwise, the control signal may issue only after all control signals on which it depends have completely executed.
  • For example, all control signals sent to the data dependency unit 52 are stored in an instruction queue inside the data dependency unit 52; in this queue, if the read-data range of a read instruction conflicts with the write-data range of a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has been executed.
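A minimal sketch of that read-after-write check, assuming contiguous address ranges as the conflict criterion; the Signal class and its fields are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Signal:
    op: str     # "read" or "write"
    start: int  # first address touched
    end: int    # last address touched (inclusive)

def may_issue(candidate: Signal, pending: list[Signal]) -> bool:
    """True if `candidate` has no conflict with earlier queued signals."""
    if candidate.op != "read":
        return True  # only read-after-write conflicts are modeled here
    for earlier in pending:
        overlap = candidate.start <= earlier.end and earlier.start <= candidate.end
        if earlier.op == "write" and overlap:
            return False  # must wait until the earlier write completes
    return True

queue = [Signal("write", 0, 63)]
print(may_issue(Signal("read", 32, 95), queue))   # False: ranges overlap
print(may_issue(Signal("read", 64, 127), queue))  # True: disjoint ranges
```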
  • The data dependency determination unit 52 is also responsible for sending read data to the slave operation modules through the interconnection module 4, while the output data of the slave operation modules 6 is sent directly to the operation unit 51 through the interconnection module 4.
  • The instructions output by the controller unit 2 are sent to the operation unit 51 and the data dependency determination unit 52 to control their behavior.
  • each slave arithmetic module 6 includes an arithmetic unit 61, a data dependency determining unit 62, a first storage unit 63, a second storage unit 64, and a third storage unit 65.
  • the arithmetic unit 61 receives the control signal from the controller unit 2 and performs an arithmetic logic operation.
  • The data dependency determination unit 62 is responsible for the read and write operations on the storage units during computation, and it ensures that there are no consistency conflicts between those reads and writes. For example, all control signals sent to the data dependency unit 62 are stored in an instruction queue inside the data dependency unit 62; in this queue, if the read-data range of a read instruction conflicts with the write-data range of a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has been executed.
  • The first storage unit 63 caches, during the several stages, the input neuron vector $v_0$, the first-order hidden-layer intermediate value $h_0$, the first-order visible-layer intermediate value $v_1$, the second-order hidden-layer intermediate value $h_1$, and the dot-product results of the input vectors and the weight matrix computed in each stage.
  • The second storage unit 64 caches the weight data that the slave operation module 6 needs during computation. Each slave operation module stores only the columns of the weight matrix that correspond to the scalar data stored by that slave operation module 6.
  • the third storage unit 65 buffers the weight gradient data required by the corresponding computing module in the process of updating the weights.
  • The weight gradient data stored by each slave operation module 6 corresponds to the weight data it stores.
  • The slave operation modules 6 implement the parallel first half of the first three stages of the artificial neural network self-learning pre-training process, as well as the weight update of formula (1) in the last stage. Taking pre-training of an artificial neural network deep belief network (DBN) as an example, the multiplication of the weight matrix $W$ (or $W^T$) with the input neuron vector in the first three stages can be partitioned into independent, parallel computing sub-tasks.
  • In the first and third stages, each slave operation module 6 performs dot-product operations using the same input vector and the weights corresponding to different components of the output vector, obtaining the partial sums corresponding to the different output components; after repeated accumulation, the partial sums of the respective output components are assembled, stage by stage in the interconnection module 4, into one complete local-induced-field vector.
  • Each slave operation module 6 only needs to compute the local induced field corresponding to its own output neuron value; the different local-induced-field components are assembled in the interconnection module 4 into one complete vector that is transmitted to the main operation module for the activation function and the subsequent sampling.
  • In the second stage, each slave operation module 6 computes only the product of the corresponding partial scalar of the input first-order hidden-layer intermediate value vector $h_0$ with the corresponding column of the weight matrix; each resulting output vector is a to-be-accumulated partial sum of the final result, and these partial sums are added pairwise, stage by stage in the interconnection module, to obtain the final result.
  • Each slave operation module 6 thus computes a partial sum of the local induced field of the output first-order visible-layer vector, and all partial sums complete the summation in the interconnection module 4 to obtain the final local induced field.
  • The first three stages compute the intermediate values used for updating the weights, and the main operation module 5 performs subsequent operations on the outputs of the first three stages to obtain the weight update values.
  • In the last stage, the weight update performed by the slave operation modules 6 according to formula (1) can likewise be divided into three sub-steps:
  • 1. Each slave operation module 6 computes the intermediate product of the first-order hidden-layer intermediate value vector $h_0$ and the corresponding partial scalar of the input neuron vector $v_0$;
  • 2. Each slave operation module 6 computes the product of the second-order hidden-layer intermediate value vector $h_1$ and the corresponding partial scalar of the first-order visible-layer vector $v_1$, and computes the vector difference between this and the intermediate value from the first sub-step;
  • 3. Each slave operation module 6 multiplies the difference from the second sub-step by the learning rate to obtain the weight update value, and then performs a vector subtraction with the weight to obtain the updated weight.
  • It is worth noting that the above three sub-steps are only one example description of how the slave operation modules 6 update the weights, and the user may fine-tune the details: for example, the product computation of the first sub-step and that of the second sub-step may be interchanged, or the multiplication by the learning rate in the third sub-step may be moved forward into the second sub-step or even split across the first two sub-steps.
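Seen from a single slave module, the three sub-steps reduce to a short per-column computation. The sketch below is a hypothetical rendering in which the difference is taken as h1*v1 - h0*v0 and then subtracted from the weight, a sign convention assumed here so that the net effect matches formula (1):

```python
import numpy as np

def slave_weight_update(w_i, h0_i, h1_i, v0, v1, lr):
    """Three sub-steps of the weight update as seen by slave module i.

    w_i: the weight column held by this slave; h0_i, h1_i: its scalar
    slices of the hidden intermediate values; v0, v1: broadcast vectors.
    """
    step1 = h0_i * v0              # sub-step 1: cache the h0_i * v0 product
    diff = h1_i * v1 - step1       # sub-step 2: second product, then difference
    return w_i - lr * diff         # sub-step 3: scale by lr and subtract

w_i = np.zeros(4)
v0 = np.array([1.0, 0.0, 1.0, 1.0])
v1 = np.array([1.0, 0.0, 0.0, 1.0])
print(slave_weight_update(w_i, h0_i=1.0, h1_i=0.5, v0=v0, v1=v1, lr=0.1))
```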
  • According to an embodiment of the invention, an instruction set for performing artificial neural network forward operations on the aforementioned apparatus includes the CONFIG instruction, the COMPUTE instruction, the IO instruction, the NOP instruction, the JUMP instruction, and the MOVE instruction, where:
  • the CONFIG instruction configures, before the computation of each layer of the artificial neural network begins, the various constants required by the current layer's computation;
  • the COMPUTE instruction completes the arithmetic and logic computation of each layer of the artificial neural network;
  • the IO instruction reads in the input data required by the computation from the external address space and stores the data back to the external space after the computation completes;
  • the NOP instruction is responsible for clearing the control signals currently loaded into all internal control-signal buffer queues, guaranteeing that all instructions before the NOP instruction have completed; the NOP instruction itself does not contain any operation;
  • the JUMP instruction is responsible for jumping the address of the next instruction that the controller will read from the instruction storage unit, and is used to implement control-flow jumps;
  • the MOVE instruction is responsible for moving data at one address in the device's internal address space to another address in the internal address space; this process is independent of the operation unit and occupies no operation-unit resources during execution.
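As an illustration of how these opcodes compose into one layer's pre-training (mirroring steps S1 to S11 described below), here is a hypothetical instruction stream and dispatch loop; the operand fields are invented for the sketch and are not the patent's encoding:

```python
# Hypothetical per-layer instruction stream: IO loads, then CONFIG/COMPUTE
# pairs for the four stages, then an IO store. Opcodes match the instruction
# set above; operand dictionaries are illustrative only.
program = [
    ("IO",      {"src": "ext", "dst": "instr_mem"}),   # load instructions
    ("IO",      {"src": "ext", "dst": "master_mem"}),  # v0, biases, tables
    ("IO",      {"src": "ext", "dst": "weight_mem"}),  # weight matrix
    ("CONFIG",  {"stage": 1}), ("COMPUTE", {"stage": 1}),
    ("CONFIG",  {"stage": 2}), ("COMPUTE", {"stage": 2}),
    ("CONFIG",  {"stage": 3, "lr": 0.1}), ("COMPUTE", {"stage": 3}),
    ("COMPUTE", {"stage": 4}),                          # weight update
    ("IO",      {"src": "master_mem", "dst": "ext"}),  # store results
]

def run(program, handlers):
    for opcode, operands in program:
        handlers[opcode](operands)  # controller decodes and dispatches

run(program, {op: print for op in ("IO", "CONFIG", "COMPUTE")})
```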
  • FIG. 5 illustrates an example block diagram of first and third stages of a neural network self-learning pre-training process in accordance with an embodiment of the present invention.
  • In each slave operation module 6, the input vector broadcast by the interconnection module 4 undergoes a dot-product operation with that module's weight vector, yielding the local-induced-field partial sum of the corresponding output neuron value; all these output local-induced-field values form the intermediate result vector which, after the bias vector is added and the activation operation applied, yields the final output neuron vector of this layer, described by the formula out = f(w*in + b), where out is the output vector, in is the input vector, b is the bias vector, w is the weight matrix, and f is the activation function.
  • The weight vector of each slave operation module 6 is the column vector of the weight matrix corresponding to that slave operation module 6.
  • The interconnection module 4 sends the input vector [I0, ..., In] to all slave operation units, where it is temporarily stored in the first storage unit. For the i-th slave operation unit, the dot product of its corresponding weight vector [Wi0, ..., Win] with the input vector is computed.
  • The results output by the slave operation units are merged into a complete local-induced-field vector via the interconnection module 4 and returned to the main operation module 5, where the activation function operation and, possibly, Gibbs sampling are performed to obtain the final output vector [O0, O1, ..., On].
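A sketch of this stage-1/3 work split, under the assumption that slave i holds the weight vector paired with output component i; the names are illustrative:

```python
import numpy as np

def stage13_local_field(W, x, bias):
    """Stages 1/3: one dot product per slave, assembled by the interconnect."""
    partial = [W[i, :] @ x for i in range(W.shape[0])]  # one scalar per slave
    return np.array(partial) + bias  # assembled vector; activation follows

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))  # 4 slave modules, 8 input components
x = rng.normal(size=8)
print(stage13_local_field(W, x, bias=np.zeros(4)))  # equals W @ x
```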
  • FIG. 6 shows an example block diagram of a second phase of a neural network self-learning pre-training process in accordance with an embodiment of the present invention.
  • The process of computing the output first-order visible-layer vector $v_1$ is as follows: the interconnection module 4 broadcasts the first-order hidden-layer vector value; each slave operation module 6 takes the corresponding partial scalar h0i of $h_0$ and multiplies it by the corresponding column [Wi0, ..., Win] of the weight matrix; each output vector obtained is a to-be-accumulated partial sum of the local induced field of the first-order visible-layer vector, and these partial sums are added pairwise, stage by stage in the interconnection module 4, to obtain the final local induced field.
  • The computed local induced field is returned to the main operation module 5, where the activation function operation and, possibly, Gibbs sampling are performed to obtain the final output first-order visible-layer vector $v_1$.
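The stage-2 split differs in that each slave contributes a full-length partial vector rather than a single component; a sketch under the same illustrative assumptions:

```python
import numpy as np

def stage2_local_field(W, h0, bias):
    """Stage 2: slave i scales its weight column by its scalar h0[i];
    the tree interconnect adds the partial-sum vectors pairwise."""
    partials = [h0[i] * W[i, :] for i in range(len(h0))]  # one vector per slave
    return np.sum(partials, axis=0) + bias                # summed in the tree

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 8))
h0 = rng.integers(0, 2, 4).astype(float)
print(stage2_local_field(W, h0, bias=np.zeros(8)))  # equals W.T @ h0
```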
  • FIG. 7 shows a flow chart of a fourth stage of a neural network self-learning pre-training process in accordance with an embodiment of the present invention.
  • In the last stage, the weight update performed by the slave operation modules 6 according to formula (1) can again be divided into three sub-steps:
  • 1. Each slave operation module 6 computes the intermediate product of the first-order hidden-layer intermediate value vector $h_0$ and the corresponding partial scalar of the input neuron vector $v_0$, and caches it to the third storage unit shown in FIG. 4; this sub-step is similar to the second-stage block diagram shown in FIG. 6, except that its inputs are the first-order hidden-layer intermediate value vector $h_0$ and the input neuron vector $v_0$;
  • 2. Each slave operation module 6 computes the product of the second-order hidden-layer intermediate value vector $h_1$ and the corresponding partial scalar of the first-order visible-layer vector $v_1$, computes the vector difference between this and the intermediate value from the first sub-step, and caches the result to the third storage unit shown in FIG. 4;
  • 3. Each slave operation module 6 multiplies the difference from the second sub-step by the learning rate to obtain the weight update value, and then performs a vector subtraction with the weight to obtain the updated weight.
  • It is worth noting that the above three sub-steps are only one example description of how the slave operation modules 6 update the weights, and the user may fine-tune the details: for example, the product computation of the first sub-step and that of the second sub-step may be interchanged, or the multiplication by the learning rate in the third sub-step may be moved forward into the second sub-step or even split across the first two sub-steps.
  • FIG. 8 illustrates a flow chart of the self-learning pre-training operation of one layer of an artificial neural network according to an embodiment. Since multi-layer artificial neural network self-learning pre-training can adopt a layer-by-layer training manner, pre-training of a multi-layer artificial neural network can be implemented by invoking this process multiple times. The flow chart describes the process of implementing the single-layer neural network self-learning pre-training operation shown in FIG. 4 using the apparatus and instruction set of the present invention.
  • In step S1, an IO instruction is pre-stored at the first address of the instruction storage unit 1.
  • In step S2, the operation starts: the controller unit 2 reads this IO instruction from the first address of the instruction storage unit 1, and according to the decoded control signal, the data access unit 3 reads all corresponding artificial neural network operation instructions from the external address space and caches them in the instruction storage unit 1.
  • In step S3, the controller unit 2 then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit 3 reads all the data required by the main operation module 5 (for example, including the input neuron vector $v_0$, the activation function interpolation table, the learning rate, the biases, and so on) from the external address space into the storage unit 53 of the main operation module 5.
  • In step S4, the controller unit 2 then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit 3 reads from the external address space the weight matrix data required by the slave operation modules 6.
  • In step S5, the controller unit 2 then reads the next CONFIG instruction from the instruction storage unit, and according to the decoded control signal, the device configures the various constants required by the first-stage computation of this layer of the neural network. For example, the operation units 51 and 61 configure the values of their internal registers according to the parameters in the control signal, including, for example, the precision setting of this layer's computation and the data of the activation function.
  • In step S6, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit and, according to the decoded control signal, starts the first-stage computation. The main operation module 5 first sends the input neuron vector $v_0$ through the interconnection module 4 to each slave operation module 6, where it is saved to the first storage unit 63 of the slave operation module 6. The operation unit 61 of each slave operation module 6 reads its weight vector (the column vector of the weight matrix corresponding to that slave operation module 6) from the second storage unit 64, reads the input neuron vector $v_0$ from the first storage unit, completes the dot-product operation of the weight vector and the input neuron vector $v_0$, and returns the intermediate result through the interconnection module. In the interconnection module 4, the intermediate results returned by the slave operation modules 6 are assembled, stage by stage, into a complete local-induced-field vector. The main operation module 5 obtains the return value of the interconnection module 4, reads the bias vector from the storage unit 53 according to the control signal decoded from the COMPUTE instruction, adds it to the vector returned by the interconnection module 4, applies the activation to the sum, performs Gibbs sampling, and writes the final first-order hidden-layer vector $h_0$ back to the storage unit 53.
  • In step S7, the controller unit 2 then reads the next CONFIG instruction from the instruction storage unit, and according to the decoded control signal, the device configures the various constants required by the second-stage computation of this layer of the neural network.
  • In step S8, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit and, according to the decoded control signal, starts the second-stage computation. The main operation module 5 first sends the first-order hidden-layer vector $h_0$ through the interconnection module 4 to each slave operation module 6, where it is saved to the first storage unit 63 of the slave operation module 6. The operation unit 61 of each slave operation module 6 reads its weight vector (the column vector of the weight matrix corresponding to that slave operation module 6) from the second storage unit 64, selects the corresponding scalar of the first-order hidden-layer vector $h_0$ from the first storage unit, completes the product operation of the weight vector with the corresponding scalar of the first-order hidden-layer vector $h_0$, and returns the intermediate result through the interconnection module. In the interconnection module 4, the intermediate results returned by the slave operation modules 6 are added, stage by stage, into a complete local-induced-field vector. The main operation module 5 obtains the return value of the interconnection module 4, reads the bias vector from the storage unit 53 according to the control signal decoded from the COMPUTE instruction, adds it to the vector returned by the interconnection module 4, applies the activation to the sum, performs Gibbs sampling, and writes the final first-order visible-layer vector $v_1$ back to the storage unit 53.
  • In step S9, the controller unit 2 then reads the next CONFIG instruction from the instruction storage unit, and according to the decoded control signal, the device configures the various constants required by the third-stage computation of this layer of the neural network. This configuration is basically the same as for the first stage, except that one additional learning-rate parameter must be configured.
  • In step S10, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit and, according to the decoded control signal, starts the third-stage computation. The main operation module 5 first sends the first-order visible-layer vector $v_1$ through the interconnection module 4 to each slave operation module 6, where it is saved to the first storage unit 63 of the slave operation module 6. The operation unit 61 of each slave operation module 6 reads the first-order visible-layer vector $v_1$ from the first storage unit, completes the dot-product operation of the weight vector and the first-order visible-layer vector $v_1$, and returns the intermediate result through the interconnection module. In the interconnection module 4, the intermediate results returned by the slave operation modules 6 are assembled, stage by stage, into a complete local-induced-field vector. The main operation module 5 obtains the return value of the interconnection module 4, reads the bias vector from the storage unit 53 according to the control signal decoded from the COMPUTE instruction, adds it to the vector returned by the interconnection module 4, applies the activation to the sum, and writes the final second-order hidden-layer vector $h_1$ back to the storage unit 53.
  • In step S11, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit and, according to the decoded control signal, starts the fourth-stage computation.
  • In the first sub-step, the main operation module 5 first sends the input neuron vector $v_0$ and the first-order hidden-layer vector $h_0$ through the interconnection module 4 to each slave operation module 6, where they are saved to the weight-gradient buffer unit 65 of the slave operation module 6.
  • In the second sub-step, the operation unit 61 of each slave operation module 6 reads the second-order hidden-layer vector $h_1$ from the first storage unit and selects the corresponding component of the first-order visible-layer vector $v_1$, completes the product operation of the hidden-layer vector with the corresponding visible-layer component, performs a vector subtraction between this intermediate result and the intermediate value cached in the previous sub-step, read from the weight-gradient buffer unit 65, and caches the computed intermediate result to the weight-gradient buffer unit 65.
  • In the last sub-step, the operation unit 61 of each slave operation module 6 reads the intermediate value of the previous sub-step from the weight-gradient buffer unit 65, multiplies it by the learning rate to obtain the weight update value, reads the corresponding weight from the weight buffer unit 64, performs a vector subtraction between the weight and the update value to obtain the updated weight, and caches it back to the weight buffer unit 64.
  • In this way, one self-learning pre-training iteration of the single-layer neural network is completed. After many iterations of learning, when the weights reach a certain convergence criterion (the weight update value is smaller than a certain threshold), the pre-training of the single-layer neural network ends, and pre-training of the next layer of the neural network can begin.
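Putting the single-layer iteration and the layer-by-layer schedule together, the following sketch of the outer control flow reuses the `pretrain_layer_iteration` routine sketched after the four-stage summary above; the layer sizes, tolerance, and iteration cap are illustrative assumptions:

```python
import numpy as np

def pretrain_network(layer_sizes, samples, lr=0.1, tol=1e-4, max_iter=1000):
    """Layer-by-layer pre-training: iterate each layer until the weight
    update falls below `tol`, then feed its hidden output to the next layer.
    Relies on pretrain_layer_iteration() from the earlier sketch."""
    rng = np.random.default_rng(0)
    weights = []
    for n_vis, n_hid in zip(layer_sizes, layer_sizes[1:]):
        W = rng.normal(0, 0.01, (n_hid, n_vis))
        b, c = np.zeros(n_hid), np.zeros(n_vis)
        for _ in range(max_iter):
            W_old = W.copy()
            for v0 in samples:
                W, b, c = pretrain_layer_iteration(v0, W, b, c, lr, rng)
            if np.max(np.abs(W - W_old)) < tol:  # convergence criterion
                break
        weights.append((W, b, c))
        # propagate samples through the trained layer for the next one
        samples = [1.0 / (1.0 + np.exp(-(W @ v + b))) for v in samples]
    return weights
```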

Abstract

An apparatus and method for performing an artificial neural network self-learning operation. The apparatus comprises an instruction storage unit (1), a controller unit (2), a data access unit (3), an interconnection module (4), a main operation module (5), and multiple slave operation modules (6). The method performs self-learning pre-training of a multi-layer neural network in a layer-by-layer training manner: for each layer of the network, the operation is iterated until the weight update is smaller than a certain threshold, at which point the self-learning pre-training of that layer is complete. Each iteration can be divided into four stages: the first three stages respectively compute the first-order hidden-layer intermediate value, the first-order visible-layer intermediate value, and the second-order hidden-layer intermediate value, and the last stage uses the intermediate values of the first three stages to update the weights.

Description

Apparatus and method for performing artificial neural network self-learning operation

Technical Field

The present invention relates to artificial neural network technology, and in particular to an apparatus and method for performing an artificial neural network self-learning operation.

Background Art

Multi-layer artificial neural networks are widely used in fields such as pattern recognition, image processing, function approximation, and optimization computation. In recent years, owing to their high recognition accuracy and good parallelizability, multi-layer artificial neural networks have received increasingly broad attention from both academia and industry.

A typical training method for multi-layer artificial neural networks is the back-propagation (BP) algorithm. This method is a representative type of supervised learning: it requires a large number of labeled training samples during training, yet collecting those samples is very costly. Moreover, during training with this method, the error-correction signal weakens as the number of propagation layers increases, so training tends to converge to local minima and converges slowly. Therefore, first pre-training the network parameters with a self-learning algorithm that converges quickly and needs no labeled training samples, and then fine-tuning the multi-layer neural network with back-propagation training, has become a new research hotspot. Among these steps, the self-learning operation used as pre-training is particularly important.

One known method of supporting multi-layer artificial neural network self-learning operations is to use a general-purpose processor. That method supports the above algorithm by executing general-purpose instructions using a general-purpose register file and general-purpose functional units. One disadvantage of this method is that the arithmetic performance of a single general-purpose processor is low and cannot meet the performance requirements of typical multi-layer artificial neural network operations; when multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck. In addition, a general-purpose processor must decode the multi-layer artificial neural network pre-training operation into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding imposes a large power overhead.

Another known method of supporting multi-layer artificial neural network pre-training is to use a graphics processing unit (GPU). That method supports the above algorithm by executing generic SIMD instructions using a general-purpose register file and generic stream processing units. Since the GPU is a device dedicated to graphics and image operations and scientific computing, with no special support for multi-layer artificial neural network operations, a large amount of front-end decoding work is still required to perform them, which brings a large amount of additional overhead. Moreover, the GPU has only a small on-chip cache, and the model data (weights) of a multi-layer artificial neural network must be repeatedly transferred from off-chip; off-chip bandwidth therefore becomes the main performance bottleneck while also bringing huge power overhead.
Summary of the Invention

The problems to be solved by the present invention are that, in the prior art, pre-training a multi-layer neural network on a general-purpose processor (GPU, CPU) requires a long series of simple arithmetic and memory-access operations, so front-end decoding consumes considerable power; that the data memory-access overhead of existing general-purpose processors is large; and that the arithmetic performance of a single general-purpose processor is low.

The present invention provides an apparatus for performing an artificial neural network self-learning operation, comprising an instruction storage unit, a controller unit, a data access unit, an interconnection module, a main operation module, and a plurality of slave operation modules, wherein: the instruction storage unit is configured to read in instructions through the data access unit and cache the read-in instructions; the controller unit is configured to read an instruction from the instruction storage unit, decode it into control signals that control the behavior of the interconnection module, the main operation module, and the slave operation modules, and then distribute the respective control signals to the respective modules; the data access unit is used to access the external address space to complete the loading and storing of data; the interconnection module, which has different topology implementations, is used to distribute the input vector of the main operation module to the plurality of slave operation modules and to merge the computation results of the slave operation modules and return them to the main operation module; the main operation module is configured to apply the activation function and Gibbs sampling to the intermediate values returned by the interconnection module and to update the bias of the activation function; and the slave operation modules are used for the dot-product operation between the input vector and the corresponding weight matrix, the product operation between the corresponding component scalars of the input vector and the corresponding columns of the weight matrix, and the update of the weight matrix.

According to a specific embodiment of the present invention, the main operation module includes an operation unit, a data dependency determination unit, and a storage unit, wherein the storage unit is used to cache the input data and output data used by the main operation module during computation, and the operation unit is used to complete the operations of the main operation module; the data dependency determination unit is the port through which the operation unit reads and writes the storage unit, used to ensure read-write consistency of the data in the storage unit.

According to a specific embodiment of the present invention, the data dependency determination unit is used to determine whether a dependency exists between the data of a control signal that has not yet been executed and a control signal currently being executed; if not, the group of control signals is allowed to issue immediately; otherwise, the control signal may issue only after all control signals on which it depends have completely executed.

According to a specific embodiment of the present invention, the data dependency determination unit is further used to send read data to the slave operation modules through the interconnection module.

According to a specific embodiment of the present invention, each slave operation module includes an operation unit, a data dependency determination unit, a first storage unit, a second storage unit, and a third storage unit, wherein the operation unit is used to receive the control signal sent by the controller unit and perform arithmetic and logic operations; the data dependency determination unit is used to monitor read and write operations on the storage units to ensure that there are no consistency conflicts between reads and writes; the first storage unit is used to cache the input vectors and computation results of the neurons; the second storage unit is used to cache the weight data that the slave operation module needs during computation; and the third storage unit is used to cache the weight gradient data that the corresponding slave operation module needs when updating the weights.
The present invention further provides a method for performing a layer-by-layer artificial neural network self-learning operation. The artificial neural network comprises multiple neurons in two or more layers, and self-learning pre-training of the artificial neural network adopts layer-by-layer training. For each layer, the pre-training is divided into four stages:

In the first stage, the input neuron vector $v_0$ and the weight matrix $W$ undergo a dot-product operation to obtain the local induced field; the local induced field is nonlinearly transformed by the activation function, and Gibbs sampling is then applied to obtain the first-order hidden-layer intermediate value $h_0$.

In the second stage, the transpose $W^T$ of the weight matrix and the transpose $h_0^T$ of the first-order hidden-layer intermediate value first undergo a dot-product operation; the resulting local induced field is nonlinearly transformed by the activation function, and Gibbs sampling is then applied to obtain the first-order visible-layer intermediate value $v_1$.

In the third stage, the first-order visible-layer intermediate value $v_1$ and the weight matrix $W$ undergo a dot-product operation to obtain the local induced field, which is nonlinearly transformed by the activation function to obtain the second-order hidden-layer intermediate value $h_1$.

In the fourth stage, the weights are updated according to the following formulas:

$W \leftarrow W + \epsilon (h_0 \times v_0 - h_1 \times v_1)$   (1)
$b \leftarrow b + \epsilon (h_0 - h_1)$   (2)
$c \leftarrow c + \epsilon (v_0 - v_1)$   (3)

where the vector $b$ is the bias added to the partial sums of the vector-weight-matrix dot products before the activation function in the first and third stages, and the vector $c$ is the bias in the second stage; in the formulas, "×" denotes the vector outer (cross) product, and $\epsilon$ is the learning rate.
Compared with the prior art, the present invention optimizes the multi-layer neural network pre-training instructions: the processor can complete the pre-training learning of one neural network layer with only a single instruction, which streamlines the front-end decoding overhead of general-purpose processor instructions. At the same time, the invention comprises a main operation module, multiple slave operation modules, and a large amount of distributed on-chip storage that alleviates memory-access overhead, so the neural network pre-training operation can be executed in parallel without frequent off-chip data accesses. In short, the performance-to-power ratio of the present invention is far higher than that of a general-purpose processor.

The present invention can be applied to (but is not limited to) the following scenarios: data processing, robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, cloud servers, still cameras, camcorders, projectors, watches, earphones, mobile storage, wearable devices, and other electronic products; aircraft, ships, vehicles, and other types of transportation; televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, range hoods, and other household appliances; as well as nuclear magnetic resonance instruments, B-ultrasound scanners, electrocardiographs, and other medical equipment.
Brief Description of the Drawings

For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:

FIG. 1 shows an example block diagram of the overall structure of an apparatus for performing artificial neural network self-learning pre-training according to an embodiment of the present invention.

FIG. 2 schematically shows an H-tree implementation of the interconnection module in an apparatus for performing artificial neural network self-learning pre-training according to an embodiment of the present invention.

FIG. 3 shows an example block diagram of the structure of the main operation module in an apparatus for performing artificial neural network self-learning pre-training according to an embodiment of the present invention.

FIG. 4 shows an example block diagram of the structure of a slave operation module in an apparatus for performing artificial neural network self-learning pre-training according to an embodiment of the present invention.

FIG. 5 shows an example block diagram of the first and third stages of the neural network self-learning pre-training process according to an embodiment of the present invention.

FIG. 6 shows an example block diagram of the second stage of the neural network self-learning pre-training process according to an embodiment of the present invention.

FIG. 7 shows an example flow chart of the fourth stage of the neural network self-learning pre-training process according to an embodiment of the present invention.

FIG. 8 shows an example flow chart of one iteration of single-layer neural network self-learning pre-training according to an embodiment of the present invention.

In all of the drawings, the same devices, components, units, and so on are denoted by the same reference numerals.

Detailed Description

Other aspects, advantages, and salient features of the present invention will become apparent to those skilled in the art from the following detailed description of exemplary embodiments of the invention taken in conjunction with the accompanying drawings.

In the present invention, the terms "comprise" and "contain" and their derivatives are meant to be inclusive rather than limiting; the term "or" is inclusive, meaning and/or.

In this specification, the various embodiments described below for explaining the principles of the present invention are illustrative only and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is intended to assist a comprehensive understanding of the exemplary embodiments of the invention as defined by the claims and their equivalents. The description includes various specific details to assist understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. Furthermore, descriptions of well-known functions and structures are omitted for clarity and conciseness. Throughout the drawings, the same reference numerals are used for similar functions and operations.
In the self-learning pre-training of a multi-layer artificial neural network according to an embodiment of the present invention, the artificial neural network comprises multiple neurons in two or more layers. Self-learning pre-training of the artificial neural network adopts layer-by-layer training, training from the first layer through the last. For each layer, pre-training is divided into four stages:

In the first stage, the input neuron vector $v_0$ is first dot-multiplied with the weight matrix $W$ to obtain the local induced field; the local induced field is nonlinearly transformed by the activation function, and Gibbs sampling is then applied to obtain the first-order hidden-layer intermediate value $h_0$.

In the second stage, the transpose $W^T$ of the weight matrix and the transpose $h_0^T$ of the first-order hidden-layer intermediate value first undergo a dot-product operation; the resulting local induced field is nonlinearly transformed by the activation function, and Gibbs sampling is then applied to obtain the first-order visible-layer intermediate value $v_1$.

The third stage is similar to the first, except that the third-stage input is the first-order visible-layer intermediate value $v_1$, and no Gibbs sampling is required before computing the second-order hidden-layer intermediate value $h_1$.

In the fourth stage, the weights are updated according to the following formulas:

$W \leftarrow W + \epsilon (h_0 \times v_0 - h_1 \times v_1)$   (1)
$b \leftarrow b + \epsilon (h_0 - h_1)$   (2)
$c \leftarrow c + \epsilon (v_0 - v_1)$   (3)

where the vector $b$ is the bias added to the partial sums of the vector-weight-matrix dot products before the activation function in the first and third stages, and the vector $c$ is the bias in the second stage; in the formulas, "×" denotes the vector outer (cross) product, and $\epsilon$ is the learning rate.
FIG. 1 shows an example block diagram of the overall structure of the apparatus for performing artificial neural network self-learning pre-training according to the present invention. As shown in FIG. 1, the apparatus includes an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a main operation module 5, and multiple slave operation modules 6. The instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the main operation module 5, and the slave operation modules 6 can all be implemented by hardware circuits (for example, application-specific integrated circuits, ASICs).

The instruction storage unit 1 reads in instructions through the data access unit 3 and caches the read-in instructions.

The controller unit 2 reads instructions from the instruction storage unit 1, decodes them into control signals that control the behavior of other modules, and sends them to the other modules, such as the data access unit 3, the main operation module 5, and the slave operation modules 6.

The data access unit 3 can access the external address space, reading and writing data directly to each storage unit inside the apparatus to complete the loading and storing of data.
FIG. 2 schematically shows the structure of the interconnection module 4. The interconnection module 4 constitutes the data path between the main operation module 5 and the multiple slave operation modules 6 and may take different structures. In one example, the interconnection is a binary-tree path composed of multiple nodes: each node sends the upstream data identically to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the result to its upstream node. For example, during the first and third stages of the neural network self-learning operation, the input vector in the main operation module 5 is sent to each slave operation module 6 through the interconnection module 4; after each slave operation module 6 completes its computation, the neuron values output by the slave operation modules are assembled, stage by stage in the interconnection module, into a complete vector composed of local induced fields, which is returned to the main operation module 5 as the intermediate result vector for the activation function and, as required, Gibbs sampling. During the second stage, the first-order hidden-layer intermediate value vector $h_0$ in the main operation module 5 is sent to each slave operation module 6 through the interconnection module 4; when the computation of the slave operation modules 6 is complete, the vectors returned by each pair of downstream nodes are added into a single vector at the current node and returned to the upstream node, and the result is returned to the main operation module 5 as the intermediate result vector for the activation function and Gibbs sampling.
FIG. 3 shows an example block diagram of the structure of the main operation module 5 in the apparatus for performing artificial neural network pre-training operations according to the present invention. As shown in FIG. 3, the main operation module 5 includes an operation unit 51, a data dependency determination unit 52, and a storage unit 53.

The storage unit 53 is used to cache the input data and output data used by the main operation module 5 during computation; the operation unit 51 performs the various operation functions of the main operation module 5; and the data dependency determination unit 52 is the port through which the operation unit 51 reads and writes the storage unit 53, while ensuring read-write consistency of the data in the storage unit. Specifically, the data dependency determination unit 52 determines whether a dependency exists between the data of a control signal that has not yet been executed and a control signal currently being executed; if not, the group of control signals is allowed to issue immediately; otherwise, the control signal may issue only after all control signals on which it depends have completely executed. For example, all control signals sent to the data dependency unit 52 are stored in an instruction queue inside the data dependency unit 52; in this queue, if the read-data range of a read instruction conflicts with the write-data range of a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has been executed. At the same time, the data dependency determination unit 52 is also responsible for sending read data to the slave operation modules through the interconnection module 4, while the output data of the slave operation modules 6 is sent directly to the operation unit 51 through the interconnection module 4. The instructions output by the controller unit 2 are sent to the operation unit 51 and the data dependency determination unit 52 to control their behavior.
FIG. 4 shows an example block diagram of the structure of a slave operation module 6 in the apparatus for performing artificial neural network pre-training according to the present invention. As shown in FIG. 4, each slave operation module 6 includes an operation unit 61, a data dependency determination unit 62, a first storage unit 63, a second storage unit 64, and a third storage unit 65.

The operation unit 61 receives the control signals sent by the controller unit 2 and performs arithmetic and logic operations.

The data dependency determination unit 62 is responsible for the read and write operations on the storage units during computation, and it ensures that there are no consistency conflicts between those reads and writes. For example, all control signals sent to the data dependency unit 62 are stored in an instruction queue inside the data dependency unit 62; in this queue, if the read-data range of a read instruction conflicts with the write-data range of a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has been executed.
The first storage unit 63 caches, during the several stages, the input neuron vector $v_0$, the first-order hidden-layer intermediate value $h_0$, the first-order visible-layer intermediate value $v_1$, the second-order hidden-layer intermediate value $h_1$, and the dot-product results of the input vectors and the weight matrix computed in each stage.

The second storage unit 64 caches the weight data that the slave operation module 6 needs during computation. Each slave operation module stores only the columns of the weight matrix that correspond to the scalar data stored by that slave operation module 6.

The third storage unit 65 caches the weight gradient data that the corresponding slave operation module needs when updating the weights. The weight gradient data stored by each slave operation module 6 corresponds to the weight data it stores.
The slave operation modules 6 implement the parallel first half of the first three stages of the artificial neural network self-learning pre-training process, as well as the weight update of formula (1) in the last stage.

Taking pre-training of an artificial neural network deep belief network (DBN) as an example, the multiplication of the weight matrix $W$ (or $W^T$) with the input neuron vector in the first three stages can be partitioned into independent, parallel computing sub-tasks. In the first and third stages, each slave operation module 6 performs dot-product operations using the same input vector and the weights corresponding to different components of the output vector, obtaining the partial sums corresponding to the different components of the output vector; after repeated accumulation, the partial sums of the respective output components are assembled, stage by stage in the interconnection module 4, into one complete local-induced-field vector. Each slave operation module 6 only needs to compute the local induced field corresponding to its own output neuron value; the different local-induced-field components are assembled, stage by stage in the interconnection module 4, into one complete local-induced-field vector that is transmitted to the main operation module for the activation function and the subsequent sampling. In the second stage, each slave operation module 6 computes only the product of the corresponding partial scalar of the input first-order hidden-layer intermediate value vector $h_0$ with the corresponding column of the weight matrix; each resulting output vector is a to-be-accumulated partial sum of the final result, and these partial sums are added pairwise, stage by stage in the interconnection module, to obtain the final result. Each slave operation module 6 computes a partial sum of the local induced field of the output first-order visible-layer vector, and all partial sums complete the summation in the interconnection module 4 to obtain the final local induced field. The first three stages compute the intermediate values used for updating the weights, and the main operation module 5 performs subsequent operations on the outputs of the first three stages to obtain the weight update values. In the last stage, the weight update performed by the slave operation modules 6 according to formula (1) can also be divided into three sub-steps:
1. Each slave operation module 6 computes the intermediate product of the first-order hidden-layer intermediate value vector $h_0$ and the corresponding partial scalar of the input neuron vector $v_0$;

2. Each slave operation module 6 computes the product of the second-order hidden-layer intermediate value vector $h_1$ and the corresponding partial scalar of the first-order visible-layer vector $v_1$, and computes the vector difference between this and the intermediate value from the first sub-step;

3. Each slave operation module 6 multiplies the difference from the second sub-step by the learning rate to obtain the weight update value, and then performs a vector subtraction with the weight $W$ to obtain the updated weight.

It is worth noting that the above three sub-steps are only one example description of how the slave operation modules 6 update the weights, and the user may fine-tune the details; for example, the product computation of the first sub-step and that of the second sub-step may be interchanged, or the multiplication by the learning rate in the third sub-step may be moved forward into the second sub-step or even split across the first two sub-steps.
According to an embodiment of the present invention, an instruction set for performing artificial neural network forward operations on the aforementioned apparatus is also provided. The instruction set includes the CONFIG instruction, the COMPUTE instruction, the IO instruction, the NOP instruction, the JUMP instruction, and the MOVE instruction, where:

the CONFIG instruction configures, before the computation of each layer of the artificial neural network begins, the various constants required by the current layer's computation;

the COMPUTE instruction completes the arithmetic and logic computation of each layer of the artificial neural network;

the IO instruction reads in the input data required by the computation from the external address space and stores the data back to the external space after the computation completes;

the NOP instruction is responsible for clearing the control signals currently loaded into all internal control-signal buffer queues, guaranteeing that all instructions before the NOP instruction have completed; the NOP instruction itself does not contain any operation;

the JUMP instruction is responsible for jumping the address of the next instruction that the controller will read from the instruction storage unit, and is used to implement control-flow jumps;

the MOVE instruction is responsible for moving data at one address in the apparatus's internal address space to another address in the internal address space; this process is independent of the operation unit and occupies no operation-unit resources during execution.
FIG. 5 shows an example block diagram of the first and third stages of the neural network self-learning pre-training process according to an embodiment of the present invention. In the different slave operation modules 6, the input vector broadcast by the interconnection module 4 undergoes a dot-product operation with that slave operation module's weight vector, yielding the local-induced-field partial sum of the corresponding output neuron value; all these output local-induced-field values form the intermediate result vector, which, after the bias vector is added and the activation operation applied, yields the final output neuron vector of this layer of the neural network. The formula is out = f(w*in + b), where out is the output vector, in is the input vector, b is the bias vector, w is the weight matrix, and f is the activation function. The weight vector of each slave operation module 6 is the column vector of the weight matrix corresponding to that slave operation module 6. The interconnection module 4 sends the input vector [I0, ..., In] to all slave operation units, where it is temporarily stored in the first storage unit. The i-th slave operation unit computes the dot product of its corresponding weight vector [Wi0, ..., Win] with the input vector. The results output by the slave operation units are assembled, via the interconnection module 4, into a complete local-induced-field vector that is returned to the main operation module 5, where the activation function operation and, possibly, Gibbs sampling are performed to obtain the final output vector [O0, O1, ..., On].
FIG. 6 shows an example block diagram of the second stage of the neural network self-learning pre-training process according to an embodiment of the present invention. The process of computing the output first-order visible-layer vector $v_1$ is as follows: the interconnection module 4 broadcasts the first-order hidden-layer vector value; each slave operation module 6 takes the corresponding partial scalar h0i of $h_0$ and multiplies it by the corresponding column [Wi0, ..., Win] of the weight matrix; each output vector obtained is a to-be-accumulated partial sum of the local induced field of the first-order visible-layer vector, and these partial sums are added pairwise, stage by stage in the interconnection module 4, to obtain the final local induced field. The computed local induced field is returned to the main operation module 5, where the activation function operation and, possibly, Gibbs sampling are performed to obtain the final output first-order visible-layer vector $v_1$.
FIG. 7 shows a flow chart of the fourth stage of the neural network self-learning pre-training process according to an embodiment of the present invention. In the last stage, the weight update performed by the slave operation modules 6 according to formula (1) can again be divided into three sub-steps:

1. Each slave operation module 6 computes the intermediate product of the first-order hidden-layer intermediate value vector $h_0$ and the corresponding partial scalar of the input neuron vector $v_0$, and caches it to the third storage unit shown in FIG. 4; this sub-step is similar to the second-stage block diagram shown in FIG. 6, except that its inputs are the first-order hidden-layer intermediate value vector $h_0$ and the input neuron vector $v_0$;

2. Each slave operation module 6 computes the product of the second-order hidden-layer intermediate value vector $h_1$ and the corresponding partial scalar of the first-order visible-layer vector $v_1$, computes the vector difference between this and the intermediate value from the first sub-step, and caches the result to the third storage unit shown in FIG. 4;

3. Each slave operation module 6 multiplies the difference from the second sub-step by the learning rate to obtain the weight update value, and then performs a vector subtraction with the weight $W$ to obtain the updated weight.

It is worth noting that the above three sub-steps are only one example description of how the slave operation modules 6 update the weights, and the user may fine-tune the details; for example, the product computation of the first sub-step and that of the second sub-step may be interchanged, or the multiplication by the learning rate in the third sub-step may be moved forward into the second sub-step or even split across the first two sub-steps.
FIG. 8 shows a flow chart of the self-learning pre-training operation of one layer of an artificial neural network according to an embodiment. Since multi-layer artificial neural network self-learning pre-training can adopt a layer-by-layer training manner, pre-training of a multi-layer artificial neural network can be implemented by invoking this process multiple times. The flow chart describes the process of implementing the single-layer neural network self-learning pre-training operation shown in FIG. 4 using the apparatus and instruction set of the present invention.
In step S1, an IO instruction is pre-stored at the first address of the instruction storage unit 1.

In step S2, the operation starts: the controller unit 2 reads this IO instruction from the first address of the instruction storage unit 1, and according to the decoded control signal, the data access unit 3 reads all corresponding artificial neural network operation instructions from the external address space and caches them in the instruction storage unit 1.

In step S3, the controller unit 2 then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit 3 reads all the data required by the main operation module 5 (for example, including the input neuron vector $v_0$, the activation function interpolation table, the learning rate, the biases, and so on) from the external address space into the storage unit 53 of the main operation module 5.

In step S4, the controller unit 2 then reads the next IO instruction from the instruction storage unit, and according to the decoded control signal, the data access unit 3 reads from the external address space the weight matrix data required by the slave operation modules 6.

In step S5, the controller unit 2 then reads the next CONFIG instruction from the instruction storage unit, and according to the decoded control signal, the apparatus configures the various constants required by the first-stage computation of this layer of the neural network. For example, the operation units 51 and 61 configure the values of their internal registers according to the parameters in the control signal; those parameters include, for example, the precision setting of this layer's computation and the data of the activation function.

In step S6, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit and, according to the decoded control signal, starts the first-stage computation. The main operation module 5 first sends the input neuron vector $v_0$ through the interconnection module 4 to each slave operation module 6, where it is saved to the first storage unit 63 of the slave operation module 6. The operation unit 61 of each slave operation module 6 reads its weight vector (the column vector of the weight matrix corresponding to that slave operation module 6) from the second storage unit 64, reads the input neuron vector $v_0$ from the first storage unit, completes the dot-product operation of the weight vector and the input neuron vector $v_0$, and returns the intermediate result through the interconnection module. In the interconnection module 4, the intermediate results returned by the slave operation modules 6 are assembled, stage by stage, into a complete local-induced-field vector. The main operation module 5 obtains the return value of the interconnection module 4, reads the bias vector from the storage unit 53 according to the control signal decoded from the COMPUTE instruction, adds it to the vector returned by the interconnection module 4, applies the activation to the sum, performs Gibbs sampling, and writes the final first-order hidden-layer vector $h_0$ back to the storage unit 53.
In step S7, the controller unit 2 then reads the next CONFIG instruction from the instruction storage unit, and according to the decoded control signal, the apparatus configures the various constants required by the second-stage computation of this layer of the neural network.

In step S8, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit and, according to the decoded control signal, starts the second-stage computation. The main operation module 5 first sends the first-order hidden-layer vector $h_0$ through the interconnection module 4 to each slave operation module 6, where it is saved to the first storage unit 63 of the slave operation module 6. The operation unit 61 of each slave operation module 6 reads its weight vector (the column vector of the weight matrix corresponding to that slave operation module 6) from the second storage unit 64, selects the corresponding scalar of the first-order hidden-layer vector $h_0$ from the first storage unit, completes the product operation of the weight vector with the corresponding scalar of the first-order hidden-layer vector $h_0$, and returns the intermediate result through the interconnection module. In the interconnection module 4, the intermediate results returned by the slave operation modules 6 are added, stage by stage, into a complete local-induced-field vector. The main operation module 5 obtains the return value of the interconnection module 4, reads the bias vector from the storage unit 53 according to the control signal decoded from the COMPUTE instruction, adds it to the vector returned by the interconnection module 4, applies the activation to the sum, performs Gibbs sampling, and writes the final first-order visible-layer vector $v_1$ back to the storage unit 53.
In step S9, the controller unit 2 then reads the next CONFIG instruction from the instruction storage unit, and according to the decoded control signal, the apparatus configures the various constants required by the third-stage computation of this layer of the neural network. This configuration is basically the same as for the first stage, except that one additional learning-rate parameter must be configured.

In step S10, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit and, according to the decoded control signal, starts the third-stage computation. The main operation module 5 first sends the first-order visible-layer vector $v_1$ through the interconnection module 4 to each slave operation module 6, where it is saved to the first storage unit 63 of the slave operation module 6. The operation unit 61 of each slave operation module 6 reads the first-order visible-layer vector $v_1$ from the first storage unit, completes the dot-product operation of the weight vector and the first-order visible-layer vector $v_1$, and returns the intermediate result through the interconnection module. In the interconnection module 4, the intermediate results returned by the slave operation modules 6 are assembled, stage by stage, into a complete local-induced-field vector. The main operation module 5 obtains the return value of the interconnection module 4, reads the bias vector from the storage unit 53 according to the control signal decoded from the COMPUTE instruction, adds it to the vector returned by the interconnection module 4, applies the activation to the sum, and writes the final second-order hidden-layer vector $h_1$ back to the storage unit 53.
In step S11, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit and, according to the decoded control signal, starts the fourth-stage computation. In the first sub-step, the main operation module 5 first sends the input neuron vector $v_0$ and the first-order hidden-layer vector $h_0$ through the interconnection module 4 to each slave operation module 6, where they are saved to the weight-gradient buffer unit 65 of the slave operation module 6. In the second sub-step, the operation unit 61 of each slave operation module 6 reads the second-order hidden-layer vector $h_1$ from the first storage unit and selects the corresponding component of the first-order visible-layer vector $v_1$, completes the product operation of the hidden-layer vector with the corresponding visible-layer component, performs a vector subtraction between this intermediate result and the intermediate value cached in the previous sub-step, read from the weight-gradient buffer unit 65, and caches the computed intermediate result to the weight-gradient buffer unit 65. In the last sub-step, the operation unit 61 of each slave operation module 6 reads the intermediate value of the previous sub-step from the weight-gradient buffer unit 65, multiplies it by the learning rate to obtain the weight update value, reads the corresponding weight from the weight buffer unit 64, performs a vector subtraction between the weight and the update value to obtain the updated weight, and caches it back to the weight buffer unit 64. In this way, one self-learning pre-training iteration of the single-layer neural network is completed. After many iterations of learning, when the weights reach a certain convergence criterion (the weight update value is smaller than a certain threshold), the pre-training of the single-layer neural network ends, and pre-training of the next layer of the neural network can begin.
By adopting the apparatus and instruction set for performing artificial neural network self-learning pre-training operations, the problems of insufficient CPU and GPU computing performance and large front-end decoding overhead are solved, and support for the forward operations of multi-layer artificial neural networks is effectively improved.

By adopting dedicated on-chip caches for the forward operations of multi-layer artificial neural networks, the reusability of the input neurons and the weight data is fully exploited, avoiding repeatedly reading these data from memory, reducing the memory-access bandwidth, and preventing memory bandwidth from becoming a bottleneck for the forward-operation performance of multi-layer artificial neural networks.

The processes or methods depicted in the preceding figures may be performed by processing logic comprising hardware (for example, circuits, dedicated logic, etc.), firmware, software (for example, software embodied on a non-transitory computer-readable medium), or a combination thereof. Although the processes or methods have been described above in terms of certain ordered operations, it should be understood that certain of the described operations may be performed in a different order. Furthermore, some operations may be performed in parallel rather than sequentially.

In the foregoing specification, embodiments of the present invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made to the embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (6)

  1. An apparatus for performing an artificial neural network self-learning operation, comprising an instruction storage unit, a controller unit, a data access unit, an interconnection module, a main operation module, and a plurality of slave operation modules, wherein:
    the instruction storage unit is configured to read in instructions through the data access unit and cache the read-in instructions;
    the controller unit is configured to read an instruction from the instruction storage unit and decode the instruction into control signals that control the behavior of the interconnection module, the main operation module, and the slave operation modules, and then distribute the respective control signals to the respective modules;
    the data access unit is configured to access the external address space to complete the loading and storing of data;
    the interconnection module has different topology implementations and is configured to distribute the input vector of the main operation module to the plurality of slave operation modules, and to merge the computation results of the slave operation modules and return them to the main operation module;
    the main operation module is configured to apply the activation function and Gibbs sampling to the intermediate values returned by the interconnection module, and to update the bias of the activation function;
    the slave operation modules are configured for the dot-product operation between the input vector and the corresponding weight matrix, the product operation between the corresponding component scalars of the input vector and the corresponding columns of the weight matrix, and the update of the weight matrix.
  2. The apparatus for performing an artificial neural network self-learning operation according to claim 1, wherein the main operation module comprises an operation unit, a data dependency determination unit, and a storage unit, wherein
    the storage unit is configured to cache the input data and output data used by the main operation module during computation,
    the operation unit is configured to complete the operations of the main operation module; and
    the data dependency determination unit is the port through which the operation unit reads and writes the storage unit, configured to ensure read-write consistency of the data in the storage unit.
  3. The apparatus for performing an artificial neural network self-learning operation according to claim 2, wherein the data dependency determination unit is configured to determine whether a dependency exists between the data of a control signal that has not yet been executed and a control signal currently being executed; if not, the group of control signals is allowed to issue immediately; otherwise, the group of control signals is allowed to issue only after all control signals on which that control signal depends have completely executed.
  4. The apparatus for performing an artificial neural network self-learning operation according to claim 3, wherein the data dependency determination unit is further configured to send read data to the slave operation modules through the interconnection module.
  5. The apparatus for performing an artificial neural network self-learning operation according to claim 1, wherein each slave operation module comprises an operation unit, a data dependency determination unit, a first storage unit, a second storage unit, and a third storage unit, wherein
    the operation unit is configured to receive the control signal sent by the controller unit and perform arithmetic and logic operations;
    the data dependency determination unit is configured to monitor read and write operations on the storage units to ensure that there are no consistency conflicts in reading from and writing to the storage units;
    the first storage unit is configured to cache the input vectors and computation results of the neurons;
    the second storage unit is configured to cache the weight data that the slave operation module needs during computation; and
    the third storage unit is configured to cache the weight gradient data that the corresponding slave operation module needs when updating the weights.
  6. A method for performing a layer-by-layer artificial neural network self-learning operation, the artificial neural network comprising multiple neurons in two or more layers, wherein self-learning pre-training of the artificial neural network adopts layer-by-layer training and, for each layer, the pre-training is divided into four stages:
    in the first stage, the input neuron vector $v_0$ and the weight matrix $W$ undergo a dot-product operation to obtain the local induced field; the local induced field is nonlinearly transformed by the activation function, and Gibbs sampling is then applied to obtain the first-order hidden-layer intermediate value $h_0$;
    in the second stage, the transpose $W^T$ of the weight matrix and the transpose $h_0^T$ of the first-order hidden-layer intermediate value first undergo a dot-product operation; the resulting local induced field is nonlinearly transformed by the activation function, and Gibbs sampling is then applied to obtain the first-order visible-layer intermediate value $v_1$;
    in the third stage, the first-order visible-layer intermediate value $v_1$ and the weight matrix $W$ undergo a dot-product operation to obtain the local induced field, which is nonlinearly transformed by the activation function to obtain the second-order hidden-layer intermediate value $h_1$;
    in the fourth stage, the weights are updated according to the following formulas:
    $W \leftarrow W + \epsilon (h_0 \times v_0 - h_1 \times v_1)$   (1)
    $b \leftarrow b + \epsilon (h_0 - h_1)$   (2)
    $c \leftarrow c + \epsilon (v_0 - v_1)$   (3)
    where the vector $b$ is the bias added to the partial sums of the vector-weight-matrix dot products before the activation function in the first and third stages, and the vector $c$ is the bias in the second stage; in the formulas, "×" denotes the vector outer (cross) product, and $\epsilon$ is the learning rate.
PCT/CN2016/080320 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation WO2017185248A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2016/080320 WO2017185248A1 (zh) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation
EP16899762.5A EP3451240A4 (en) 2016-04-27 2016-04-27 DEVICE AND METHOD FOR CARRYING OUT A SELF-LEARNING OPERATION OF AN ARTIFICIAL NEURONAL NETWORK
US16/174,108 US20190065953A1 (en) 2016-04-27 2018-10-29 Device and Method for Performing Self-Learning Operations of an Artificial Neural Network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/080320 WO2017185248A1 (zh) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/174,108 Continuation-In-Part US20190065953A1 (en) 2016-04-27 2018-10-29 Device and Method for Performing Self-Learning Operations of an Artificial Neural Network

Publications (1)

Publication Number Publication Date
WO2017185248A1 true WO2017185248A1 (zh) 2017-11-02

Family

ID=60161728

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/080320 WO2017185248A1 (zh) 2016-04-27 2016-04-27 用于执行人工神经网络自学习运算的装置和方法

Country Status (3)

Country Link
US (1) US20190065953A1 (zh)
EP (1) EP3451240A4 (zh)
WO (1) WO2017185248A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190065958A1 (en) * 2016-04-29 2019-02-28 Cambricon Technologies Corporation Limited Apparatus and Methods for Training in Fully Connected Layers of Convolutional Networks
WO2020093654A1 (en) * 2018-11-06 2020-05-14 Genesys Logic, Inc. Multichip system and data processing method adapted to the same for implementing neural network application

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10417304B2 (en) 2017-12-15 2019-09-17 International Business Machines Corporation Dual phase matrix-vector multiplication system
US11765604B2 (en) 2021-12-16 2023-09-19 T-Mobile Usa, Inc. Providing configuration updates to wireless telecommunication networks

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5631469A (en) * 1996-04-15 1997-05-20 The United States Of America As Represented By The Secretary Of The Army Neural network computing system for pattern recognition of thermoluminescence signature spectra and chemical defense
CN101625735A (zh) * 2009-08-13 2010-01-13 西安理工大学 FPGA implementation method for a recurrent neural network based on LS-SVM classification and regression learning
CN101833691A (zh) * 2010-03-30 2010-09-15 西安理工大学 FPGA-based implementation method for a serial structure of a least-squares support vector machine
CN103150596A (zh) * 2013-02-22 2013-06-12 百度在线网络技术(北京)有限公司 Training system for a back-propagation neural network DNN
CN105144203A (zh) * 2013-03-15 2015-12-09 谷歌公司 Signal processing system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5909681A (en) * 1996-03-25 1999-06-01 Torrent Systems, Inc. Computer system and computerized method for partitioning data for parallel processing
US6058206A (en) * 1997-12-01 2000-05-02 Kortge; Chris Alan Pattern recognizer with independent feature learning
US10521715B1 (en) * 2016-01-15 2019-12-31 Google Llc Long short-term memory cells with saturating gating functions

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5631469A (en) * 1996-04-15 1997-05-20 The United States Of America As Represented By The Secretary Of The Army Neural network computing system for pattern recognition of thermoluminescence signature spectra and chemical defense
CN101625735A (zh) * 2009-08-13 2010-01-13 西安理工大学 FPGA implementation method for a recurrent neural network based on LS-SVM classification and regression learning
CN101833691A (zh) * 2010-03-30 2010-09-15 西安理工大学 FPGA-based implementation method for a serial structure of a least-squares support vector machine
CN103150596A (zh) * 2013-02-22 2013-06-12 百度在线网络技术(北京)有限公司 Training system for a back-propagation neural network DNN
CN105144203A (zh) * 2013-03-15 2015-12-09 谷歌公司 Signal processing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3451240A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190065958A1 (en) * 2016-04-29 2019-02-28 Cambricon Technologies Corporation Limited Apparatus and Methods for Training in Fully Connected Layers of Convolutional Networks
WO2020093654A1 (en) * 2018-11-06 2020-05-14 Genesys Logic, Inc. Multichip system and data processing method adapted to the same for implementing neural network application
TWI715281B (zh) * 2018-11-06 2021-01-01 創惟科技股份有限公司 Multichip system for implementing a neural network application, data processing method adapted to the multichip system, and non-transitory computer-readable medium
CN112970037A (zh) * 2018-11-06 2021-06-15 创惟科技股份有限公司 Multichip system for implementing a neural network application, data processing method adapted to the multichip system, and non-transitory computer-readable medium
CN112970037B (zh) * 2018-11-06 2024-02-02 创惟科技股份有限公司 Multichip system for implementing a neural network application, data processing method adapted to the multichip system, and non-transitory computer-readable medium

Also Published As

Publication number Publication date
EP3451240A4 (en) 2020-01-01
EP3451240A1 (en) 2019-03-06
US20190065953A1 (en) 2019-02-28

Similar Documents

Publication Publication Date Title
CN107316078B (zh) Apparatus and method for performing artificial neural network self-learning operation
WO2017185387A1 (zh) Apparatus and method for performing the forward operation of a fully connected layer neural network
CN107341547B (zh) Apparatus and method for performing convolutional neural network training
WO2017185394A1 (zh) Apparatus and method for performing reverse training of a fully connected layer neural network
WO2017124641A1 (zh) Apparatus and method for performing reverse training of an artificial neural network
CN109284825B (zh) Apparatus and method for performing LSTM operations
WO2017185347A1 (zh) Apparatus and method for performing recurrent neural network and LSTM operations
CN109358900B (zh) Artificial neural network forward operation apparatus and method supporting discrete data representation
CN111260025B (zh) Apparatus and operation method for performing LSTM neural network operations
WO2017124642A1 (zh) Apparatus and method for performing the forward operation of an artificial neural network
WO2017185386A1 (zh) Apparatus and method for performing the forward operation of a convolutional neural network
CN107886166B (zh) Apparatus and method for performing artificial neural network operations
WO2018120016A1 (zh) Apparatus and operation method for performing LSTM neural network operations
WO2017185336A1 (zh) Apparatus and method for performing a pooling operation
WO2017185248A1 (zh) Apparatus and method for performing artificial neural network self-learning operation
EP3561732A1 (en) Operation apparatus and method for artificial neural network
WO2018058452A1 (zh) Apparatus and method for performing artificial neural network operations
WO2017177446A1 (zh) Artificial neural network reverse training apparatus and method supporting discrete data representation
WO2017185335A1 (zh) Apparatus and method for performing a batch normalization operation
CN111178492A (zh) Computing device and related product, and computation method for executing an artificial neural network model
CN109993276B (zh) Apparatus and method for performing reverse training of an artificial neural network

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2016899762

Country of ref document: EP

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16899762

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016899762

Country of ref document: EP

Effective date: 20181127