CN107316078A - Apparatus and method for performing artificial neural network self-learning operation

Apparatus and method for performing artificial neural network self-learning operation

Info

Publication number
CN107316078A
CN107316078A
Authority
CN
China
Prior art keywords
computing module
vector
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610267211.0A
Other languages
Chinese (zh)
Other versions
CN107316078B (en)
Inventor
李震
郭崎
陈云霁
陈天石
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Beijing Zhongke Cambrian Technology Co Ltd
Original Assignee
Beijing Zhongke Cambrian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Cambrian Technology Co Ltd filed Critical Beijing Zhongke Cambrian Technology Co Ltd
Priority to CN201610267211.0A priority Critical patent/CN107316078B/en
Priority to CN201910402047.3A priority patent/CN110188870B/en
Publication of CN107316078A publication Critical patent/CN107316078A/en
Application granted granted Critical
Publication of CN107316078B publication Critical patent/CN107316078B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The invention discloses an apparatus and method for performing an artificial neural network self-learning operation. The apparatus includes an instruction storage unit, a controller unit, a data access unit, an interconnection module, a master computing module and a plurality of slave computing modules. The invention performs self-learning pre-training of a multi-layer neural network in a layer-by-layer manner: for each layer of the network, the computation is iterated until the weight update falls below a certain threshold, at which point the self-learning pre-training of that layer is complete. Each iteration is divided into four stages: the first three stages respectively compute the first-order hidden layer intermediate values, the first-order visible layer intermediate values and the second-order hidden layer intermediate values, and the last stage updates the weights using the intermediate values of the first three stages.

Description

Apparatus and method for performing artificial neural network self-learning operation
Technical field
The present invention relates to artificial neural network technology, and in particular to an apparatus and method for performing an artificial neural network self-learning operation.
Background
Multi-layer artificial neural networks are widely used in fields such as pattern recognition, image processing, function approximation and optimization. In recent years, owing to their high recognition accuracy and good parallelism, multi-layer artificial neural networks have received increasing attention from both academia and industry.
A typical training method for multi-layer artificial neural networks is the back-propagation (BP) algorithm. This method is representative of supervised learning: it requires a large number of labeled training samples during training, and the cost of collecting such samples is very high. Moreover, during training the error-correction signal weakens as the number of layers it propagates through increases, so training tends to converge to local minima and converges slowly. Therefore, a new focus has emerged: first pre-training the network parameters with a self-learning algorithm that converges quickly and does not require labeled training samples, and then fine-tuning the multi-layer neural network with back-propagation training. The self-learning operation used for pre-training is therefore particularly important.
A known method of supporting the multi-layer artificial neural network self-learning operation is to use a general-purpose processor. This method supports the above algorithm by executing general-purpose instructions with general-purpose register files and general-purpose functional units. One disadvantage of this method is that the computational performance of a single general-purpose processor is low and cannot meet the performance requirements of common multi-layer artificial neural network operations. When multiple general-purpose processors execute in parallel, the communication between them in turn becomes a performance bottleneck. In addition, a general-purpose processor has to decode the multi-layer artificial neural network pre-training operation into a long sequence of arithmetic and memory-access instructions, and this front-end decoding incurs a large power overhead.
Another known method of supporting multi-layer artificial neural network pre-training is to use a graphics processing unit (GPU). This method supports the above algorithm by executing general-purpose SIMD instructions with general-purpose register files and general-purpose stream processing units. Since the GPU is a device dedicated to graphics, image and scientific computation, it provides no dedicated support for multi-layer artificial neural network operations; a large amount of front-end decoding work is still required before the multi-layer artificial neural network operation can be executed, which brings considerable overhead. Furthermore, the GPU has only a small on-chip cache, so the model data (weights) of the multi-layer artificial neural network must be carried on and off chip repeatedly; off-chip bandwidth becomes the main performance bottleneck and also brings a huge power overhead.
Summary of the invention
The problems to be solved by the present invention are that, in the prior art, a general-purpose processor (CPU, GPU) performing multi-layer neural network pre-training must carry out a long series of simple arithmetic and memory-access operations, so the front-end decoding power overhead is large, the data-access overhead of existing general-purpose processors is high, and the computational performance of a single general-purpose processor is low.
The present invention proposes an apparatus for performing an artificial neural network self-learning operation, including an instruction storage unit, a controller unit, a data access unit, an interconnection module, a master computing module and a plurality of slave computing modules, wherein: the instruction storage unit is configured to read in instructions through the data access unit and cache the read instructions; the controller unit is configured to read instructions from the instruction storage unit, decode each instruction into control signals that control the behavior of the interconnection module, the master computing module and the slave computing modules, and then distribute the respective control signals to the respective modules; the data access unit is configured to access an external address space and complete the loading and storing of data; the interconnection module, which may have different topological implementations, is configured to distribute the input vector of the master computing module to the plurality of slave computing modules, and to merge the computation results of the slave computing modules before returning them to the master computing module; the master computing module is configured to apply an activation function and Gibbs sampling to the intermediate values returned by the interconnection module, and to update the bias of the activation function; each slave computing module is configured to perform the dot-product operation between the input vector and the corresponding weight matrix, the product operation between the corresponding component scalar of the input vector and the corresponding weight matrix, and the updating of the weight matrix.
According to an embodiment of the present invention, the master computing module includes an arithmetic unit, a data dependency judging unit and a storage unit, wherein the storage unit is configured to cache the input data and output data used by the master computing module during computation, the arithmetic unit is configured to complete the computations of the master computing module, and the data dependency judging unit is the port through which the arithmetic unit reads and writes the storage unit, ensuring read-write consistency of the data in the storage unit.
According to an embodiment of the present invention, the data dependency judging unit is configured to judge whether a dependency exists between the data of a control signal that has not yet been executed and the data of a control signal that is being executed; if not, the group of control signals is allowed to be issued immediately; otherwise, the group of control signals is allowed to be issued only after all the control signals on which it depends have completed execution.
According to an embodiment of the present invention, the data dependency judging unit is further configured to send read data to the slave computing modules through the interconnection module.
According to an embodiment of the present invention, each slave computing module includes an arithmetic unit, a data dependency judging unit, a first storage unit, a second storage unit and a third storage unit, wherein the arithmetic unit is configured to receive the control signals sent by the controller unit and perform arithmetic and logic operations; the data dependency judging unit is configured to monitor read and write operations on the storage units to ensure that no consistency conflict exists; the first storage unit is configured to cache the input vectors and computation results of the neurons; the second storage unit is configured to cache the weight data needed by the slave computing module during computation; and the third storage unit is configured to cache the weight gradient data needed by the slave computing module during weight updating.
The present invention also proposes a method of performing a layer-by-layer artificial neural network self-learning operation. The artificial neural network includes a plurality of neurons in two or more layers. The self-learning pre-training of the artificial neural network is performed layer by layer, and for each layer the pre-training is divided into four stages:
In the first stage, the input neuron vector v0 and the weight matrix W undergo a dot-product operation to obtain the local induced field; after a non-linear activation-function transformation, the local induced field is further processed by Gibbs sampling to obtain the first-order hidden layer intermediate value h0;
In the second stage, the transposed weight matrix W^T first undergoes a dot-product operation with the first-order hidden layer intermediate value h0; after a non-linear activation-function transformation, the local induced field is further processed by Gibbs sampling to obtain the first-order visible layer intermediate value v1;
In the third stage, the first-order visible layer intermediate value v1 is taken as input and undergoes a dot-product operation with the weight matrix W to obtain the local induced field; after a non-linear activation-function transformation, the local induced field yields the second-order hidden layer intermediate value h1;
In the fourth stage, the weights are updated according to the following formulas:
W ← W - ε(h0 × v0^T - h1 × v1^T)    (1)
b ← b - ε(h0 - h1)    (2)
c ← c - ε(v0 - v1)    (3)
where the vector b is the bias added to the dot-product of the vector and the weight matrix before the activation function in the first and third stages, the vector c is the corresponding bias in the second stage, "×" denotes the cross-multiplication (outer product) of the vectors, and ε is the learning rate.
Compared with the prior art, the present invention optimizes the instructions for multi-layer neural network pre-training: the processor can complete the pre-training learning of one layer of the neural network with a single instruction, which simplifies the front-end decoding overhead of general-purpose processor instructions. Meanwhile, the present invention includes a master computing module, a plurality of slave computing modules and a large amount of distributed on-chip storage, which relieves the memory-access overhead, so that the neural network pre-training operation can be performed in parallel without frequent off-chip data access. In summary, the performance-per-watt of the present invention is far higher than that of a general-purpose processor.
The present invention can be applied in the following (non-limiting) scenarios: data processing, robots, computers, printers, scanners, telephones, tablet computers, intelligent terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices and other electronic products; aircraft, ships, vehicles and other means of transport; televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas stoves, range hoods and other household appliances; and various medical devices including nuclear magnetic resonance (NMR) machines, B-mode ultrasound scanners and electrocardiographs.
Brief description of the drawings
For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows an example block diagram of the overall structure of an apparatus for performing artificial neural network self-learning pre-training according to an embodiment of the present invention.
Fig. 2 schematically illustrates an H-tree implementation of the interconnection module in the apparatus for performing artificial neural network self-learning pre-training according to an embodiment of the present invention.
Fig. 3 shows an example block diagram of the structure of the master computing module in the apparatus for performing artificial neural network self-learning pre-training according to an embodiment of the present invention.
Fig. 4 shows an example block diagram of the structure of a slave computing module in the apparatus for performing artificial neural network self-learning pre-training according to an embodiment of the present invention.
Fig. 5 shows an example block diagram of the first and third stages of the neural network self-learning pre-training process according to an embodiment of the present invention.
Fig. 6 shows an example block diagram of the second stage of the neural network self-learning pre-training process according to an embodiment of the present invention.
Fig. 7 shows an example flowchart of the fourth stage of the neural network self-learning pre-training process according to an embodiment of the present invention.
Fig. 8 shows an example flowchart of one iteration of single-layer neural network self-learning pre-training according to an embodiment of the present invention.
In all of the figures, identical devices, components and units are denoted by the same reference numerals.
Detailed description of the embodiments
Other aspects, advantages and salient features of the present invention will become apparent to those skilled in the art from the following detailed description of exemplary embodiments of the present invention, taken with reference to the accompanying drawings.
In the present invention, the terms "include" and "comprise" and their derivatives are intended to be inclusive and non-limiting; the term "or" is inclusive, meaning and/or.
In this specification, the various embodiments used below to describe the principles of the present invention are illustrative only and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the exemplary embodiments of the invention as defined by the claims and their equivalents. The description includes various specific details to assist understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and structures are omitted for clarity and conciseness. Throughout the drawings, the same reference numerals are used for the same functions and operations.
In the self-learning pre-training of a multi-layer artificial neural network according to an embodiment of the present invention, the artificial neural network includes a plurality of neurons in two or more layers. The self-learning pre-training of the artificial neural network is performed layer by layer, training from the first layer up to the last layer. For each layer, the pre-training is divided into the following four stages; a software sketch of one complete iteration is given after the stage descriptions:
In the first stage, the input neuron vector v0 first undergoes a dot-product operation with the weight matrix W to obtain the local induced field; after a non-linear activation-function transformation, the local induced field is further processed by Gibbs sampling to obtain the first-order hidden layer intermediate value h0;
In the second stage, the transposed weight matrix W^T first undergoes a dot-product operation with the first-order hidden layer intermediate value h0; after a non-linear activation-function transformation, the local induced field is further processed by Gibbs sampling to obtain the first-order visible layer intermediate value v1;
The third stage is similar to the first stage, except that its input is the first-order visible layer intermediate value v1 and that the second-order hidden layer intermediate value h1 is computed without prior Gibbs sampling;
In the fourth stage, the weights are updated according to the following formulas:
W ← W - ε(h0 × v0^T - h1 × v1^T)    (1)
b ← b - ε(h0 - h1)    (2)
c ← c - ε(v0 - v1)    (3)
where the vector b is the bias added to the dot-product of the vector and the weight matrix before the activation function in the first and third stages, the vector c is the corresponding bias in the second stage, "×" denotes the cross-multiplication (outer product) of the vectors, and ε is the learning rate.
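As a point of reference for the four stages above, the following sketch reproduces one such pre-training iteration for a single layer in NumPy on a general-purpose processor. It only illustrates the arithmetic that the apparatus performs in hardware; the array shapes, the sigmoid activation and the helper names (sigmoid, gibbs_sample, cd1_iteration) are assumptions made for the example, not details taken from the patent.

import numpy as np

def sigmoid(x):
    # activation function applied to the local induced field
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample(p, rng):
    # Gibbs sampling: draw a binary state from the activation probabilities
    return (rng.random(p.shape) < p).astype(p.dtype)

def cd1_iteration(W, b, c, v0, lr, rng):
    # One four-stage iteration; W: (hidden, visible), b: hidden bias, c: visible bias.
    h0 = gibbs_sample(sigmoid(W @ v0 + b), rng)            # stage 1
    v1 = gibbs_sample(sigmoid(W.T @ h0 + c), rng)          # stage 2
    h1 = sigmoid(W @ v1 + b)                               # stage 3, no Gibbs sampling
    W = W - lr * (np.outer(h0, v0) - np.outer(h1, v1))     # formula (1)
    b = b - lr * (h0 - h1)                                 # formula (2)
    c = c - lr * (v0 - v1)                                 # formula (3)
    return W, b, c

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 6))
b, c = np.zeros(4), np.zeros(6)
v0 = rng.random(6)
W, b, c = cd1_iteration(W, b, c, v0, lr=0.01, rng=rng)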
Fig. 1 shows an example block diagram of the overall structure of the apparatus for performing artificial neural network self-learning pre-training according to the present invention. As shown in Fig. 1, the apparatus includes an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a master computing module 5 and a plurality of slave computing modules 6. The instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the master computing module 5 and the slave computing modules 6 may all be implemented by hardware circuits (for example, application-specific integrated circuits, ASICs).
The instruction storage unit 1 reads in instructions through the data access unit 3 and caches the read instructions.
The controller unit 2 reads instructions from the instruction storage unit 1, translates them into control signals that control the behavior of the other modules, and sends the signals to the other modules, such as the data access unit 3, the master computing module 5 and the slave computing modules 6.
The data access unit 3 can access the external address space and directly read and write data to each cache unit inside the apparatus, completing the loading and storing of data.
Fig. 2 schematically illustrates the structure of the interconnection module 4. The interconnection module 4 forms the data path between the master computing module 5 and the plurality of slave computing modules 6, and may have different structures. In one implementation, the interconnection is a binary-tree path composed of multiple nodes: each node passes the data from its upstream node identically to its two downstream nodes, merges the data returned by the two downstream nodes, and returns the result to the upstream node. For example, during the first and third stages of the neural network self-learning operation, the input vector in the master computing module 5 is sent to each slave computing module 6 through the interconnection module 4; after the computation of the slave computing modules 6 is completed, the output neuron values of the slave computing modules are combined stage by stage in the interconnection module into a complete vector composed of local induced fields, which is returned as an intermediate result vector to the master computing module 5 for the activation function and, as required, Gibbs sampling. During the second stage, the first-order hidden layer intermediate value vector h0 in the master computing module 5 is sent to each slave computing module 6 through the interconnection module 4; after the computation of the slave computing modules 6 is completed, the vectors returned by the two downstream nodes are summed into one vector at the current node and returned to the upstream node, and the intermediate result vector is returned to the master computing module 5 for the activation function and Gibbs sampling.
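The gather behaviour of this binary tree can be modelled in software as below. This is only an illustration of the two merge modes just described (component-wise concatenation of the local induced field in the first and third stages, element-wise summation in the second stage); the function names and the number of slave modules are assumptions for the example.

import numpy as np

def tree_merge(parts, combine):
    # Each node merges the results of its two downstream nodes and passes
    # the merged result to its upstream node, stage by stage.
    level = list(parts)
    while len(level) > 1:
        level = [combine(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Stages 1 and 3: every slave returns the local induced field component(s) it owns;
# the tree concatenates them into one complete vector for the master module.
slave_outputs = [np.array([0.3]), np.array([-1.2]), np.array([0.7]), np.array([0.1])]
local_field_13 = tree_merge(slave_outputs, lambda a, b: np.concatenate([a, b]))

# Stage 2: every slave returns a partial vector; the tree sums them element-wise.
partial_sums = [np.full(6, float(i)) for i in range(4)]
local_field_2 = tree_merge(partial_sums, lambda a, b: a + b)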
Fig. 3 shows an example block diagram of the structure of the master computing module 5 in the apparatus for performing the artificial neural network pre-training operation according to the present invention. As shown in Fig. 3, the master computing module 5 includes an arithmetic unit 51, a data dependency judging unit 52 and a storage unit 53.
The storage unit 53 is used to cache the input data and output data used by the master computing module 5 during computation; the arithmetic unit 51 completes the various computing functions of the master computing module 5; and the data dependency judging unit 52 is the port through which the arithmetic unit 51 reads and writes the storage unit 53, ensuring read-write consistency of the data in the storage unit. Specifically, the data dependency judging unit 52 judges whether a dependency exists between the data of a control signal that has not yet been executed and the data of a control signal that is being executed; if not, the group of control signals is allowed to be issued immediately; otherwise, it is issued only after all the control signals it depends on have completed execution. For example, all control signals sent to the data dependency judging unit 52 are stored in an instruction queue inside the data dependency judging unit 52; in this queue, if the range of data read by a read instruction conflicts with the range of data written by a write instruction earlier in the queue, the read instruction can be executed only after the write instruction it depends on has been executed. Meanwhile, the data dependency judging unit 52 is also responsible for sending read data to the slave computing modules through the interconnection module 4, and the output data of the slave computing modules 6 is sent directly to the arithmetic unit 51 through the interconnection module 4. The instructions output by the controller unit 2 are sent to the arithmetic unit 51 and the data dependency judging unit 52 to control their behavior.
Fig. 4 shows an example block diagram of the structure of a slave computing module 6 in the apparatus for performing artificial neural network pre-training according to the present invention. As shown in Fig. 4, each slave computing module 6 includes an arithmetic unit 61, a data dependency judging unit 62, a first storage unit 63, a second storage unit 64 and a third storage unit 65.
The arithmetic unit 61 receives the control signals sent by the controller unit 2 and performs arithmetic and logic operations.
The data dependency judging unit 62 is responsible for the read and write operations on the cache units during computation, and ensures that no consistency conflict exists in reading from and writing to the cache units. For example, all control signals sent to the data dependency judging unit 62 are stored in an instruction queue inside the data dependency judging unit 62; in this queue, if the range of data read by a read instruction conflicts with the range of data written by a write instruction earlier in the queue, the read instruction can be executed only after the write instruction it depends on has been executed.
The first storage unit 63 caches the input neuron vector v0 of each stage, the first-order hidden layer intermediate value h0, the first-order visible layer intermediate value v1, the second-order hidden layer intermediate value h1, and the dot-product results of the input vector and the weight matrix computed in each stage.
The second storage unit 64 caches the weight data needed by this slave computing module 6 during computation. Each slave computing module stores only the part of the weight matrix that corresponds to the scalar data handled by that slave computing module 6.
The third storage unit 65 caches the weight gradient data needed by the slave computing module during weight updating. The weight gradient data stored by each slave computing module 6 corresponds to the weight data it stores.
The slave computing modules 6 implement the first half of the pipeline in the first three stages of the artificial neural network self-learning pre-training, as well as the weight update of formula (1) in the last stage.
Taking the pre-training of an artificial neural network deep belief network (DBN) as an example, in the first three stages the multiplication of the weight matrix W (or its transpose W^T) with the input neuron vector can be divided into independent, parallel computing subtasks. In the first and third stages, each slave computing module 6 uses the same input vector but the weights corresponding to different components of the output vector to perform dot-product multiplications, obtaining the partial sums corresponding to different components of the output vector; after repeated accumulation, each slave computing module obtains its corresponding output component, and these components are combined stage by stage in the interconnection module 4 into a complete local induced field vector. Each slave computing module 6 only needs to compute the local induced field of the output neuron value corresponding to that module. The different local induced field components are combined stage by stage in the interconnection module 4 into a complete local induced field vector, which is transmitted to the master computing module for the activation function and subsequent sampling. In the second stage, each slave computing module 6 computes the product of the corresponding component scalar of the input first-order hidden layer intermediate value vector h0 and the corresponding row of the weight matrix; each resulting output vector is a partial sum of the final result to be accumulated, and these partial sums are added pairwise, stage by stage, in the interconnection module to obtain the final result. Each slave computing module 6 thus computes a partial sum of the local induced field of the output first-order visible layer vector, and the summation of all partial sums is completed in the interconnection module 4 to obtain the final local induced field. The intermediate values computed in the first three stages are used to update the weights, and the master computing module 5 performs subsequent operations based on the outputs of the first three stages to obtain the weight update values. In the last stage, the weight update performed by the slave computing modules 6 according to formula (1) can likewise be divided into three sub-steps:
1. Each slave computing module 6 computes the product of the input first-order hidden layer intermediate value vector h0 and the corresponding component scalar of the input neuron vector, obtaining an intermediate value;
2. Each slave computing module 6 computes the product of the input hidden layer intermediate value vector h1 and the corresponding component scalar of the first-order visible layer vector v1, and computes the difference between the intermediate value vector of the first sub-step and this product;
3. Each slave computing module 6 computes the product of the difference obtained in the second sub-step and the learning rate to obtain the weight update value, and then performs a vector subtraction with the weight W to obtain the updated weight.
It is worth noting that the above three sub-steps are only one example of how a slave computing module 6 updates the weights; the practitioner may fine-tune the details, for example by exchanging the product computation of the first sub-step with that of the second sub-step, or by moving the multiplication by the learning rate in the third sub-step forward into the second sub-step, or even splitting it across the first two sub-steps. An illustrative sketch of the three sub-steps is given below.
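As an illustration of these three sub-steps, the sketch below shows what a single slave computing module contributes when it holds one column of a (hidden x visible) weight matrix W; the column-per-slave convention, the argument layout and the function name are assumptions made for the example rather than details fixed by the patent.

import numpy as np

def slave_update_column(W_col, h0, v0_j, h1, v1_j, lr):
    step1 = h0 * v0_j            # sub-step 1: h0 times the corresponding input scalar
    diff = step1 - h1 * v1_j     # sub-step 2: subtract h1 times the visible-layer scalar
    return W_col - lr * diff     # sub-step 3: scale by the learning rate, update the column

rng = np.random.default_rng(1)
W_col = 0.1 * rng.standard_normal(4)     # the column cached in the second storage unit
h0, h1 = rng.random(4), rng.random(4)    # hidden-layer intermediate values of this iteration
new_col = slave_update_column(W_col, h0, v0_j=0.8, h1=h1, v1_j=0.5, lr=0.01)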
According to an embodiment of the present invention, an instruction set for performing the artificial neural network forward operation on the aforementioned apparatus is also provided. The instruction set includes a CONFIG instruction, a COMPUTE instruction, an IO instruction, a NOP instruction, a JUMP instruction and a MOVE instruction, each described below and summarised in the sketch that follows this list:
The CONFIG instruction configures, before the computation of each layer of the artificial neural network starts, the various constants needed by the current layer's computation;
The COMPUTE instruction completes the arithmetic and logic computation of each layer of the artificial neural network;
The IO instruction reads in, from the external address space, the input data needed by the computation, and stores data back to the external space after the computation is completed;
The NOP instruction is responsible for clearing the control signals currently held in all internal control-signal cache queues, ensuring that all instructions before the NOP instruction have finished; the NOP instruction itself does not contain any operation;
The JUMP instruction is responsible for jumping the address of the next instruction to be read by the controller from the instruction storage unit, in order to implement jumps in the control flow;
The MOVE instruction is responsible for carrying data at a certain address in the apparatus's internal address space to another address in the internal address space; this process is independent of the arithmetic unit and does not occupy its resources during execution.
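The role of the six instruction types, and the order in which a single-layer pre-training iteration issues them (compare steps S1 to S11 below), can be summarised as follows; the enum representation and the stage tags are illustrative only, since the patent does not fix an encoding.

from enum import Enum, auto

class Opcode(Enum):
    CONFIG = auto()   # configure the constants needed by the current layer/stage
    COMPUTE = auto()  # perform the arithmetic/logic computation of one stage
    IO = auto()       # load data from, or store data to, the external address space
    NOP = auto()      # drain all pending control-signal queues; carries no operation itself
    JUMP = auto()     # redirect the controller's next instruction address (control flow)
    MOVE = auto()     # copy data between internal addresses without using the arithmetic unit

# One single-layer pre-training iteration as an instruction stream:
program = [
    (Opcode.IO, "load instructions"), (Opcode.IO, "load master data"), (Opcode.IO, "load weights"),
    (Opcode.CONFIG, "stage 1"), (Opcode.COMPUTE, "stage 1"),
    (Opcode.CONFIG, "stage 2"), (Opcode.COMPUTE, "stage 2"),
    (Opcode.CONFIG, "stage 3"), (Opcode.COMPUTE, "stage 3"),
    (Opcode.COMPUTE, "stage 4: weight update"),
]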
Fig. 5 shows an example block diagram of the first and third stages of the neural network self-learning pre-training process according to an embodiment of the present invention. In the different slave computing modules 6, the input vector broadcast by the interconnection module 4 undergoes a dot-product operation with the weight vector of each slave computing module 6, giving the partial sum of the local induced field of the corresponding output neuron value; all these output local induced field values form an intermediate result vector. The intermediate result vector, after a bias vector is added and an activation operation is applied, gives the final output neuron vector of this layer of the neural network; the formula is out = f(w*in + b), where out is the output vector, in is the input vector, b is the bias vector, w is the weight matrix and f is the activation function. The weight vector of each slave computing module 6 is the column vector of the weight matrix corresponding to that slave computing module 6. The interconnection module 4 sends the input vector [I0, ..., In] to all the slave arithmetic units, where it is temporarily stored in the first storage unit. The i-th slave arithmetic unit computes the dot-product of its corresponding weight vector [Wi0, ..., Win] and the input vector. The results output by the slave arithmetic units are combined by the interconnection module 4 into a complete local induced field vector and returned to the master arithmetic unit 5, where the activation function operation and possibly Gibbs sampling are performed, yielding the final output vector [O0, O1, ..., On].
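A minimal software model of this first/third-stage data flow is given below, assuming one weight vector [Wi0, ..., Win] per slave module; the broadcast, the per-slave dot product and the tree combination are collapsed into plain NumPy, and the function names are illustrative.

import numpy as np

def stage_1_or_3(weight_vectors, in_vec, bias, activation, sample=None):
    # The interconnection module broadcasts the input vector to every slave module;
    # slave i computes the dot product with its own weight vector [Wi0, ..., Win].
    partial_fields = [w_i @ in_vec for w_i in weight_vectors]
    # The interconnection module combines the per-slave results into the complete
    # local induced field vector and returns it to the master computing module.
    local_field = np.array(partial_fields)
    # The master module adds the bias, applies the activation and optionally samples.
    out = activation(local_field + bias)
    return sample(out) if sample is not None else out

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(0)
weight_vectors = [rng.standard_normal(6) for _ in range(4)]   # one per slave module
h0 = stage_1_or_3(weight_vectors, in_vec=rng.random(6), bias=np.zeros(4), activation=sigmoid,
                  sample=lambda p: (rng.random(p.shape) < p).astype(float))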
Fig. 6 shows an example block diagram of the second stage of the neural network self-learning pre-training process according to an embodiment of the present invention. In the process of computing the output first-order visible layer vector v1, the interconnection module 4 broadcasts the first-order hidden layer vector value; each slave computing module 6 takes the product of the corresponding component scalar h0i of h0 and the corresponding row [Wi0, ..., Win] of the weight matrix W, and each resulting output vector is one of the partial sums, to be accumulated, of the local induced field of the first-order visible layer vector. These partial sums are added pairwise, stage by stage, in the interconnection module 4 to obtain the final local induced field. The computed local induced field is returned to the master arithmetic unit 5, where the activation function operation and possibly Gibbs sampling are performed, yielding the final output first-order visible layer vector v1.
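For contrast with the first and third stages, the second-stage decomposition can be modelled as below: every slave multiplies its scalar h0i by its weight vector, and the partial vectors are accumulated on the way back through the interconnection module; again the names are illustrative, not taken from the patent.

import numpy as np

def stage_2(weight_vectors, h0, bias_c, activation, sample):
    # Slave i computes h0[i] * [Wi0, ..., Win]; each result is a partial sum of the
    # visible-layer local induced field, accumulated pairwise in the interconnection module.
    local_field = sum(h0[i] * w_i for i, w_i in enumerate(weight_vectors))
    # The master module adds the visible-layer bias, activates and performs Gibbs sampling.
    return sample(activation(local_field + bias_c))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(0)
weight_vectors = [rng.standard_normal(6) for _ in range(4)]
v1 = stage_2(weight_vectors, h0=rng.random(4), bias_c=np.zeros(6), activation=sigmoid,
             sample=lambda p: (rng.random(p.shape) < p).astype(float))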
Fig. 7 shows a flowchart of the fourth stage of the neural network self-learning pre-training process according to an embodiment of the present invention. In the last stage, the weight update performed by the slave computing modules 6 according to formula (1) can likewise be divided into three sub-steps:
1. Each slave computing module 6 computes the product of the input first-order hidden layer intermediate value vector h0 and the corresponding component scalar of the input neuron vector, and caches the intermediate value to the third storage unit shown in Fig. 4; this sub-step is similar to the block diagram of the second stage shown in Fig. 6, except that its inputs are the first-order hidden layer intermediate value vector h0 and the input neuron vector;
2. Each slave computing module 6 computes the product of the input hidden layer intermediate value vector h1 and the corresponding component scalar of the first-order visible layer vector v1, computes the difference from the intermediate value vector of the first sub-step, and caches it to the third storage unit shown in Fig. 4;
3. Each slave computing module 6 computes the product of the difference obtained in the second sub-step and the learning rate to obtain the weight update value, and then performs a vector subtraction with the weight W to obtain the updated weight.
It is worth noting that the above three sub-steps are only one example of how a slave computing module 6 updates the weights; the practitioner may fine-tune the details, for example by exchanging the product computation of the first sub-step with that of the second sub-step, or by moving the multiplication by the learning rate in the third sub-step forward into the second sub-step, or even splitting it across the first two sub-steps.
Fig. 8 shows a flowchart of a single-layer artificial neural network self-learning pre-training operation according to one embodiment. Since the self-learning pre-training of a multi-layer artificial neural network can be carried out in a layer-by-layer training manner, the pre-training of a multi-layer artificial neural network can be implemented by invoking this flow multiple times. The flowchart describes the process of using the apparatus and instruction set of the present invention to implement the single-layer neural network self-learning pre-training operation shown in Fig. 4.
In step S1, an IO instruction is pre-stored at the first address of the instruction storage unit 1.
In step S2, the operation starts: the controller unit 2 reads this IO instruction from the first address of the instruction storage unit 1, and according to the decoded control signal, the data access unit 3 reads all the corresponding artificial neural network operation instructions from the external address space and caches them in the instruction storage unit 1.
In step S3, the controller unit 2 then reads in the next IO instruction from the instruction storage unit; according to the decoded control signal, the data access unit 3 reads all the data needed by the master computing module 5 (including, for example, the input neuron vector v0, the activation function interpolation table, the learning rate and the biases) from the external address space into the storage unit 53 of the master computing module 5.
In step S4, the controller unit 2 then reads in the next IO instruction from the instruction storage unit; according to the decoded control signal, the data access unit 3 reads the weight matrix data needed by the slave computing modules 6 from the external address space.
In step S5, the controller unit 2 then reads in the next CONFIG instruction from the instruction storage unit; according to the decoded control signal, the apparatus configures the various constants needed by the first-stage computation of this layer of the neural network. For example, the arithmetic units 51 and 61 configure the values of their internal registers according to the parameters in the control signal; the parameters include, for example, the precision setting of this layer's computation and the activation function data.
In step S6, the controller unit 2 then reads in the next COMPUTE instruction from the instruction storage unit; according to the decoded control signal, the first-stage computation starts. The master computing module 5 first sends the input neuron vector v0 to each slave computing module 6 through the interconnection module 4, where it is saved to the first storage unit 63 of the slave computing module 6. The arithmetic unit 61 of each slave computing module 6 reads the weight vector (the column vector of the weight matrix corresponding to that slave computing module 6) from the second storage unit 64, reads the input neuron vector v0 from the first storage unit, completes the dot-product operation of the weight vector and the input neuron vector v0, and returns the intermediate result through the interconnection module. In the interconnection module 4, the intermediate results returned by the slave computing modules 6 are combined stage by stage into a complete local induced field vector. The master computing module 5 obtains the return value of the interconnection module 4, reads the bias vector from the storage unit 53 according to the control signal decoded from the COMPUTE instruction, adds it to the vector returned by the interconnection module 4, applies the activation to the sum, performs Gibbs sampling, and writes the final first-order hidden layer vector h0 back to the storage unit 53.
In step S7, the controller unit 2 then reads in the next CONFIG instruction from the instruction storage unit; according to the decoded control signal, the apparatus configures the various constants needed by the second-stage computation of this layer of the neural network.
In step S8, the controller unit 2 then reads in the next COMPUTE instruction from the instruction storage unit; according to the decoded control signal, the second-stage computation starts. The master computing module 5 first sends the first-order hidden layer vector h0 to each slave computing module 6 through the interconnection module 4, where it is saved to the first storage unit 63 of the slave computing module 6. The arithmetic unit 61 of each slave computing module 6 reads the weight vector (the column vector of the weight matrix corresponding to that slave computing module 6) from the second storage unit 64, takes the corresponding scalar of the first-order hidden layer vector h0 from the first storage unit, completes the product operation of the weight vector and the corresponding scalar of h0, and returns the intermediate result through the interconnection module. In the interconnection module 4, the intermediate results returned by the slave computing modules 6 are summed stage by stage into a complete local induced field vector. The master computing module 5 obtains the return value of the interconnection module 4, reads the bias vector from the storage unit 53 according to the control signal decoded from the COMPUTE instruction, adds it to the vector returned by the interconnection module 4, applies the activation to the sum, performs Gibbs sampling, and writes the final first-order visible layer vector v1 back to the storage unit 53.
In step S9, the controller unit 2 then reads in the next CONFIG instruction from the instruction storage unit; according to the decoded control signal, the apparatus configures the various constants needed by the third-stage computation of this layer of the neural network. This configuration is basically identical to that of the first stage, except that one additional parameter, the learning rate, needs to be configured.
In step S10, the controller unit 2 then reads in the next COMPUTE instruction from the instruction storage unit; according to the decoded control signal, the third-stage computation starts. The master computing module 5 first sends the first-order visible layer vector v1 to each slave computing module 6 through the interconnection module 4, where it is saved to the first storage unit 63 of the slave computing module 6. Each slave computing module reads the first-order visible layer vector v1 from the first storage unit, completes the dot-product operation of its weight vector and the first-order visible layer vector v1, and returns the intermediate result through the interconnection module. In the interconnection module 4, the intermediate results returned by the slave computing modules 6 are combined stage by stage into a complete local induced field vector. The master computing module 5 obtains the return value of the interconnection module 4, reads the bias vector from the storage unit 53 according to the control signal decoded from the COMPUTE instruction, adds it to the vector returned by the interconnection module 4, applies the activation to the sum, and writes the final second-order hidden layer vector h1 back to the storage unit 53.
In step S11, the controller unit 2 then reads in the next COMPUTE instruction from the instruction storage unit; according to the decoded control signal, the fourth-stage computation starts. In the first sub-step, the master computing module 5 first sends the input neuron vector and the first-order hidden layer vector to each slave computing module 6 through the interconnection module 4, where they are saved to the weight gradient cache unit 65 of the slave computing module 6. In the second sub-step, the arithmetic unit 61 of each slave computing module 6 reads the hidden layer vector from the first storage unit together with the corresponding component of the input neuron vector, completes the product operation of the hidden layer vector and the corresponding component of the input neuron vector, performs a vector subtraction between this intermediate result and the intermediate value of the previous sub-step read from the weight gradient cache unit 65, and caches the computed intermediate result to the weight gradient cache unit 65. In the last sub-step, the arithmetic unit 61 of each slave computing module 6 reads the intermediate value of the previous sub-step from the weight gradient cache unit 65, multiplies it by the learning rate to obtain the weight update value, reads the corresponding weight from the weight cache unit 64, performs a vector subtraction between the weight and the weight update value to obtain the updated weight, and caches it back to the weight cache unit 64. In this way, one self-learning pre-training iteration of the single-layer neural network is completed; after multiple iterations of learning, when the weights reach a certain convergence criterion (the weight update value is smaller than a certain threshold), the pre-training of the single-layer neural network ends, and the pre-training of the next layer of the neural network can begin.
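Putting steps S1 to S11 together, the whole single-layer pre-training loop can be sketched as follows; it reuses the cd1_iteration helper from the earlier sketch, and the convergence test on the largest weight change as well as the random choice of input vector are assumptions made for the example.

import numpy as np

def pretrain_layer(W, b, c, inputs, lr, threshold, rng, max_iters=1000):
    # Repeat the four-stage iteration until the weight update falls below the threshold.
    for _ in range(max_iters):
        v0 = inputs[rng.integers(len(inputs))]        # one input neuron vector
        W_new, b, c = cd1_iteration(W, b, c, v0, lr, rng)
        if np.max(np.abs(W_new - W)) < threshold:     # convergence criterion on the update
            return W_new, b, c
        W = W_new
    return W, b, c

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 6))
inputs = [rng.random(6) for _ in range(10)]
W, b, c = pretrain_layer(W, np.zeros(4), np.zeros(6), inputs, lr=0.01, threshold=1e-4, rng=rng)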
By using the apparatus and instruction set for performing the artificial neural network self-learning pre-training operation, the problems of insufficient CPU and GPU computational performance and large front-end decoding overhead are solved, and support for the multi-layer artificial neural network forward operation is effectively improved.
By using dedicated on-chip caches for the multi-layer artificial neural network forward operation, the reusability of the input neurons and the weight data is fully exploited, repeated reading of these data from memory is avoided, the memory access bandwidth is reduced, and the memory bandwidth is prevented from becoming a bottleneck of the multi-layer artificial neural network forward operation performance.
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination of both. Although the processes or methods are described above in terms of certain sequential operations, it should be understood that some of the described operations may be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the present invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made to each embodiment without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (6)

1. An apparatus for performing an artificial neural network self-learning operation, comprising an instruction storage unit, a controller unit, a data access unit, an interconnection module, a master computing module and a plurality of slave computing modules, wherein:
the instruction storage unit is configured to read in instructions through the data access unit and to cache the read instructions;
the controller unit is configured to read instructions from the instruction storage unit, to decode each instruction into control signals that control the behavior of the interconnection module, the master computing module and the slave computing modules, and then to distribute the respective control signals to the respective modules;
the data access unit is configured to access an external address space and to complete the loading and storing of data;
the interconnection module, which may have different topological implementations, is configured to distribute the input vector of the master computing module to the plurality of slave computing modules, and to merge the computation results of the slave computing modules before returning them to the master computing module;
the master computing module is configured to apply an activation function and Gibbs sampling to the intermediate values returned by the interconnection module, and to update the bias of the activation function;
each slave computing module is configured to perform the dot-product operation between the input vector and the corresponding weight matrix, the product operation between the corresponding component scalar of the input vector and the corresponding weight matrix, and the updating of the weight matrix.
2. The apparatus for performing an artificial neural network self-learning operation according to claim 1, wherein the master computing module comprises an arithmetic unit, a data dependency judging unit and a storage unit, wherein
the storage unit is configured to cache the input data and output data used by the master computing module during computation,
the arithmetic unit is configured to complete the computations of the master computing module, and
the data dependency judging unit is the port through which the arithmetic unit reads and writes the storage unit, and is configured to ensure read-write consistency of the data in the storage unit.
3. The apparatus for performing an artificial neural network self-learning operation according to claim 2, wherein the data dependency judging unit is configured to judge whether a dependency exists between the data of a control signal that has not yet been executed and the data of a control signal that is being executed, and if not, to allow the group of control signals to be issued immediately, and otherwise to allow the group of control signals to be issued only after all the control signals on which it depends have completed execution.
4. The apparatus for performing an artificial neural network self-learning operation according to claim 3, wherein the data dependency judging unit is further configured to send read data to the slave computing modules through the interconnection module.
5. The apparatus for performing an artificial neural network self-learning operation according to claim 1, wherein each slave computing module comprises an arithmetic unit, a data dependency judging unit, a first storage unit, a second storage unit and a third storage unit, wherein
the arithmetic unit is configured to receive the control signals sent by the controller unit and to perform arithmetic and logic operations;
the data dependency judging unit is configured to monitor read and write operations on the storage units to ensure that no consistency conflict exists in reading from and writing to the storage units;
the first storage unit is configured to cache the input vectors and computation results of the neurons;
the second storage unit is configured to cache the weight data needed by the slave computing module during computation;
the third storage unit is configured to cache the weight gradient data needed by the slave computing module during weight updating.
6. A method of performing a layer-by-layer artificial neural network self-learning operation, the artificial neural network comprising a plurality of neurons in two or more layers, the self-learning pre-training of the artificial neural network being performed layer by layer, wherein for each layer the pre-training is divided into four stages:
in the first stage, the input neuron vector v0 and the weight matrix W undergo a dot-product operation to obtain the local induced field, and the local induced field, after a non-linear activation-function transformation, is further processed by Gibbs sampling to obtain the first-order hidden layer intermediate value h0;
in the second stage, the transposed weight matrix W^T and the first-order hidden layer intermediate value h0 first undergo a dot-product operation, and the local induced field, after a non-linear activation-function transformation, is further processed by Gibbs sampling to obtain the first-order visible layer intermediate value v1;
in the third stage, the first-order visible layer intermediate value v1 and the weight matrix W undergo a dot-product operation to obtain the local induced field, and the local induced field, after a non-linear activation-function transformation, yields the second-order hidden layer intermediate value h1;
in the fourth stage, the weights are updated according to the following formulas:
W ← W - ε(h0 × v0^T - h1 × v1^T)    (1)
b ← b - ε(h0 - h1)    (2)
c ← c - ε(v0 - v1)    (3)
wherein the vector b is the bias added to the dot-product of the vector and the weight matrix before the activation function in the first and third stages, the vector c is the corresponding bias in the second stage, "×" denotes the cross-multiplication (outer product) of the vectors, and ε is the learning rate.
CN201610267211.0A 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation Active CN107316078B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610267211.0A CN107316078B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation
CN201910402047.3A CN110188870B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610267211.0A CN107316078B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201910402047.3A Division CN110188870B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation

Publications (2)

Publication Number Publication Date
CN107316078A true CN107316078A (en) 2017-11-03
CN107316078B CN107316078B (en) 2021-05-07

Family

ID=60185046

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910402047.3A Active CN110188870B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation
CN201610267211.0A Active CN107316078B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201910402047.3A Active CN110188870B (en) 2016-04-27 2016-04-27 Apparatus and method for performing artificial neural network self-learning operation

Country Status (1)

Country Link
CN (2) CN110188870B (en)

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108189A (en) * 2017-12-15 2018-06-01 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN108364065A (en) * 2018-01-19 2018-08-03 上海兆芯集成电路有限公司 Adopt the microprocessor of booth multiplication
CN108710958A (en) * 2018-05-16 2018-10-26 北京旋极信息技术股份有限公司 A kind of prediction health control method and device, computer readable storage medium
CN108763360A (en) * 2018-05-16 2018-11-06 北京旋极信息技术股份有限公司 A kind of sorting technique and device, computer readable storage medium
CN108859477A (en) * 2018-07-05 2018-11-23 吉林工程技术师范学院 A kind of children's literature book binder and its control method
CN109542837A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109754062A (en) * 2017-11-07 2019-05-14 上海寒武纪信息科技有限公司 The execution method and Related product of convolution extended instruction
CN109784125A (en) * 2017-11-10 2019-05-21 福州瑞芯微电子股份有限公司 Deep learning network processing device, method and image processing unit
CN109902811A (en) * 2017-12-11 2019-06-18 北京中科寒武纪科技有限公司 Neural network computing device and method
CN109919313A (en) * 2019-01-31 2019-06-21 华为技术有限公司 A kind of method and distribution training system of gradient transmission
CN109961136A (en) * 2017-12-14 2019-07-02 北京中科寒武纪科技有限公司 Integrated circuit chip device and Related product
CN109978160A (en) * 2019-03-25 2019-07-05 北京中科寒武纪科技有限公司 Configuration device, method and the Related product of artificial intelligence process device
CN109993289A (en) * 2017-12-30 2019-07-09 北京中科寒武纪科技有限公司 Integrated circuit chip device and Related product
CN109993290A (en) * 2017-12-30 2019-07-09 北京中科寒武纪科技有限公司 Integrated circuit chip device and Related product
CN110059809A (en) * 2018-10-10 2019-07-26 北京中科寒武纪科技有限公司 A kind of computing device and Related product
CN110147249A (en) * 2018-02-12 2019-08-20 上海寒武纪信息科技有限公司 A kind of calculation method and device of network model
CN110163350A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method
CN110163361A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method
CN110163349A (en) * 2018-02-12 2019-08-23 上海寒武纪信息科技有限公司 A kind of calculation method and device of network model
CN110196734A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 A kind of computing device and Related product
CN110197270A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110197271A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110197269A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110197272A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110197268A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110197273A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110276447A (en) * 2018-03-14 2019-09-24 上海寒武纪信息科技有限公司 A kind of computing device and method
CN110383300A (en) * 2018-02-13 2019-10-25 上海寒武纪信息科技有限公司 A kind of computing device and method
CN110472734A (en) * 2018-05-11 2019-11-19 上海寒武纪信息科技有限公司 A kind of computing device and Related product
CN110806903A (en) * 2018-08-01 2020-02-18 珠海格力电器股份有限公司 Configuration parameter determining method and device of electric cooker
CN111047045A (en) * 2018-10-12 2020-04-21 中科寒武纪科技股份有限公司 Distribution system and method for machine learning operation
CN111079908A (en) * 2018-10-18 2020-04-28 上海寒武纪信息科技有限公司 Network-on-chip data processing method, storage medium, computer device and apparatus
CN111178492A (en) * 2018-11-09 2020-05-19 中科寒武纪科技股份有限公司 Computing device, related product and computing method for executing artificial neural network model
CN111260046A (en) * 2018-11-30 2020-06-09 上海寒武纪信息科技有限公司 Operation method, device and related product
CN111258641A (en) * 2018-11-30 2020-06-09 上海寒武纪信息科技有限公司 Operation method, device and related product
CN111461340A (en) * 2020-03-10 2020-07-28 北京百度网讯科技有限公司 Weight matrix updating method and device and electronic equipment
CN112329619A (en) * 2020-11-04 2021-02-05 济南博观智能科技有限公司 Face recognition method and device, electronic equipment and readable storage medium
CN112805727A (en) * 2018-10-08 2021-05-14 深爱智能科技有限公司 Artificial neural network operation acceleration device for distributed processing, artificial neural network acceleration system using same, and method for accelerating artificial neural network
CN114071781A (en) * 2021-11-16 2022-02-18 杭州电子科技大学 Wireless local area network medium access control method
US11620130B2 (en) 2018-02-13 2023-04-04 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
US11630666B2 (en) 2018-02-13 2023-04-18 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
US11651202B2 (en) 2017-12-30 2023-05-16 Cambricon Technologies Corporation Limited Integrated circuit chip device and related product
US11704544B2 (en) 2017-12-30 2023-07-18 Cambricon Technologies Corporation Limited Integrated circuit chip device and related product
US11797467B2 (en) 2018-10-18 2023-10-24 Shanghai Cambricon Information Technology Co., Ltd. Data processing device with transmission circuit
US11847554B2 (en) 2019-04-18 2023-12-19 Cambricon Technologies Corporation Limited Data processing method and related products
CN113807510B (en) * 2017-12-30 2024-05-10 中科寒武纪科技股份有限公司 Integrated circuit chip device and related products

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111080400B (en) * 2019-11-25 2023-04-18 中山大学 Commodity recommendation method and system based on gate control graph convolution network and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732274A (en) * 2015-03-10 2015-06-24 华南理工大学 Intelligent computer
CN104757992A (en) * 2015-03-16 2015-07-08 广东工业大学 Heart sound diagnosis system and diagnosis method based on a deep belief network
CN105117706A (en) * 2015-08-28 2015-12-02 小米科技有限责任公司 Image processing method and apparatus and character recognition method and apparatus
CN105447569A (en) * 2015-12-18 2016-03-30 北京柏惠维康科技有限公司 Breast cancer cell characteristic analysis system based on deep learning
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729678B (en) * 2013-12-12 2016-10-05 中国科学院信息工程研究所 Internet water-army (spammer) detection method and system based on an improved DBN model
CN104157290B (en) * 2014-08-19 2017-10-24 大连理工大学 Speaker recognition method based on deep learning
CN104182772B (en) * 2014-08-19 2017-10-24 大连理工大学 Gesture recognition method based on deep learning
CN105184366B (en) * 2015-09-15 2018-01-09 中国科学院计算技术研究所 Time-multiplexed general-purpose neural network processor

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732274A (en) * 2015-03-10 2015-06-24 华南理工大学 Intelligent computer
CN104757992A (en) * 2015-03-16 2015-07-08 广东工业大学 Heart sound diagnosis system and diagnosis method based on a deep belief network
CN105117706A (en) * 2015-08-28 2015-12-02 小米科技有限责任公司 Image processing method and apparatus and character recognition method and apparatus
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm
CN105447569A (en) * 2015-12-18 2016-03-30 北京柏惠维康科技有限公司 Breast cancer cell characteristic analysis system based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUNJI CHEN et al.: "DaDianNao: A Machine-Learning Supercomputer", International Symposium on Microarchitecture *

Cited By (105)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754062A (en) * 2017-11-07 2019-05-14 上海寒武纪信息科技有限公司 Execution method of convolution expansion instruction and related product
CN109754062B (en) * 2017-11-07 2024-05-14 上海寒武纪信息科技有限公司 Execution method of convolution expansion instruction and related product
CN109784125A (en) * 2017-11-10 2019-05-21 福州瑞芯微电子股份有限公司 Deep learning network processing device, method and image processing unit
CN109902814A (en) * 2017-12-11 2019-06-18 北京中科寒武纪科技有限公司 Neural network computing module and method
CN109902814B (en) * 2017-12-11 2020-01-17 中科寒武纪科技股份有限公司 Neural network operation module and method
CN111738431B (en) * 2017-12-11 2024-03-05 中科寒武纪科技股份有限公司 Neural network computing device and method
US11657258B2 (en) 2017-12-11 2023-05-23 Cambricon Technologies Corporation Limited Neural network calculation apparatus and method
CN109902811A (en) * 2017-12-11 2019-06-18 北京中科寒武纪科技有限公司 Neural network computing device and method
US11803735B2 (en) 2017-12-11 2023-10-31 Cambricon Technologies Corporation Limited Neural network calculation apparatus and method
CN111738431A (en) * 2017-12-11 2020-10-02 中科寒武纪科技股份有限公司 Neural network operation device and method
CN109961136B (en) * 2017-12-14 2020-05-19 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN109961136A (en) * 2017-12-14 2019-07-02 北京中科寒武纪科技有限公司 Integrated circuit chip device and Related product
CN108108189A (en) * 2017-12-15 2018-06-01 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN109993290A (en) * 2017-12-30 2019-07-09 北京中科寒武纪科技有限公司 Integrated circuit chip device and Related product
CN113807510A (en) * 2017-12-30 2021-12-17 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
US11734548B2 (en) 2017-12-30 2023-08-22 Cambricon Technologies Corporation Limited Integrated circuit chip device and related product
US11710031B2 (en) 2017-12-30 2023-07-25 Cambricon Technologies Corporation Limited Parallel processing circuits for neural networks
CN113807510B (en) * 2017-12-30 2024-05-10 中科寒武纪科技股份有限公司 Integrated circuit chip device and related products
US11651202B2 (en) 2017-12-30 2023-05-16 Cambricon Technologies Corporation Limited Integrated circuit chip device and related product
CN109993289A (en) * 2017-12-30 2019-07-09 北京中科寒武纪科技有限公司 Integrated circuit chip device and Related product
CN109993290B (en) * 2017-12-30 2021-08-06 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
US11704544B2 (en) 2017-12-30 2023-07-18 Cambricon Technologies Corporation Limited Integrated circuit chip device and related product
CN109993289B (en) * 2017-12-30 2021-09-21 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN108364065B (en) * 2018-01-19 2020-09-11 上海兆芯集成电路有限公司 Microprocessor for booth multiplication
CN108364065A (en) * 2018-01-19 2018-08-03 上海兆芯集成电路有限公司 Microprocessor for Booth multiplication
CN110163349A (en) * 2018-02-12 2019-08-23 上海寒武纪信息科技有限公司 A kind of calculation method and device of network model
CN110147249B (en) * 2018-02-12 2021-02-09 上海寒武纪信息科技有限公司 Network model calculation method and device
CN110163349B (en) * 2018-02-12 2021-03-23 上海寒武纪信息科技有限公司 Network model calculation method and device
CN110147249A (en) * 2018-02-12 2019-08-20 上海寒武纪信息科技有限公司 A kind of calculation method and device of network model
CN110163363A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method
US11704125B2 (en) 2018-02-13 2023-07-18 Cambricon (Xi'an) Semiconductor Co., Ltd. Computing device and method
US11663002B2 (en) 2018-02-13 2023-05-30 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
US11709672B2 (en) 2018-02-13 2023-07-25 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
CN110383300A (en) * 2018-02-13 2019-10-25 上海寒武纪信息科技有限公司 A kind of computing device and method
US11720357B2 (en) 2018-02-13 2023-08-08 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
CN110163362B (en) * 2018-02-13 2020-12-11 上海寒武纪信息科技有限公司 Computing device and method
US11630666B2 (en) 2018-02-13 2023-04-18 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
US11620130B2 (en) 2018-02-13 2023-04-04 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
CN110163362A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method
CN110163357A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method
CN110163360A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method
CN110163356A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method
CN110163360B (en) * 2018-02-13 2021-06-25 上海寒武纪信息科技有限公司 Computing device and method
CN110163361B (en) * 2018-02-13 2021-06-25 上海寒武纪信息科技有限公司 Computing device and method
CN110163350B (en) * 2018-02-13 2021-06-08 上海寒武纪信息科技有限公司 Computing device and method
CN110163363B (en) * 2018-02-13 2021-05-11 上海寒武纪信息科技有限公司 Computing device and method
CN110163361A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method
CN110163350A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method
US11740898B2 (en) 2018-02-13 2023-08-29 Shanghai Cambricon Information Technology Co., Ltd Computing device and method
CN110163356B (en) * 2018-02-13 2020-10-09 上海寒武纪信息科技有限公司 Computing device and method
CN110383300B (en) * 2018-02-13 2024-03-05 上海寒武纪信息科技有限公司 Computing device and method
CN110197270A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110197271A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110197270B (en) * 2018-02-27 2020-10-30 上海寒武纪信息科技有限公司 Integrated circuit chip device and related product
CN110197268A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN111767998A (en) * 2018-02-27 2020-10-13 上海寒武纪信息科技有限公司 Integrated circuit chip device and related product
CN111767998B (en) * 2018-02-27 2024-05-14 上海寒武纪信息科技有限公司 Integrated circuit chip device and related products
CN111767996A (en) * 2018-02-27 2020-10-13 上海寒武纪信息科技有限公司 Integrated circuit chip device and related product
CN110197273A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110197273B (en) * 2018-02-27 2020-08-25 上海寒武纪信息科技有限公司 Integrated circuit chip device and related product
CN110197269A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
TWI786255B (en) * 2018-02-27 2022-12-11 大陸商寒武紀(西安)集成電路有限公司 Integrated circuit chip device, chip, intelligent device, and computing method of neural network
CN110197272A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110196734A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 A kind of computing device and Related product
CN111767996B (en) * 2018-02-27 2024-03-05 上海寒武纪信息科技有限公司 Integrated circuit chip device and related products
CN110197271B (en) * 2018-02-27 2020-10-27 上海寒武纪信息科技有限公司 Integrated circuit chip device and related product
CN111626413A (en) * 2018-03-14 2020-09-04 上海寒武纪信息科技有限公司 Computing device and method
CN110276447A (en) * 2018-03-14 2019-09-24 上海寒武纪信息科技有限公司 A kind of computing device and method
CN110472734A (en) * 2018-05-11 2019-11-19 上海寒武纪信息科技有限公司 A kind of computing device and Related product
CN110472734B (en) * 2018-05-11 2024-03-29 上海寒武纪信息科技有限公司 Computing device and related product
CN108710958A (en) * 2018-05-16 2018-10-26 北京旋极信息技术股份有限公司 Predictive health management method and device, and computer readable storage medium
CN108710958B (en) * 2018-05-16 2022-04-15 北京旋极信息技术股份有限公司 Predictive health management method and device and computer readable storage medium
CN108763360A (en) * 2018-05-16 2018-11-06 北京旋极信息技术股份有限公司 A kind of sorting technique and device, computer readable storage medium
CN108859477A (en) * 2018-07-05 2018-11-23 吉林工程技术师范学院 A kind of children's literature book binder and its control method
CN110806903A (en) * 2018-08-01 2020-02-18 珠海格力电器股份有限公司 Configuration parameter determining method and device of electric cooker
CN112805727A (en) * 2018-10-08 2021-05-14 深爱智能科技有限公司 Artificial neural network operation acceleration device for distributed processing, artificial neural network acceleration system using same, and method for accelerating artificial neural network
CN110059809B (en) * 2018-10-10 2020-01-17 中科寒武纪科技股份有限公司 Computing device and related product
CN110059809A (en) * 2018-10-10 2019-07-26 北京中科寒武纪科技有限公司 A kind of computing device and Related product
CN111047045A (en) * 2018-10-12 2020-04-21 中科寒武纪科技股份有限公司 Distribution system and method for machine learning operation
CN111079908A (en) * 2018-10-18 2020-04-28 上海寒武纪信息科技有限公司 Network-on-chip data processing method, storage medium, computer device and apparatus
US11880330B2 (en) 2018-10-18 2024-01-23 Shanghai Cambricon Information Technology Co., Ltd. Network-on-chip data processing method and device
US11971836B2 (en) 2018-10-18 2024-04-30 Shanghai Cambricon Information Technology Co., Ltd. Network-on-chip data processing method and device
US11960431B2 (en) 2018-10-18 2024-04-16 Guangzhou University Network-on-chip data processing method and device
US11797467B2 (en) 2018-10-18 2023-10-24 Shanghai Cambricon Information Technology Co., Ltd. Data processing device with transmission circuit
CN111079908B (en) * 2018-10-18 2024-02-13 上海寒武纪信息科技有限公司 Network-on-chip data processing method, storage medium, computer device and apparatus
US11809360B2 (en) 2018-10-18 2023-11-07 Shanghai Cambricon Information Technology Co., Ltd. Network-on-chip data processing method and device
US11841816B2 (en) 2018-10-18 2023-12-12 Shanghai Cambricon Information Technology Co., Ltd. Network-on-chip data processing method and device
US11880329B2 (en) 2018-10-18 2024-01-23 Shanghai Cambricon Information Technology Co., Ltd. Arbitration based machine learning data processor
US11868299B2 (en) 2018-10-18 2024-01-09 Shanghai Cambricon Information Technology Co., Ltd. Network-on-chip data processing method and device
US11880328B2 (en) 2018-10-18 2024-01-23 Shanghai Cambricon Information Technology Co., Ltd. Network-on-chip data processing method and device
CN111178492B (en) * 2018-11-09 2020-12-11 安徽寒武纪信息科技有限公司 Computing device, related product and computing method for executing artificial neural network model
CN111178492A (en) * 2018-11-09 2020-05-19 中科寒武纪科技股份有限公司 Computing device, related product and computing method for executing artificial neural network model
CN111258641A (en) * 2018-11-30 2020-06-09 上海寒武纪信息科技有限公司 Operation method, device and related product
CN109542837A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN111260046A (en) * 2018-11-30 2020-06-09 上海寒武纪信息科技有限公司 Operation method, device and related product
CN111260046B (en) * 2018-11-30 2022-12-02 上海寒武纪信息科技有限公司 Operation method, device and related product
CN111258641B (en) * 2018-11-30 2022-12-09 上海寒武纪信息科技有限公司 Operation method, device and related product
CN109919313A (en) * 2019-01-31 2019-06-21 华为技术有限公司 Gradient transmission method and distributed training system
CN109978160A (en) * 2019-03-25 2019-07-05 北京中科寒武纪科技有限公司 Configuration device and method for an artificial intelligence processor, and related product
US11847554B2 (en) 2019-04-18 2023-12-19 Cambricon Technologies Corporation Limited Data processing method and related products
CN111461340B (en) * 2020-03-10 2023-03-31 北京百度网讯科技有限公司 Weight matrix updating method and device and electronic equipment
CN111461340A (en) * 2020-03-10 2020-07-28 北京百度网讯科技有限公司 Weight matrix updating method and device and electronic equipment
CN112329619A (en) * 2020-11-04 2021-02-05 济南博观智能科技有限公司 Face recognition method and device, electronic equipment and readable storage medium
CN114071781B (en) * 2021-11-16 2024-04-12 杭州电子科技大学 Wireless local area network medium access control method
CN114071781A (en) * 2021-11-16 2022-02-18 杭州电子科技大学 Wireless local area network medium access control method

Also Published As

Publication number Publication date
CN110188870B (en) 2021-10-12
CN110188870A (en) 2019-08-30
CN107316078B (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN107316078A (en) Apparatus and method for performing artificial neural network self study computing
EP3451157B1 (en) Device and method for performing forward operation of convolutional neural network
US10713568B2 (en) Apparatus and method for executing reversal training of artificial neural network
CN107341547B (en) Apparatus and method for performing convolutional neural network training
US20200097806A1 (en) Processing method and accelerating device
CN111860811B (en) Device and method for executing full-connection layer forward operation of artificial neural network
CN110929863B (en) Apparatus and method for performing LSTM operations
CN107341541A (en) 2017-11-10 Apparatus and method for performing fully connected layer neural network training
CN107301454B (en) Artificial neural network reverse training device and method supporting discrete data representation
CN107301453B (en) Artificial neural network forward operation device and method supporting discrete data representation
CN107832844A (en) A kind of information processing method and Related product
WO2017185347A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
CN106991476A (en) Apparatus and method for performing artificial neural network forward operation
EP3564863B1 (en) Apparatus for executing lstm neural network operation, and operational method
CN108320018B (en) Artificial neural network operation device and method
EP3451240A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
EP3444758B1 (en) Discrete data representation-supporting apparatus and method for back-training of artificial neural network
CN108960415B (en) Processing apparatus and processing system
CN111178492A (en) Computing device, related product and computing method for executing artificial neural network model
CN109146069B (en) Arithmetic device, arithmetic method, and chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

GR01 Patent grant