CN106599989B

CN106599989B - Neural network unit and neural pe array

Info

Publication number: CN106599989B
Application number: CN201610866030.XA
Authority: CN
Inventors: G. Glenn Henry; Terry Parks
Original assignee: Shanghai Zhaoxin Integrated Circuit Co Ltd
Current assignee: Shanghai Zhaoxin Semiconductor Co Ltd
Priority date: 2015-10-08
Filing date: 2016-09-29
Publication date: 2019-04-09
Anticipated expiration: 2036-09-29
Also published as: TWI579694B; TW201714080A; TWI616825B; CN106598545A; TW201714079A; TW201714120A; CN106599991A; CN106599992A; TWI626587B; TW201714091A; CN106599992B; CN106599989A; CN106598545B; TW201714119A; TWI601062B; CN106650923B; TWI608429B; CN106599990B; CN106599990A; TW201714078A

Abstract

A neural network unit includes a register and a plurality of neural processing units (NPUs). The register is programmed with a representation of the reciprocal of a divisor. Each NPU includes an arithmetic logic unit (ALU), an accumulator, and a reciprocal multiplier. The ALU performs a series of arithmetic and logic operations on a series of operands to generate a series of results, and accumulates the series of results into the accumulator as an accumulated value. The reciprocal multiplier receives the representation of the reciprocal of the divisor and the accumulated value and generates a result that is the quotient of the accumulated value and the divisor.

Description

Neural network unit and neural pe array

Technical field

The present invention relates to a processor, and in particular to a processor that improves the computational performance and efficiency of artificial neural networks.

This application claims international priority to the following United States provisional applications, each of which is incorporated herein by reference in its entirety.

This application is related to the following concurrently filed United States applications, each of which is incorporated herein by reference in its entirety.

Background art

In recent years, artificial neural networks (ANNs) have attracted renewed interest. This research is commonly referred to as deep learning, computer learning, and similar terms. The increase in the computing power of general-purpose processors has also revived, after several decades, interest in artificial neural networks. Recent applications of ANNs include speech and image recognition, among others. The demand for better performance and efficiency in the computations associated with ANNs appears to be growing.

Summary of the invention

In view of this, the present invention provides a neural network unit. The neural network unit includes a register and a plurality of neural processing units (NPUs). The register is programmed with a representation of the reciprocal of a divisor. Each NPU includes an arithmetic logic unit (ALU), an accumulator, and a reciprocal multiplier. The ALU performs a series of arithmetic and logic operations on a series of operands to generate a series of results, and accumulates the series of results into the accumulator as an accumulated value. The reciprocal multiplier receives the representation of the reciprocal of the divisor and the accumulated value and generates a result that is the quotient of the accumulated value and the divisor.

The present invention also provides a method for operating a neural network unit that has a plurality of neural processing units, each of which has an arithmetic logic unit (ALU), an accumulator, and a reciprocal multiplier. The method includes: programming a register with a representation of the reciprocal of a divisor; performing, by each ALU, a series of arithmetic and logic operations on a series of operands to generate a series of results; accumulating, by each ALU, the series of results into the accumulator as an accumulated value; and receiving, by the reciprocal multiplier, the representation of the reciprocal and the accumulated value to generate a result that is the quotient of the accumulated value and the divisor.

The present invention also provides a computer program product encoded in at least one non-transitory computer-usable medium for use with a computing device. The computer program product includes computer-usable program code, embodied in the medium, that specifies a neural network unit. The computer-usable program code includes first program code and second program code. The first program code specifies a register that is programmed with a representation of the reciprocal of a divisor. The second program code specifies a plurality of neural processing units (NPUs), each of which includes an arithmetic logic unit (ALU), an accumulator, and a reciprocal multiplier. The ALU performs a series of arithmetic and logic operations on a series of operands to generate a series of results, and accumulates the series of results into the accumulator as an accumulated value. The reciprocal multiplier receives the representation of the reciprocal and the accumulated value and generates a result that is the quotient of the accumulated value and the divisor.
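As a rough illustration of the reciprocal-multiply step summarized above, the following Python sketch accumulates a series of products and then obtains the quotient by multiplying with a stored reciprocal instead of dividing. The fixed-point format (16 fractional bits), the rounding, and every function name are assumptions made for this example; the summary does not specify them.

```python
# Illustrative sketch only: the 16-bit fractional format and all names below
# are assumptions, not details taken from the specification.
FRAC_BITS = 16  # assumed number of fractional bits in the stored reciprocal


def encode_reciprocal(divisor: int) -> int:
    """Return a fixed-point representation of 1/divisor, rounded to nearest."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return ((1 << FRAC_BITS) + divisor // 2) // divisor


def accumulate(operands_a, operands_b) -> int:
    """Multiply-accumulate a series of operand pairs into one accumulated value."""
    acc = 0
    for a, b in zip(operands_a, operands_b):
        acc += a * b
    return acc


def reciprocal_multiply(acc: int, reciprocal: int) -> int:
    """Approximate acc / divisor by multiplying with the stored reciprocal."""
    return (acc * reciprocal) >> FRAC_BITS


reciprocal_of_4 = encode_reciprocal(4)          # value programmed once
acc = accumulate([1, 2, 3], [4, 5, 6])          # 4 + 10 + 18 = 32
print(reciprocal_multiply(acc, reciprocal_of_4))  # 8
```

Because the reciprocal is stored with finite precision, a sketch like this can differ from exact division by one unit for divisors that are not powers of two; how the actual unit rounds is not stated in this section.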

Specific embodiments of the present invention are further described below with reference to the embodiments and the drawings.

Brief description of the drawings

Fig. 1 is a block diagram of a processor that includes a neural network unit (NNU).

Fig. 2 is a block diagram of a neural processing unit (NPU) of Fig. 1.

Fig. 3 is a block diagram showing how the N multiplexed registers (mux-regs) of the N NPUs of the NNU of Fig. 1 operate as an N-word rotator, or circular shifter, on a row of data words received from the data random access memory (RAM) of Fig. 1.

Fig. 4 is a table showing a program that is stored in the program memory of the NNU of Fig. 1 and executed by the NNU.

Fig. 5 is a timing diagram showing the NNU executing the program of Fig. 4.

Fig. 6A is a block diagram showing the NNU of Fig. 1 executing the program of Fig. 4.

Fig. 6B is a flowchart showing the processor of Fig. 1 executing an architectural program that uses the NNU to perform the multiply-accumulate-activation-function computations typically associated with the neurons of a hidden layer of an artificial neural network, such as those performed by the program of Fig. 4.

Fig. 7 is a block diagram of another embodiment of an NPU of Fig. 1.

Fig. 8 is a block diagram of yet another embodiment of an NPU of Fig. 1.

Fig. 9 is a table showing a program that is stored in the program memory of the NNU of Fig. 1 and executed by the NNU.

Fig. 10 is a timing diagram showing the NNU executing the program of Fig. 9.

Fig. 11 is a block diagram of an embodiment of the NNU of Fig. 1. In the embodiment of Fig. 11, a neuron is split into two parts, an activation function unit part and an arithmetic logic unit part (which also includes the shift register part), and each activation function unit part is shared by multiple arithmetic logic unit parts.

Fig. 12 is a timing diagram showing the NNU of Fig. 11 executing the program of Fig. 4.

Fig. 13 is a timing diagram showing the NNU of Fig. 11 executing the program of Fig. 4.

Fig. 14 is a block diagram showing a move to neural network (MTNN) architectural instruction and its operation with respect to portions of the NNU of Fig. 1.

Fig. 15 is a block diagram showing a move from neural network (MFNN) architectural instruction and its operation with respect to portions of the NNU of Fig. 1.

Fig. 16 is a block diagram of an embodiment of the data RAM of Fig. 1.

Fig. 17 is a block diagram of an embodiment of the weight RAM and a buffer of Fig. 1.

Fig. 18 is a block diagram of a dynamically configurable NPU of Fig. 1.

Fig. 19 is a block diagram showing, for the embodiment of Fig. 18, how the 2N mux-regs of the N NPUs of the NNU of Fig. 1 operate as a rotator on a row of data words received from the data RAM of Fig. 1.

Fig. 20 is a table showing a program that is stored in the program memory of the NNU of Fig. 1 and executed by the NNU, where the NNU has NPUs according to the embodiment of Fig. 18.

Fig. 21 is a timing diagram showing the NNU executing the program of Fig. 20, where the NNU has NPUs as shown in Fig. 18 operating in a narrow configuration.

Fig. 22 is a block diagram of the NNU of Fig. 1, which has NPUs as shown in Fig. 18, executing the program of Fig. 20.

Fig. 23 is a block diagram of another embodiment of a dynamically configurable NPU of Fig. 1.

Fig. 24 is a block diagram showing an example of the data structures used by the NNU of Fig. 1 to perform a convolution operation.

Fig. 25 is a flowchart showing the processor of Fig. 1 executing an architectural program that uses the NNU to perform a convolution of the convolution kernel with the data array of Fig. 24.

Fig. 26A is a program listing of an NNU program that performs a convolution of a data matrix with the convolution kernel of Fig. 24 and writes the result back to the weight RAM.

Fig. 26B is a block diagram showing an embodiment of certain fields of the control register of the NNU of Fig. 1.

Fig. 27 is a block diagram showing an example of the weight RAM of Fig. 1 populated with input data on which the NNU of Fig. 1 performs a pooling operation.

Fig. 28 is a program listing of an NNU program that performs a pooling operation on the input data matrix of Fig. 27 and writes the result back to the weight RAM.

Fig. 29A is a block diagram of an embodiment of the control register of Fig. 1.

Fig. 29B is a block diagram of another embodiment of the control register of Fig. 1.

Fig. 29C is a block diagram showing an embodiment in which the reciprocal of Fig. 29A is stored in two parts.

Fig. 30 is a block diagram of an embodiment of the activation function unit (AFU) of Fig. 2.

Fig. 31 shows an example of the operation of the AFU of Fig. 30.

Fig. 32 shows a second example of the operation of the AFU of Fig. 30.

Fig. 33 shows a third example of the operation of the AFU of Fig. 30.

Fig. 34 is a block diagram showing portions of the processor and the NNU of Fig. 1 in more detail.

Fig. 35 is a block diagram showing a processor that has a variable-rate NNU.

Fig. 36A is a timing diagram showing an example of the operation of a processor whose NNU operates in normal mode, that is, at the primary clock rate.

Fig. 36B is a timing diagram showing an example of the operation of a processor whose NNU operates in relaxed mode, in which the clock rate is lower than the primary clock rate.

Fig. 37 is a flowchart showing the operation of the processor of Fig. 35.

Fig. 38 is a block diagram showing the sequencer of the NNU in more detail.

Fig. 39 is a block diagram showing certain fields of the control and status register of the NNU.

Fig. 40 is a block diagram showing an example of an Elman recurrent neural network (RNN).

Fig. 41 is a block diagram showing an example of the layout of data within the data RAM and the weight RAM of the NNU as it performs the calculations associated with the Elman RNN of Fig. 40.

Fig. 42 is a table showing a program that is stored in the program memory of the NNU and executed by the NNU, using data and weights laid out according to Fig. 41, to accomplish an Elman RNN.

Fig. 43 is a block diagram showing an example of a Jordan recurrent neural network (RNN).

Fig. 44 is a block diagram showing an example of the layout of data within the data RAM and the weight RAM of the NNU as it performs the calculations associated with the Jordan RNN of Fig. 43.

Fig. 45 is a table showing a program that is stored in the program memory of the NNU and executed by the NNU, using data and weights laid out according to Fig. 44, to accomplish a Jordan RNN.

Fig. 46 is a block diagram showing an embodiment of a long short-term memory (LSTM) cell.

Fig. 47 is a block diagram showing an example of the layout of data within the data RAM and the weight RAM of the NNU as it performs the calculations associated with a layer of the LSTM cells of Fig. 46.

Fig. 48 is a table showing a program that is stored in the program memory of the NNU and executed by the NNU, using data and weights laid out according to Fig. 47, to accomplish the calculations associated with a layer of LSTM cells.

Fig. 49 is a block diagram showing an embodiment of an NNU in which the NPU groups have output buffer masking and feedback capability.

Fig. 50 is a block diagram showing an example of the layout of data within the data RAM, the weight RAM, and the output buffer of the NNU of Fig. 49 as it performs the calculations associated with a layer of the LSTM cells of Fig. 46.

Fig. 51 is a table showing a program that is stored in the program memory of the NNU and executed by the NNU of Fig. 49, using data and weights laid out according to Fig. 50, to accomplish the calculations associated with a layer of LSTM cells.

Fig. 52 is a block diagram showing an embodiment of an NNU in which the NPU groups have output buffer masking and feedback capability and share activation function units.

Fig. 53 is a block diagram showing another embodiment of the layout of data within the data RAM, the weight RAM, and the output buffer of the NNU of Fig. 49 as it performs the calculations associated with a layer of the LSTM cells of Fig. 46.

Fig. 54 is a table showing a program that is stored in the program memory of the NNU and executed by the NNU of Fig. 49, using data and weights laid out according to Fig. 53, to accomplish the calculations associated with a layer of LSTM cells.

Fig. 55 is a block diagram showing portions of an NPU according to another embodiment of the present invention.

Fig. 56 is a block diagram showing an example of the layout of data within the data RAM and the weight RAM of the NNU as it performs the calculations associated with the Jordan RNN of Fig. 43 using the embodiment of Fig. 55.

Fig. 57 is a table showing a program that is stored in the program memory of the NNU and executed by the NNU, using data and weights laid out according to Fig. 56, to accomplish a Jordan RNN.

Detailed description of the embodiments

Processor with architectural neural network unit

Fig. 1 is a block diagram of a processor 100 that includes a neural network unit (NNU) 121. As shown, the processor 100 includes an instruction fetch unit 101, an instruction cache 102, an instruction translator 104, a rename unit 106, reservation stations 108, media registers 118, general purpose registers 116, execution units 112 other than the NNU 121, and a memory subsystem 114.

The processor 100 is an electronic device that functions as a central processing unit (CPU) on an integrated circuit. The processor 100 receives digital data as input, processes the data according to instructions fetched from a memory, and generates as output the results of the operations prescribed by the instructions. The processor 100 may be used in a desktop, mobile, or tablet computer for applications such as computation, text editing, multimedia display, and web browsing. The processor 100 may also be placed in an embedded system to control a wide variety of devices, including appliances, mobile phones, smartphones, vehicles, and industrial controllers. A CPU is the electronic circuitry (i.e., hardware) that executes the instructions of a computer program (also called a computer application or application) by performing operations on data, including arithmetic, logical, and input/output operations. An integrated circuit is a set of electronic circuits fabricated on a small piece of semiconductor material, typically silicon. The term integrated circuit is also commonly used to refer to a chip, microchip, or die.

The instruction fetch unit 101 controls the fetching of architectural instructions 103 from system memory (not shown) into the instruction cache 102. The instruction fetch unit 101 provides a fetch address to the instruction cache 102 that specifies the memory address of a cache line of architectural instruction bytes that the processor 100 fetches into the instruction cache 102. The fetch address is based on the current value of the instruction pointer (not shown), or program counter, of the processor 100. Normally, the program counter is incremented sequentially by the size of an instruction, unless a control instruction such as a branch, call, or return is encountered in the instruction stream, or an exception condition such as an interrupt, trap, exception, or fault occurs, in which case the program counter is updated with a non-sequential address such as a branch target address, return address, or exception vector. In general, the program counter is updated in response to the execution of instructions by the execution units 112/121. The program counter may also be updated when an exception condition is detected, for example when the instruction translator 104 encounters an instruction 103 that is not defined by the instruction set architecture of the processor 100.

The instruction cache 102 caches the architectural instructions 103 fetched from a system memory to which the processor 100 is coupled. The architectural instructions 103 include a move to neural network (MTNN) instruction and a move from neural network (MFNN) instruction, which are described in more detail below. In one embodiment, the architectural instructions 103 are instructions of the x86 instruction set architecture (ISA), with the addition of the MTNN and MFNN instructions. In this disclosure, an x86 ISA processor is understood to be a processor that, when executing the same machine language instructions, generates the same results at the instruction set architecture level as the […] processor. However, other instruction set architectures, such as Advanced RISC Machines (ARM), Sun Scalable Processor Architecture (SPARC), or Performance Optimization With Enhanced RISC Performance Computing (PowerPC), may be used in other embodiments of the invention. The instruction cache 102 provides the architectural instructions 103 to the instruction translator 104, which translates them into microinstructions 105.

The microinstructions 105 are provided to the rename unit 106 and are eventually executed by the execution units 112/121. The microinstructions 105 implement the architectural instructions. Preferably, the instruction translator 104 includes a first portion that translates frequently executed and/or relatively simple architectural instructions 103 into microinstructions 105. The instruction translator 104 also includes a second portion that has a microcode unit (not shown). The microcode unit has a microcode memory that holds microcode instructions, which implement the complex and/or infrequently used instructions of the architectural instruction set. The microcode unit also includes a microsequencer that provides a non-architectural micro-program counter (micro-PC) to the microcode memory. Preferably, the microcode instructions are translated into microinstructions 105 by a microtranslator (not shown). A selector selects the microinstructions 105 from either the first portion or the second portion for provision to the rename unit 106, depending on whether the microcode unit currently has control.

The rename unit 106 renames the architectural registers specified in the architectural instructions 103 to physical registers of the processor 100. Preferably, the processor 100 includes a reorder buffer (not shown). The rename unit 106 allocates an entry in the reorder buffer to each microinstruction 105 in program order. This enables the processor 100 to retire the microinstructions 105, and their corresponding architectural instructions 103, in program order. In one embodiment, the media registers 118 are 256 bits wide and the general purpose registers 116 are 64 bits wide. In one embodiment, the media registers 118 are x86 media registers, such as Advanced Vector Extensions (AVX) registers.

In one embodiment, each entry in the reorder buffer includes storage for the result of the microinstruction 105. In addition, the processor 100 includes an architectural register file that has a physical register for each architectural register, for example the media registers 118, the general purpose registers 116, and the other architectural registers. (Preferably, separate register files are used for the media registers 118 and the general purpose registers 116 because, for example, they differ in size.) For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with the reorder buffer index of the newest older microinstruction 105 that writes to that architectural register. When an execution unit 112/121 completes execution of the microinstruction 105, it writes the result to the reorder buffer entry of the microinstruction 105. When the microinstruction 105 retires, a retire unit (not shown) writes the result from the reorder buffer entry of the microinstruction to the register of the physical register file that is associated with the architectural destination register specified by the retiring microinstruction 105.

In another embodiment, the processor 100 includes a physical register file that has more physical registers than there are architectural registers, but does not include an architectural register file, and the reorder buffer entries do not include result storage. (Preferably, separate physical register files are used for the media registers 118 and the general purpose registers 116 because they differ in size.) The processor 100 also includes a pointer table that has an associated pointer for each architectural register. For each destination operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the destination operand field of the microinstruction 105 with a pointer to a free register in the physical register file. If no register in the physical register file is free, the rename unit 106 stalls the pipeline. For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with a pointer to the register in the physical register file that was assigned to the newest older microinstruction 105 that writes to that architectural register. When an execution unit 112/121 completes execution of the microinstruction 105, it writes the result to the register of the physical register file pointed to by the destination operand field of the microinstruction 105. When the microinstruction 105 retires, the retire unit copies the destination operand field value of the microinstruction 105 to the pointer in the pointer table that is associated with the architectural destination register specified by the retiring microinstruction 105.
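The second renaming scheme above (a physical register file plus a pointer table) can be pictured with a small Python model. This is a minimal sketch under stated assumptions, not the processor's implementation: the class and method names are hypothetical, the separate speculative map `latest` is an assumed way of tracking the newest writer of each architectural register, and returning the previously committed register to the free list at retirement is an assumption that the text does not describe.

```python
# Illustrative sketch only: names are hypothetical, and the free-list
# recycling at retirement is an assumption not stated in the text.
class RenameUnit:
    def __init__(self, arch_regs, num_physical):
        if num_physical < len(arch_regs):
            raise ValueError("need at least one physical register per architectural register")
        # Pointer table: committed mapping, one pointer per architectural register.
        self.pointer_table = {name: i for i, name in enumerate(arch_regs)}
        # Assumed speculative map: physical register of the newest writer.
        self.latest = dict(self.pointer_table)
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, sources):
        """Return (dest_ptr, src_ptrs), or None when the pipeline must stall."""
        if not self.free:
            return None  # no free physical register: stall
        # Sources are looked up before the destination mapping is updated.
        src_ptrs = [self.latest[s] for s in sources]
        dest_ptr = self.free.pop(0)
        self.latest[dest] = dest_ptr
        return dest_ptr, src_ptrs

    def retire(self, dest, dest_ptr):
        """Copy the destination pointer into the pointer table on retirement."""
        previous = self.pointer_table[dest]
        self.pointer_table[dest] = dest_ptr
        self.free.append(previous)  # assumed recycling of the old register


unit = RenameUnit(["r0", "r1"], num_physical=4)
print(unit.rename("r0", ["r0", "r1"]))  # (2, [0, 1])
print(unit.rename("r1", ["r0"]))        # (3, [2])
print(unit.rename("r0", []))            # None
```

The third call returns `None` because both spare physical registers are in use, which corresponds to the stall described above; once `unit.retire("r0", 2)` runs, a register becomes available again in this model.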

The reservation stations 108 hold microinstructions 105 until they are ready to be issued to an execution unit 112/121 for execution. A microinstruction 105 is ready to be issued when all of its source operands are available and an execution unit 112/121 is available to execute it. The execution units 112/121 receive register source operands from the reorder buffer or the architectural register file in the first embodiment described above, or from the physical register file in the second embodiment. In addition, the execution units 112/121 may receive register source operands directly from other execution units via result forwarding buses (not shown). The execution units 112/121 may also receive from the reservation stations 108 the immediate operands specified by the microinstructions 105. The MTNN and MFNN architectural instructions 103 include an immediate operand that specifies a function to be performed by the NNU 121; this operand is carried in one or more of the microinstructions 105 into which the MTNN and MFNN architectural instructions 103 are translated, as described in more detail below.

The execution units 112 include one or more load/store units (not shown) that load data from the memory subsystem 114 and store data to the memory subsystem 114. Preferably, the memory subsystem 114 includes a memory management unit (not shown), which may include, for example, translation lookaside buffers, a tablewalk unit, a level-1 data cache (alongside the instruction cache 102), a level-2 unified cache, and a bus interface unit that serves as the interface between the processor 100 and system memory. In one embodiment, the processor 100 of Fig. 1 represents one of multiple processing cores of a multi-core processor that share a last-level cache. The execution units 112 may also include integer units, media units, floating-point units, and a branch unit.

The NNU 121 includes a weight random access memory (RAM) 124, a data RAM 122, N neural processing units (NPUs) 126, a program memory 129, a sequencer 128, and control and status registers 127. The NPUs 126 function conceptually as the neurons of a neural network. The weight RAM 124, the data RAM 122, and the program memory 129 can all be written and read by the MTNN and MFNN architectural instructions 103, respectively. The weight RAM 124 is arranged as W rows of N weight words each, and the data RAM 122 is arranged as D rows of N data words each. Each data word and each weight word consists of multiple bits, preferably 8, 9, 12, or 16 bits. Each data word serves as the output value (sometimes called the activation) of a neuron of the previous layer of the network, and each weight word serves as the weight associated with a connection coming into a neuron of the current layer of the network. Although in many uses of the NNU 121 the words, or operands, held in the weight RAM 124 are in fact weights associated with a connection coming into a neuron, it should be noted that in other uses of the NNU 121 the words held in the weight RAM 124 are not weights, but are still referred to as "weight words" because they are stored in the weight RAM 124. For example, in some uses of the NNU 121, such as the convolution example of Figs. 24 through 26A or the pooling example of Figs. 27 through 28, the weight RAM 124 may hold objects other than weights, such as the elements of a data matrix (e.g., image pixel data). Similarly, although in many uses of the NNU 121 the words, or operands, held in the data RAM 122 are in fact the output values, or activations, of neurons, it should be noted that in other uses of the NNU 121 the words held in the data RAM 122 are not, but are still referred to as "data words" because they are stored in the data RAM 122. For example, in some uses of the NNU 121, such as the convolution example of Figs. 24 through 26A, the data RAM 122 may hold values that are not neuron outputs, such as the elements of a convolution kernel.
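To make the W-by-N and D-by-N arrangement described above concrete, the following Python sketch models each memory as a list of rows and lets each of N processing units accumulate the product of its own data word and weight word over a sequence of selected rows. It is a simplified illustration: the function name is hypothetical, the sizes are tiny, and it assumes that unit j always uses word j of each selected row, which ignores the rotator mentioned in the description of Fig. 3.

```python
# Illustrative sketch only: the function name is hypothetical and unit j is
# assumed to use word j of every selected row (no rotation between units).
def multiply_accumulate_rows(data_ram, weight_ram, data_rows, weight_rows):
    """Accumulate data * weight per unit over pairs of selected rows.

    data_ram   -- list of D rows, each a list of N data words
    weight_ram -- list of W rows, each a list of N weight words
    data_rows, weight_rows -- equal-length sequences of row addresses
    """
    n = len(data_ram[0])
    accumulators = [0] * n  # one accumulator per processing unit
    for d_addr, w_addr in zip(data_rows, weight_rows):
        for unit in range(n):
            accumulators[unit] += data_ram[d_addr][unit] * weight_ram[w_addr][unit]
    return accumulators


data_ram = [[1, 2], [3, 4]]                  # D = 2 rows, N = 2 words per row
weight_ram = [[10, 20], [30, 40], [50, 60]]  # W = 3 rows, N = 2 words per row
print(multiply_accumulate_rows(data_ram, weight_ram, [0, 1], [0, 2]))  # [160, 280]
```

In this model the order of the row addresses plays the role that the text assigns to the address sequence: changing which rows are paired changes which values each unit combines.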

In one embodiment, the NPUs 126 and the sequencer 128 comprise combinational logic, sequential logic, state machines, or a combination thereof. An architectural instruction (e.g., MFNN instruction 1500) loads the contents of the status register 127 into one of the general purpose registers 116 to determine the status of the NNU 121, for example, whether the NNU 121 has completed a command or a program from the program memory 129, or whether the NNU 121 is free to receive a new command or to start a new NNU program.

The number of neural processing units 126 can be increased as needed, and the width and depth of weight random access memory 124 and data random access memory 122 can be scaled accordingly. In a preferred embodiment, weight random access memory 124 is larger than data random access memory 122, because a typical neural network layer has many connections and therefore needs more storage for the weights associated with each neuron. Various embodiments are disclosed herein regarding the size of the data and weight words, the sizes of weight random access memory 124 and data random access memory 122, and the number of neural processing units 126. In one embodiment, neural network unit 121 has a 64KB data random access memory 122 (8192 bits x 64 columns), a 2MB weight random access memory 124 (8192 bits x 2048 columns), and 512 neural processing units 126. This neural network unit 121 is manufactured in a 16-nanometer process of Taiwan Semiconductor Manufacturing Company (TSMC) and occupies an area of approximately 3.3 square millimeters.
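
The sizes quoted for this embodiment can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are made up here, and it assumes each memory column is 8192 bits wide, the per-clock width stated elsewhere in this description.

```python
# Illustrative constants for the 512-NPU embodiment (names are hypothetical).
COLUMN_BITS = 8192          # bits per memory column
DATA_RAM_COLUMNS = 64       # D, columns of data random access memory 122
WEIGHT_RAM_COLUMNS = 2048   # W, columns of weight random access memory 124


def capacity_bytes(column_bits, columns):
    """Capacity of a memory with `columns` columns of `column_bits` bits each."""
    return column_bits * columns // 8


data_ram_bytes = capacity_bytes(COLUMN_BITS, DATA_RAM_COLUMNS)
weight_ram_bytes = capacity_bytes(COLUMN_BITS, WEIGHT_RAM_COLUMNS)
words_per_column = COLUMN_BITS // 16   # one 16-bit word per NPU

print(data_ram_bytes // 1024, "KB data RAM")
print(weight_ram_bytes // (1024 * 1024), "MB weight RAM")
print(words_per_column, "16-bit words per column")
```

With 16-bit words, one column holds exactly one word for each of the 512 neural processing units 126.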

Sequencer 128 fetches instructions from program memory 129 and executes them, which includes generating the address and control signals provided to data random access memory 122, weight random access memory 124, and the neural processing units 126. Sequencer 128 generates a memory address 123 and a read command for data random access memory 122 to select one of the D columns of N data words to provide to the N neural processing units 126. Sequencer 128 also generates a memory address 125 and a read command for weight random access memory 124 to select one of the W columns of N weight words to provide to the N neural processing units 126. The sequence of the addresses 123, 125 that sequencer 128 generates for the neural processing units 126 determines the "connections" between neurons. Sequencer 128 also generates a memory address 123 and a write command for data random access memory 122 to select one of the D columns of N data words to be written by the N neural processing units 126. Sequencer 128 also generates a memory address 125 and a write command for weight random access memory 124 to select one of the W columns of N weight words to be written by the N neural processing units 126. Sequencer 128 also generates a memory address 131 for program memory 129 to select the neural network unit instruction provided to sequencer 128, as described in following sections. Memory address 131 corresponds to a program counter (not shown), which sequencer 128 normally increments through sequential locations of program memory 129 unless sequencer 128 encounters a control instruction, such as a loop instruction (see, for example, Figure 26A), in which case sequencer 128 updates the program counter to the target address of the control instruction. Sequencer 128 also generates control signals to the neural processing units 126 that instruct them to perform various operations or functions, such as initialization, arithmetic/logical operations, rotate/shift operations, activation functions, and write-back operations, examples of which are described in more detail in following sections (see, for example, micro-operations 3418 of Figure 34).
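
The program-counter behaviour described above can be sketched in a few lines. This is a minimal model under stated assumptions, not the sequencer's actual design: the tuple-based instruction encoding and the semantics of the `"loop"` instruction (a target address plus an iteration count) are hypothetical and chosen only to show sequential increment versus a control-instruction redirect.

```python
def run_sequencer(program, max_steps=1000):
    """Return the sequence of program-memory addresses (address 131) fetched.

    `program` is a list of tuples. A ("loop", target, count) tuple is a
    hypothetical control instruction that makes the enclosed body run
    `count` times in total; every other tuple is treated as an ordinary
    instruction that simply advances the program counter.
    """
    pc = 0
    loop_counter = None
    trace = []
    while pc < len(program) and len(trace) < max_steps:
        trace.append(pc)
        instruction = program[pc]
        if instruction[0] == "loop":
            _, target, count = instruction
            if loop_counter is None:
                loop_counter = count        # first encounter: load the count
            loop_counter -= 1
            if loop_counter > 0:
                pc = target                 # control instruction: jump to target
                continue
            loop_counter = None             # loop finished: fall through
        pc += 1                             # default: next sequential location
    return trace


example_program = [("init",), ("mac",), ("loop", 1, 3), ("write",)]
trace = run_sequencer(example_program)
print(trace)
```

In this toy program the instruction at address 1 is fetched three times before the sequencer falls through to address 3.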

The N neural processing units 126 generate N result words 133, which can be written back to a column of weight random access memory 124 or data random access memory 122. In a preferred embodiment, weight random access memory 124 and data random access memory 122 are coupled directly to the N neural processing units 126. More specifically, weight random access memory 124 and data random access memory 122 are dedicated to the neural processing units 126 and are not shared with the other execution units 112 of processor 100, so the neural processing units 126 can consume a complete column from one or both of weight random access memory 124 and data random access memory 122 on every clock cycle on a sustained basis, preferably in a pipelined fashion. In one embodiment, data random access memory 122 and weight random access memory 124 can each provide 8192 bits to the neural processing units 126 on every clock cycle. The 8192 bits can be processed as 512 16-bit words or as 1024 8-bit words, as described in more detail below.
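
To make the two interpretations of an 8192-bit column concrete, the following sketch splits one column into 512 16-bit words or 1024 8-bit words. The function name, the use of a Python integer to hold the column, and the least-significant-word-first ordering are assumptions made for illustration; the description does not specify the bit ordering.

```python
COLUMN_BITS = 8192


def split_column(column, word_bits):
    """Split an integer holding one 8192-bit column into words.

    Words are returned least-significant first (an assumed ordering).
    """
    mask = (1 << word_bits) - 1
    count = COLUMN_BITS // word_bits
    return [(column >> (i * word_bits)) & mask for i in range(count)]


# Build a sample column of 1024 bytes (8192 bits) with a recognizable pattern.
column = int.from_bytes(bytes(range(256)) * 4, "little")

words16 = split_column(column, 16)   # one word per NPU in the 512-NPU case
words8 = split_column(column, 8)

print(len(words16), len(words8))
```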

The size of the data set that neural network unit 121 can process is not limited by the sizes of weight random access memory 124 and data random access memory 122, but only by the size of system memory, because data and weights can be moved between system memory and weight random access memory 124 and data random access memory 122 using the MTNN and MFNN instructions (e.g., through the media registers 118). In one embodiment, data random access memory 122 is dual-ported, so that data words can be written to data random access memory 122 while data words are concurrently being read from or written to data random access memory 122. Furthermore, the large memory hierarchy of memory subsystem 114, including the cache memories, provides very high data bandwidth for transfers between system memory and neural network unit 121. Still further, in a preferred embodiment, memory subsystem 114 includes hardware data prefetchers that track memory access patterns, such as loads of neural data and weights from system memory, and perform data prefetches into the cache hierarchy to facilitate high-bandwidth, low-latency transfers to weight random access memory 124 and data random access memory 122.

Although in the embodiments described herein one of the operands provided to each neural processing unit 126 comes from a weight memory and is referred to as a weight, a term commonly used in neural networks, it should be understood that these operands may be other types of data associated with calculations whose speed can be improved by these apparatuses.

Fig. 2 is a block diagram showing a neural processing unit 126 of Fig. 1. As shown, the neural processing unit 126 operates to perform many functions or operations. In particular, the neural processing unit 126 can operate as a neuron, or node, in an artificial neural network to perform a classic multiply-accumulate function or operation. That is, generally speaking, a neural processing unit 126 (neuron) is configured to: (1) receive an input value from each neuron that has a connection to it, typically but not necessarily from the previous layer of the artificial neural network; (2) multiply each input value by the corresponding weight value associated with its connection to generate a product; (3) add all the products to generate a sum; and (4) perform an activation function on the sum to generate the output of the neuron. However, rather than performing all the multiplications associated with all the connection inputs and then adding all the products together, as in a conventional approach, each neuron of the present invention performs, in a given clock cycle, the weight multiplication associated with one of the connection inputs and then adds (accumulates) the product to the accumulated value of the products associated with the connection inputs processed in the preceding clock cycles. Assuming there are M connections to the neuron, after all M products have been accumulated (which takes approximately M clock cycles), the neuron performs the activation function on the accumulated value to generate the output or result. This approach has the advantage of requiring fewer multipliers and only a smaller, simpler, and faster adder circuit in the neuron (e.g., a two-input adder), rather than the adder that would be needed to add all of the products of the connection inputs, or even a subset of them. This, in turn, facilitates a very large number (N) of neurons (neural processing units 126) in neural network unit 121, so that after approximately M clock cycles neural network unit 121 has generated the outputs of all of this large number (N) of neurons. Finally, a neural network unit 121 built from such neurons performs efficiently as an artificial neural network layer for a wide range of numbers of connection inputs. That is, as M increases or decreases from layer to layer, the number of clock cycles required to generate the neuron outputs increases or decreases correspondingly, and the resources (e.g., the multipliers and accumulators) are fully utilized. By comparison, in a conventional design, some of the multipliers and part of the adder go unused for smaller values of M. Thus, with respect to the number of connection inputs to the neurons of the neural network unit, the embodiments described herein offer both flexibility and efficiency, and provide high performance.
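
The time-multiplexed scheme described above, one multiply-accumulate per clock cycle for M clock cycles followed by the activation function, can be sketched as follows. This is an illustrative model only; the function names and the choice of a rectify activation are assumptions, not taken from the description.

```python
def rectify(x):
    """Example activation function: negative sums become zero."""
    return x if x > 0 else 0


def neuron_output(inputs, weights, activation):
    """Model one neuron that handles one connection input per clock cycle.

    Returns the activated output and the number of clock cycles spent
    accumulating (one per connection, i.e. M in the description).
    """
    accumulator = 0
    clocks = 0
    for input_value, weight in zip(inputs, weights):
        accumulator += input_value * weight   # one multiply-accumulate per clock
        clocks += 1
    return activation(accumulator), clocks


inputs = [3, -1, 4, 2]
weights = [2, 5, -1, 3]
output, clocks = neuron_output(inputs, weights, rectify)
print(output, clocks)
```

A single multiplier and a two-input adder suffice per neuron, at the cost of M clock cycles instead of one wide summation.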

Neural processing unit 126 includes a register 205, a two-input multiplexed register 208, an arithmetic logic unit (ALU) 204, an accumulator 202, and an activation function unit (AFU) 212. Register 205 receives a weight word 206 from weight random access memory 124 and provides it on its output 203 in a subsequent clock cycle. Multiplexed register 208 selects one of its two inputs 207, 211 to store in its register and then provides it on its output 209 in a subsequent clock cycle. Input 207 receives a data word from data random access memory 122. The other input 211 receives the output 209 of the adjacent neural processing unit 126. The neural processing unit 126 shown in Fig. 2 is denoted neural processing unit J among the N neural processing units shown in Fig. 1. That is, neural processing unit J is a representative instance of the N neural processing units 126. In a preferred embodiment, the input 211 of the multiplexed register 208 of instance J of neural processing unit 126 receives the output 209 of the multiplexed register 208 of instance J-1 of neural processing unit 126, and the output 209 of the multiplexed register 208 of neural processing unit J is provided to the input 211 of the multiplexed register 208 of instance J+1 of neural processing unit 126. In this manner, the multiplexed registers 208 of the N neural processing units 126 collectively operate as an N-word rotator, or circular shifter, as described in more detail below with respect to Fig. 3. A control input 213 controls which of the two inputs multiplexed register 208 selects to store in its register and subsequently provide on output 209.

Arithmetic logic unit 204 has three inputs. One input receives the weight word 203 from register 205. Another input receives the output 209 of multiplexed register 208. The third input receives the output 217 of accumulator 202. Arithmetic logic unit 204 performs arithmetic and/or logical operations on its inputs to generate a result that it provides on its output. In a preferred embodiment, the arithmetic and/or logical operations performed by arithmetic logic unit 204 are specified by instructions stored in program memory 129. For example, the multiply-accumulate instruction of Fig. 4 specifies a multiply-accumulate operation, i.e., the result 215 is the sum of the accumulator 202 value 217 and the product of the weight word 203 and the data word of the multiplexed register 208 output 209. Other operations that may be specified include, but are not limited to: the result 215 is the passed-through value of the multiplexed register output 209; the result 215 is the passed-through value of the weight word 203; the result 215 is zero; the result 215 is the sum of the accumulator 202 value 217 and the weight word 203; the result 215 is the sum of the accumulator 202 value 217 and the multiplexed register output 209; the result 215 is the maximum of the accumulator 202 value 217 and the weight word 203; the result 215 is the maximum of the accumulator 202 value 217 and the multiplexed register output 209.
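
The operations listed above can be summarized as a small dispatch table. The sketch below is an assumption-laden illustration: the operation names are invented here (the description does not give instruction mnemonics in this passage), and plain Python integers stand in for the hardware word widths.

```python
# Hypothetical operation names; each maps (accumulator value 217,
# weight word 203, data word on output 209) to result 215.
ALU_OPERATIONS = {
    "multiply_accumulate": lambda acc, weight, data: acc + weight * data,
    "pass_data":           lambda acc, weight, data: data,
    "pass_weight":         lambda acc, weight, data: weight,
    "zero":                lambda acc, weight, data: 0,
    "add_weight":          lambda acc, weight, data: acc + weight,
    "add_data":            lambda acc, weight, data: acc + data,
    "max_weight":          lambda acc, weight, data: max(acc, weight),
    "max_data":            lambda acc, weight, data: max(acc, data),
}


def alu(operation, acc, weight, data):
    """Return result 215 for the selected operation."""
    return ALU_OPERATIONS[operation](acc, weight, data)


print(alu("multiply_accumulate", 10, 3, 4))
print(alu("max_data", 1, 3, 4))
```

The two maximum operations are the kind of primitive a pooling computation would use, while the add operations bypass the multiplier.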

Arithmetic logic unit 204 provides its output 215 to accumulator 202 for storage. Arithmetic logic unit 204 includes a multiplier 242 that multiplies the weight word 203 by the data word of the multiplexed register 208 output 209 to generate a product 246. In one embodiment, multiplier 242 multiplies two 16-bit operands to generate a 32-bit result. Arithmetic logic unit 204 also includes an adder 244 that adds the product 246 to the output 217 of accumulator 202 to generate a sum, which is the result 215 of the accumulate operation stored in accumulator 202. In one embodiment, adder 244 adds the 32-bit result of multiplier 242 to a 41-bit value 217 of accumulator 202 to generate a 41-bit result. In this manner, using the rotator behaviour of multiplexed register 208 over the course of multiple clock cycles, neural processing unit 126 accomplishes the sum of products that a neuron requires in a neural network. Arithmetic logic unit 204 may also include other circuit elements to perform other arithmetic/logical operations such as those described above. In one embodiment, a second adder subtracts the weight word 203 from the data word of the multiplexed register 208 output 209 to generate a difference, which adder 244 then adds to the output 217 of accumulator 202 to generate a result 215 that is accumulated in accumulator 202. In this manner, over the course of multiple clock cycles, neural processing unit 126 accomplishes a sum of differences. In a preferred embodiment, although the weight word 203 and the data word 209 are the same size (in bits), they may have different binary point locations, as described in more detail below. In a preferred embodiment, multiplier 242 and adder 244 are an integer multiplier and an integer adder, which makes arithmetic logic unit 204 less complex, smaller, faster, and lower in power consumption than an arithmetic logic unit that uses floating-point arithmetic. However, in other embodiments of the invention, arithmetic logic unit 204 may also perform floating-point operations.
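
The widths quoted above (16-bit operands, 32-bit products, a 41-bit accumulator) can be sanity-checked with integer arithmetic. The sketch below assumes unsigned operands and up to 512 accumulated products, the figure given elsewhere in this description; both the unsigned interpretation and the names are assumptions made for illustration.

```python
OPERAND_BITS = 16
PRODUCT_BITS = 32
ACCUMULATOR_BITS = 41
MAX_PRODUCTS = 512   # largest number of products accumulated in this embodiment


def fits_unsigned(value, bits):
    """True if `value` is representable as an unsigned integer of `bits` bits."""
    return 0 <= value < (1 << bits)


max_operand = (1 << OPERAND_BITS) - 1
max_product = max_operand * max_operand      # worst-case 16 x 16 product
max_sum = MAX_PRODUCTS * max_product         # worst-case accumulated value
bits_needed = max_sum.bit_length()

print(fits_unsigned(max_product, PRODUCT_BITS))
print(fits_unsigned(max_sum, ACCUMULATOR_BITS))
print(bits_needed)
```

Each doubling of the number of accumulated products costs one extra accumulator bit, so 512 products add 9 bits to the 32-bit product width.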

Although Fig. 2 shows only a multiplier 242 and an adder 244 in arithmetic logic unit 204, in a preferred embodiment arithmetic logic unit 204 includes other elements to perform the other operations described above. For example, arithmetic logic unit 204 may include a comparator (not shown) that compares accumulator 202 with a data/weight word, and a multiplexer (not shown) that selects the larger (maximum) of the two values, as indicated by the comparator, for storage in accumulator 202. In another example, arithmetic logic unit 204 includes selection logic (not shown) that bypasses multiplier 242 with a data/weight word, so that adder 244 adds the data/weight word to the accumulator 202 value 217 to generate a sum for storage in accumulator 202. These additional operations are described in more detail in following sections, e.g., with respect to Figure 18 to Figure 29A, and are useful for performing, for example, convolution and pooling operations.

Activation function unit 212 receives the output 217 of accumulator 202. Activation function unit 212 performs an activation function on the accumulator 202 output to generate the result 133 of Fig. 1. Generally speaking, the activation function in a neuron of an intermediate layer of an artificial neural network serves to normalize the accumulated sum of products, typically in a nonlinear fashion. To "normalize" the accumulated sum, the activation function of the current neuron produces a resulting value within the range of values that the neurons connected to the current neuron expect to receive as input. (The normalized result is sometimes called an "activation"; as used herein, an activation is the output of the current node, which a receiving node multiplies by the weight associated with the connection between the outputting node and the receiving node to generate a product that is accumulated with the products associated with the other input connections to the receiving node.) For example, where the receiving/connected neurons expect to receive as input a value between 0 and 1, the outputting neuron may need to nonlinearly squash and/or adjust (e.g., shift upward to convert negative values to positive values) an accumulated sum that falls outside the range of 0 to 1 so that it falls within the expected range. Thus, the operation that activation function unit 212 performs on the accumulator 202 value 217 brings the result 133 within a known range. The results 133 of all N neural processing units 126 can be written back concurrently to data random access memory 122 or weight random access memory 124. In a preferred embodiment, activation function unit 212 is configured to perform multiple activation functions, and an input, e.g., from control register 127, selects one of the activation functions to perform on the output 217 of accumulator 202. The activation functions may include, but are not limited to, a step function, a rectify function, a sigmoid function, a hyperbolic tangent function, and a softplus function (also called a smooth rectify function). The softplus function is the analytic function f(x) = ln(1 + e^x), i.e., the natural logarithm of the sum of 1 and e^x, where "e" is Euler's number and x is the input 217 to the function. In a preferred embodiment, the activation functions may also include a pass-through function that passes through the accumulator 202 value 217, or a portion thereof, as described in more detail below. In one embodiment, circuitry of activation function unit 212 performs the activation function in a single clock cycle. In one embodiment, activation function unit 212 includes tables that receive the accumulated value and, for certain activation functions such as the sigmoid, hyperbolic tangent, and softplus functions, output a value that closely approximates the value the true activation function would provide.
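
The activation functions named above, and the idea of approximating them with a table, can be sketched in floating point as follows. This is only an illustration under assumptions: the step-function threshold convention, the table range and size, and the nearest-entry lookup are choices made here, and the actual unit operates on fixed-point accumulator values rather than Python floats.

```python
import math


def step(x):
    return 1.0 if x >= 0 else 0.0      # threshold convention is an assumption


def rectify(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    return math.log1p(math.exp(x))     # f(x) = ln(1 + e^x)


ACTIVATIONS = {
    "step": step,
    "rectify": rectify,
    "sigmoid": sigmoid,
    "tanh": math.tanh,
    "softplus": softplus,
    "pass_through": lambda x: x,
}

# Hypothetical table parameters for the lookup-based approximation.
TABLE_LO, TABLE_HI, TABLE_ENTRIES = -8.0, 8.0, 1025


def build_table(fn, lo, hi, entries):
    """Sample `fn` at evenly spaced points between lo and hi inclusive."""
    spacing = (hi - lo) / (entries - 1)
    return [fn(lo + i * spacing) for i in range(entries)]


def lookup(table, lo, hi, x):
    """Approximate the tabulated function at x using the nearest entry."""
    x = min(max(x, lo), hi)            # clamp inputs outside the table range
    index = round((x - lo) / (hi - lo) * (len(table) - 1))
    return table[index]


SIGMOID_TABLE = build_table(sigmoid, TABLE_LO, TABLE_HI, TABLE_ENTRIES)
print(lookup(SIGMOID_TABLE, TABLE_LO, TABLE_HI, 0.3), sigmoid(0.3))
```

A table trades a small approximation error for a result that is available without evaluating an exponential, which is consistent with performing the function in a single clock cycle.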

In a preferred embodiment, the width (in bits) of accumulator 202 is greater than the width of the output 133 of activation function unit 212. For example, in one embodiment, the accumulator is 41 bits wide, to avoid loss of precision when accumulating up to 512 32-bit products (as described in more detail in following sections, e.g., with respect to Figure 30), and the result 133 is 16 bits wide. In one embodiment, in subsequent clock cycles, activation function unit 212 passes through other unprocessed portions of the accumulator 202 output 217, which are written back to data random access memory 122 or weight random access memory 124, as described in more detail below with respect to Fig. 8. This allows the unprocessed accumulator 202 values to be loaded back into the media registers 118 via MFNN instructions, so that instructions executing on the other execution units 112 of processor 100 can perform complex activation functions that activation function unit 212 cannot perform, such as the well-known softmax function, also referred to as the normalized exponential function. In one embodiment, the instruction set architecture of processor 100 includes an instruction that performs the exponential function, commonly denoted e^x or exp(x), which the other execution units 112 of processor 100 can use to speed up the softmax activation function.
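
As an illustration of the kind of function that would be computed outside the neural network unit from raw accumulator values, here is a minimal softmax sketch. Subtracting the maximum before exponentiating is a standard numerical-stability technique added here, not something the description states, and the function name and float inputs are assumptions.

```python
import math


def softmax(accumulator_values):
    """Normalized exponential of a list of raw accumulator values."""
    peak = max(accumulator_values)
    # Subtracting the peak keeps exp() from overflowing; it does not change
    # the result because the common factor cancels in the division below.
    exponentials = [math.exp(value - peak) for value in accumulator_values]
    total = sum(exponentials)
    return [e / total for e in exponentials]


probabilities = softmax([1000.0, 1001.0, 1002.0])
print(probabilities)
```

Because every output depends on the sum over all neurons, softmax cannot be evaluated independently per neural processing unit, which is one reason it suits the general-purpose execution units.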

In one embodiment, neural processing unit 126 is pipelined. For example, neural processing unit 126 may include registers within arithmetic logic unit 204, such as a register between the multiplier and the adder and/or between other circuits of arithmetic logic unit 204, and may also include a register that holds the output of activation function unit 212. Other embodiments of neural processing unit 126 are described in following sections.

Fig. 3 is a block diagram showing how the N multiplexed registers 208 of the N neural processing units 126 of the neural network unit 121 of Fig. 1 operate as an N-word rotator, or circular shifter, for a column of data words 207 received from the data random access memory 122 of Fig. 1. In the embodiment of Fig. 3, N is 512; thus, neural network unit 121 has 512 multiplexed registers 208, denoted 0 through 511, corresponding to the 512 neural processing units 126. Each multiplexed register 208 receives its corresponding data word 207 of one of the D columns of data random access memory 122. That is, multiplexed register 0 receives data word 0 of the data random access memory 122 column, multiplexed register 1 receives data word 1, multiplexed register 2 receives data word 2, and so forth, up to multiplexed register 511, which receives data word 511. In addition, multiplexed register 1 receives the output 209 of multiplexed register 0 on its other input 211, multiplexed register 2 receives the output 209 of multiplexed register 1 on its other input 211, multiplexed register 3 receives the output 209 of multiplexed register 2 on its other input 211, and so forth, up to multiplexed register 511, which receives the output 209 of multiplexed register 510 on its other input 211; and multiplexed register 0 receives the output 209 of multiplexed register 511 on its other input 211. Each multiplexed register 208 receives the control input 213, which controls whether it selects the data word 207 or the rotated input 211. In one mode of operation, during a first clock cycle, control input 213 controls each multiplexed register 208 to select the data word 207 for storage in its register and subsequent provision to arithmetic logic unit 204; and during subsequent clock cycles (e.g., the M-1 clock cycles described above), control input 213 controls each multiplexed register 208 to select the rotated input 211 for storage in its register and subsequent provision to arithmetic logic unit 204.
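
The rotator formed by the chain of multiplexed registers 208 can be modelled with a single function that advances all registers by one clock. The sketch below is illustrative: the function name and the boolean standing in for control input 213 are assumptions, and N is 8 rather than 512 to keep the printed output short.

```python
def clock_registers(registers, data_column, select_data):
    """Advance all N multiplexed registers by one clock cycle.

    `select_data` stands in for control input 213: when True each register
    loads its data word 207; when False register J loads the value held by
    register J-1 (input 211), with register 0 wrapping around to N-1.
    """
    if select_data:
        return list(data_column)
    n = len(registers)
    return [registers[(j - 1) % n] for j in range(n)]


N = 8
data_column = list(range(N))

registers = clock_registers([None] * N, data_column, select_data=True)
rotated_once = clock_registers(registers, data_column, select_data=False)
print(rotated_once)

# After N rotations every word has visited every register exactly once.
state = registers
for _ in range(N):
    state = clock_registers(state, data_column, select_data=False)
```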

Although in the embodiments described with respect to Fig. 3 (and Fig. 7 and Figure 19 below) the neural processing units 126 rotate the values of the multiplexed registers 208/705 to the right, i.e., from neural processing unit J toward neural processing unit J+1, the invention is not limited thereto. In other embodiments (such as the embodiments corresponding to Figure 24 to Figure 26), the neural processing units 126 rotate the values of the multiplexed registers 208/705 to the left, i.e., from neural processing unit J toward neural processing unit J-1. Furthermore, in other embodiments of the invention, the neural processing units 126 can selectively rotate the values of the multiplexed registers 208/705 to the left or to the right, e.g., as specified by the neural network unit instructions.

Fig. 4 is a table showing a program that is stored in the program memory 129 of the neural network unit 121 of Fig. 1 and executed by that neural network unit 121. As described above, the example program performs the calculations associated with one layer of an artificial neural network. The table of Fig. 4 shows four columns and three rows. Each column corresponds to an address of program memory 129 shown in the first row. The second row specifies the instruction, and the third row indicates the number of clock cycles associated with the instruction. In a preferred embodiment, the number of clock cycles indicates the effective number of clock cycles per instruction in a pipelined embodiment, rather than the latency of the instruction. As shown, owing to the pipelined nature of neural network unit 121, each instruction has a single associated clock cycle, with the exception of the instruction at address 2, which effectively repeats itself 511 times and therefore requires 511 clock cycles, as described below.

All of the neural processing units 126 process each instruction of the program in parallel. That is, all N neural processing units 126 execute the instruction in the first column in the same clock cycle(s), all N neural processing units 126 execute the instruction in the second column in the same clock cycle(s), and so forth. However, the invention is not limited thereto; in other embodiments described in following sections, some instructions are executed in a partially parallel, partially sequential fashion. For example, as described with respect to the embodiment of Figure 11, in an embodiment in which multiple neural processing units 126 share one activation function unit, the activation function instruction and the output instruction at addresses 3 and 4 are executed in this fashion. The example of Fig. 4 assumes a layer of 512 neurons (neural processing units 126), each having 512 connection inputs from the 512 neurons of the previous layer, for a total of 256K connections. Each neuron receives a 16-bit data value from each connection input and multiplies the 16-bit data value by an appropriate 16-bit weight value.

The first column, at address 0 (although other addresses may be specified), specifies an initialize neural processing unit instruction. The initialize instruction clears the accumulator 202 value to zero. In one embodiment, the initialize instruction can also load accumulator 202 with the corresponding word of a column of data random access memory 122 or weight random access memory 124 specified by the instruction. The initialize instruction also loads configuration values into control register 127, as described in more detail below with respect to Figure 29A and Figure 29B. For example, the widths of the data word 207 and the weight word 206 may be loaded; arithmetic logic unit 204 uses them to determine the size of the operations its circuits perform, and they may also affect the result 215 stored in accumulator 202. In one embodiment, neural processing unit 126 includes a circuit that saturates the output 215 of arithmetic logic unit 204 before it is stored in accumulator 202, and the initialize instruction loads a configuration value into this circuit that affects the saturation. In one embodiment, accumulator 202 can also be cleared to zero by so specifying in an arithmetic logic unit function instruction (e.g., the multiply-accumulate instruction at address 1) or in an output instruction (e.g., the write activation function unit output instruction at address 4).

The second column, at address 1, specifies a multiply-accumulate instruction that instructs the 512 neural processing units 126 to load their corresponding data word from a column of data random access memory 122, to load their corresponding weight word from a column of weight random access memory 124, and to perform a first multiply-accumulate operation on the data word input 207 and the weight word input 206, whose product is accumulated with the initialized accumulator 202 value of zero. More specifically, the instruction instructs sequencer 128 to generate a value on control input 213 that selects the data word input 207. In the example of Fig. 4, the specified column of data random access memory 122 is column 17 and the specified column of weight random access memory 124 is column 0, so sequencer 128 is instructed to output a value of 17 as the data random access memory address 123 and a value of 0 as the weight random access memory address 125. Consequently, the 512 data words from column 17 of data random access memory 122 are provided to the corresponding data inputs 207 of the 512 neural processing units 126, and the 512 weight words from column 0 of weight random access memory 124 are provided to the corresponding weight inputs 206 of the 512 neural processing units 126.

The third column, at address 2, specifies a multiply-accumulate rotate instruction with a count of 511, which instructs each of the 512 neural processing units 126 to perform 511 multiply-accumulate operations. The instruction instructs the 512 neural processing units 126 that the data word 209 input to arithmetic logic unit 204 for each of the 511 multiply-accumulate operations is to be the rotated value 211 from the adjacent neural processing unit 126. That is, the instruction instructs sequencer 128 to generate a value on control input 213 that selects the rotated value 211. In addition, the instruction instructs the 512 neural processing units 126 to load their corresponding weight word for each of the 511 multiply-accumulate operations from the "next" column of weight random access memory 124. That is, the instruction instructs sequencer 128 to increment the weight random access memory address 125 by one relative to its value in the previous clock cycle, which in this example is column 1 in the first clock cycle of the instruction, column 2 in the next clock cycle, column 3 in the clock cycle after that, and so forth, up to column 511 in the 511th clock cycle. In each of the 511 multiply-accumulate operations, the product of the rotated input 211 and the weight word input 206 is accumulated with the previous value of accumulator 202. The 512 neural processing units 126 perform the 511 multiply-accumulate operations in 511 clock cycles, with each neural processing unit 126 performing a multiply-accumulate operation on a different data word from column 17 of data random access memory 122, namely the data word operated on by the adjacent neural processing unit 126 in the previous clock cycle, and on a different weight word associated with that data word, which is conceptually a different connection input to the neuron. This example assumes that each neural processing unit 126 (neuron) has 512 connection inputs, and thus involves 512 data words and 512 weight words. Once the last iteration of the multiply-accumulate rotate instruction at address 2 has been performed, accumulator 202 contains the sum of the products for all 512 connection inputs. In one embodiment, rather than having a separate instruction for each type of arithmetic logic unit operation (e.g., multiply-accumulate, maximum of accumulator and weight, etc., as described above), the instruction set of neural processing unit 126 includes an "execute" instruction that instructs arithmetic logic unit 204 to perform the arithmetic logic unit operation specified by the initialize neural processing unit instruction, such as that specified in the arithmetic logic unit function 2926 of Figure 29A.
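
The instructions at addresses 0 through 2 can be simulated end to end to see why the rotation produces one full dot product per neuron. The sketch below is an illustration under assumptions: it uses N = 8 instead of 512, rotates right (register J takes the value of register J-1), and `weight_layout` is a helper invented here to show one weight arrangement that is consistent with "a different weight word associated with that data word"; the description does not spell out this layout in this passage.

```python
def run_layer(data_column, weight_columns):
    """Simulate the initialize, multiply-accumulate, and rotate instructions."""
    n = len(data_column)
    accumulators = [0] * n                 # address 0: clear accumulators
    registers = list(data_column)          # address 1: select data words 207
    for k in range(n):                     # k = 0 is address 1, k >= 1 is address 2
        if k > 0:
            # Select rotated input 211: NPU j takes the word NPU j-1 held.
            registers = [registers[(j - 1) % n] for j in range(n)]
        weights = weight_columns[k]        # weight address 125 increments by one
        accumulators = [accumulators[j] + registers[j] * weights[j]
                        for j in range(n)]
    return accumulators


def weight_layout(w):
    """Arrange w[j][i] (weight from input i into neuron j) into weight columns.

    After k right rotations NPU j holds data word (j - k) mod n, so weight
    column k must hold, in word j, the weight for that input. Hypothetical
    helper for illustration only.
    """
    n = len(w)
    return [[w[j][(j - k) % n] for j in range(n)] for k in range(n)]


n = 8
x = [i + 1 for i in range(n)]
w = [[(3 * j + 5 * i) % 7 - 3 for i in range(n)] for j in range(n)]

accumulated = run_layer(x, weight_layout(w))
expected = [sum(w[j][i] * x[i] for i in range(n)) for j in range(n)]
print(accumulated == expected)
```

Each neuron sees every data word exactly once over the n clock cycles, so only one column of data is read from memory while the weights stream in one column per clock.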

The fourth row, at address 3, specifies an activation function instruction. This instruction instructs the activation function unit 212 to perform the specified activation function on the accumulator 202 value to generate the result 133. Embodiments of the activation function are described in more detail in later sections.

The fifth row, at address 4, specifies a write activation function unit output instruction, which instructs the 512 neural processing units 126 to write the output of their activation function units 212 back, as results 133, to a row of the data random access memory 122, which is row 16 in this example. That is, the instruction directs the sequencer 128 to output the value 16 as the data random access memory address 123 together with a write command (rather than the read command issued for the multiply-accumulate instruction at address 1). Preferably, because of the pipelined nature of execution, the write activation function unit output instruction can overlap with the execution of other instructions, so that it effectively executes in a single clock cycle.

Preferably, each neural processing unit 126 is configured as a pipeline that includes its various functional elements, such as the multiplexed register 208 (and the multiplexed register 705 of Fig. 7), the arithmetic logic unit 204, the accumulator 202, the activation function unit 212, the multiplexer 802 (see Fig. 8), and the row buffer 1104 and activation function unit 1112 (see Fig. 11); some of these elements may themselves be pipelined. In addition to the data words 207 and weight words 206, the pipeline receives instructions from the program memory 129. The instructions flow down the pipeline and control the various functional units. In an alternate embodiment, the program does not include an activation function instruction. Instead, the initialize neural processing unit instruction specifies the activation function to be performed on the accumulator 202 value 217, and a value indicating the specified activation function is stored in a configuration register for later use by the activation function unit 212 portion of the pipeline, once the final accumulator 202 value 217 has been generated, that is, once the last iteration of the multiply-accumulate rotate instruction at address 2 has completed. Preferably, to save power, the activation function unit 212 portion of the pipeline remains inactive until the write activation function unit output instruction reaches it; at that point the activation function unit 212 is powered up and performs the activation function specified by the initialize instruction on the accumulator 202 output 217.

Fig. 5 is a timing diagram showing the neural network unit 121 executing the program of Fig. 4. Each row of the timing diagram corresponds to a successive clock cycle indicated in the first column. Each of the other columns corresponds to a different one of the 512 neural processing units 126 and indicates its operation. For simplicity, only the operations of neural processing units 0, 1 and 511 are shown.

At clock cycle 0, each of the 512 neural processing units 126 performs the initialization instruction of Fig. 4, which Fig. 5 shows as the assignment of zero to the accumulator 202.

At clock cycle 1, each of the 512 neural processing units 126 performs the multiply-accumulate instruction at address 1 of Fig. 4. As shown, neural processing unit 0 adds to the accumulator 202 value (zero) the product of word 0 of row 17 of the data random access memory 122 and word 0 of row 0 of the weight random access memory 124; neural processing unit 1 adds to the accumulator 202 value (zero) the product of word 1 of row 17 of the data random access memory 122 and word 1 of row 0 of the weight random access memory 124; and so on, until neural processing unit 511 adds to the accumulator 202 value (zero) the product of word 511 of row 17 of the data random access memory 122 and word 511 of row 0 of the weight random access memory 124.

At clock cycle 2, each of the 512 neural processing units 126 performs the first iteration of the multiply-accumulate rotate instruction at address 2 of Fig. 4. As shown, neural processing unit 0 adds to the accumulator 202 value the product of the rotated data word 211 received from the output 209 of the multiplexed register 208 of neural processing unit 511 (that is, data word 511 received from the data random access memory 122) and word 0 of row 1 of the weight random access memory 124; neural processing unit 1 adds to the accumulator 202 value the product of the rotated data word 211 received from the output 209 of the multiplexed register 208 of neural processing unit 0 (that is, data word 0 received from the data random access memory 122) and word 1 of row 1 of the weight random access memory 124; and so on, until neural processing unit 511 adds to the accumulator 202 value the product of the rotated data word 211 received from the output 209 of the multiplexed register 208 of neural processing unit 510 (that is, data word 510 received from the data random access memory 122) and word 511 of row 1 of the weight random access memory 124.

At clock cycle 3, each of the 512 neural processing units 126 performs the second iteration of the multiply-accumulate rotate instruction at address 2 of Fig. 4. As shown, neural processing unit 0 adds to the accumulator 202 value the product of the rotated data word 211 received from the output 209 of the multiplexed register 208 of neural processing unit 511 (that is, data word 510 received from the data random access memory 122) and word 0 of row 2 of the weight random access memory 124; neural processing unit 1 adds to the accumulator 202 value the product of the rotated data word 211 received from the output 209 of the multiplexed register 208 of neural processing unit 0 (that is, data word 511 received from the data random access memory 122) and word 1 of row 2 of the weight random access memory 124; and so on, until neural processing unit 511 adds to the accumulator 202 value the product of the rotated data word 211 received from the output 209 of the multiplexed register 208 of neural processing unit 510 (that is, data word 509 received from the data random access memory 122) and word 511 of row 2 of the weight random access memory 124. As the ellipsis in Fig. 5 indicates, this continues in the same way for each of the following 509 clock cycles, through clock cycle 512.

At clock cycle 512, each of the 512 neural processing units 126 performs the 511th iteration of the multiply-accumulate rotate instruction at address 2 of Fig. 4. As shown, neural processing unit 0 adds to the accumulator 202 value the product of the rotated data word 211 received from the output 209 of the multiplexed register 208 of neural processing unit 511 (that is, data word 1 received from the data random access memory 122) and word 0 of row 511 of the weight random access memory 124; neural processing unit 1 adds to the accumulator 202 value the product of the rotated data word 211 received from the output 209 of the multiplexed register 208 of neural processing unit 0 (that is, data word 2 received from the data random access memory 122) and word 1 of row 511 of the weight random access memory 124; and so on, until neural processing unit 511 adds to the accumulator 202 value the product of the rotated data word 211 received from the output 209 of the multiplexed register 208 of neural processing unit 510 (that is, data word 0 received from the data random access memory 122) and word 511 of row 511 of the weight random access memory 124. In one embodiment, multiple clock cycles are needed to read the data words and weight words from the data random access memory 122 and the weight random access memory 124 in order to perform the multiply-accumulate instruction at address 1 of Fig. 4. However, the data random access memory 122, the weight random access memory 124 and the neural processing units 126 are pipelined, so that once the first multiply-accumulate operation has begun (as shown at clock cycle 1 of Fig. 5), the subsequent multiply-accumulate operations (as shown at clock cycles 2 to 512 of Fig. 5) begin in successive clock cycles. Preferably, the neural processing units 126 may stall briefly in response to an access of the data random access memory 122 and/or the weight random access memory 124 by an architectural instruction, such as an MTNN or MFNN instruction (described below with respect to Fig. 14 and Fig. 15), or by a microinstruction into which such an architectural instruction is translated.

At clock cycle 513, the activation function unit 212 of each of the 512 neural processing units 126 performs the activation function instruction at address 3 of Fig. 4. Finally, at clock cycle 514, each of the 512 neural processing units 126 performs the write activation function unit output instruction at address 4 of Fig. 4 by writing its result 133 back to its corresponding word of row 16 of the data random access memory 122. That is, the result 133 of neural processing unit 0 is written to word 0 of the data random access memory 122, the result 133 of neural processing unit 1 is written to word 1 of the data random access memory 122, and so on, until the result 133 of neural processing unit 511 is written to word 511 of the data random access memory 122. The operation described above with respect to Fig. 5 is shown in block diagram form in Fig. 6A.
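The clock-by-clock walk-through above can be checked with a small software model. The following is only an illustrative sketch, not the hardware implementation: the function names, the use of plain Python integers for words, and the four-unit example size are all assumptions made for brevity. It models the zeroed accumulators (address 0), the initial multiply-accumulate (address 1) and the rotate iterations (address 2), and compares the result with the direct statement that in iteration r, unit j consumes data word (j - r) mod N.

```python
def run_fig4_program(data_row, weight_rows):
    """Model N processing units running the multiply-accumulate/rotate program.

    data_row:    N data words (the data RAM row loaded at address 1).
    weight_rows: N rows of N weight words; weight_rows[r][j] is word j of row r.
    Returns the N accumulator values before any activation function.
    """
    n = len(data_row)
    acc = [0] * n                      # address 0: accumulators start at zero
    regs = list(data_row)              # address 1: unit j latches data word j
    for j in range(n):
        acc[j] += regs[j] * weight_rows[0][j]
    for r in range(1, n):              # address 2: N-1 rotate iterations
        # Unit j receives the word held by unit j-1 (unit 0 from unit N-1).
        regs = [regs[(j - 1) % n] for j in range(n)]
        for j in range(n):
            acc[j] += regs[j] * weight_rows[r][j]
    return acc


def reference(data_row, weight_rows):
    """Direct form: in iteration r, unit j consumes data word (j - r) mod N."""
    n = len(data_row)
    return [sum(data_row[(j - r) % n] * weight_rows[r][j] for r in range(n))
            for j in range(n)]


data = [1, 2, 3, 4]
weights = [[(r + 1) * (j + 2) for j in range(4)] for r in range(4)]
assert run_fig4_program(data, weights) == reference(data, weights)
```

With N = 512 the same index rule gives data word 1 for unit 0 in the last iteration, which matches the description of clock cycle 512.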

Fig. 6A is a block diagram showing the neural network unit 121 of Fig. 1 executing the program of Fig. 4. The neural network unit 121 includes the 512 neural processing units 126, the data random access memory 122, which receives the address input 123, and the weight random access memory 124, which receives the address input 125. At clock cycle 0, the 512 neural processing units 126 perform the initialization instruction; this operation is not shown in the figure. As shown, at clock cycle 1, the 512 16-bit data words of row 17 are read from the data random access memory 122 and provided to the 512 neural processing units 126. During clock cycles 1 to 512, the 512 16-bit weight words of rows 0 to 511, respectively, are read from the weight random access memory 124 and provided to the 512 neural processing units 126. At clock cycle 1, the 512 neural processing units 126 perform their respective multiply-accumulate operations on the loaded data words and weight words; this operation is not shown in the figure. During clock cycles 2 to 512, the multiplexed registers 208 of the 512 neural processing units 126 operate as a rotator of 512 16-bit words, rotating the data words previously loaded from row 17 of the data random access memory 122 to the adjacent neural processing unit 126, and the neural processing units 126 perform the multiply-accumulate operation on the respective rotated data word and the respective weight word loaded from the weight random access memory 124. At clock cycle 513, the 512 activation function units 212 perform the activation function instruction; this operation is not shown in the figure. At clock cycle 514, the 512 neural processing units 126 write their respective 512 16-bit results 133 back to row 16 of the data random access memory 122.

As can be seen, the number of clock cycles needed to generate the result words (neuron outputs) and write them back to the data random access memory 122 or the weight random access memory 124 is approximately the square root of the number of data inputs (connections) received by the current layer of the neural network. For example, if the current layer has 512 neurons and each neuron has 512 connections from the previous layer, the total number of connections is 256K, and the number of clock cycles needed to generate the results for the current layer is slightly more than 512. The neural network unit 121 therefore provides high performance for neural network computations.
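The arithmetic behind this estimate can be spelled out as follows. This is a worked example only; the variable names are illustrative, and the clock count simply tallies the cycles listed for Fig. 5 (cycles 0 through 514).

```python
import math

neurons = 512                 # neurons in the current layer
inputs_per_neuron = 512       # connections per neuron from the previous layer
connections = neurons * inputs_per_neuron

# Clock cycles of the Fig. 4 program as laid out in Fig. 5:
# 1 initialize + 1 multiply-accumulate + 511 rotate iterations
# + 1 activation function + 1 write-back.
clocks = 1 + 1 + (inputs_per_neuron - 1) + 1 + 1

print(connections)               # 262144, i.e. 256K
print(math.isqrt(connections))   # 512
print(clocks)                    # 515 (cycles 0 through 514)
```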

Fig. 6B is a flowchart showing the processor 100 of Fig. 1 executing an architectural program that uses the neural network unit 121 to perform the multiply-accumulate-activation function computations classically associated with the neurons of the hidden layers of an artificial neural network, such as the computations performed by the program of Fig. 4. The example of Fig. 6B assumes four hidden layers (indicated by the initialization of the variable NUM_LAYERS in step 602), each with 512 neurons, each of which is fully connected to the 512 neurons of the previous layer (by means of the program of Fig. 4). It should be understood, however, that these numbers of layers and neurons are chosen for illustration; the neural network unit 121 can perform similar computations for different numbers of hidden layers, for different numbers of neurons per layer, and for neurons that are not fully connected. In one embodiment, the weight values for neurons that do not exist in a layer, or for connections to a neuron that do not exist, are set to zero. Preferably, the architectural program writes a first set of weights to the weight random access memory 124 and starts the neural network unit 121; then, while the neural network unit 121 is performing the computations for the first layer, the architectural program writes a second set of weights to the weight random access memory 124, so that as soon as the neural network unit 121 completes the computations for the first hidden layer it can start the computations for the second layer. In this way the architectural program alternates between two regions of the weight random access memory 124 in order to keep the neural network unit 121 fully utilized. Flow begins at step 602.

In step 602, the processor 100, that is, the architectural program running on the processor 100, writes the input values for the current hidden layer of neurons to the data random access memory 122, namely to row 17 of the data random access memory 122, as described with respect to Fig. 6A. Alternatively, the values may already be in row 17 of the data random access memory 122 as the results 133 of the operation of the neural network unit 121 for a previous layer (for example, a convolution, pooling or input layer). The architectural program also initializes a variable N to the value 1. The variable N denotes the hidden layer currently being processed by the neural network unit 121. In addition, the architectural program initializes the variable NUM_LAYERS to the value 4, since there are four hidden layers in this example. Flow proceeds to step 604.

In step 604, the processor 100 writes the weight words for layer 1 to the weight random access memory 124, for example to rows 0 to 511, as shown in Fig. 6A. Flow proceeds to step 606.

In step 606, the processor 100 writes a multiply-accumulate-activation function program (such as that of Fig. 4) to the program memory 129 of the neural network unit 121, using MTNN instructions 1400 that specify a function 1432 to write the program memory 129. The processor 100 then starts the neural network unit program using an MTNN instruction 1400 that specifies a function 1432 to start execution of the program. Flow proceeds to step 608.

In decision step 608, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, flow proceeds to step 612; otherwise, flow proceeds to step 614.

In step 612, the processor 100 writes the weight words for layer N+1 to the weight random access memory 124, for example to rows 512 to 1023. The architectural program thus writes the weight words for the next layer to the weight random access memory 124 while the neural network unit 121 is performing the hidden layer computations for the current layer, so that the neural network unit 121 can begin the hidden layer computations for the next layer immediately after the computations for the current layer are complete, that is, after the results have been written to the data random access memory 122. Flow proceeds to step 614.

In step 614, the processor 100 determines whether the currently running neural network unit program (started in step 606 for layer 1, and in step 618 for layers 2 to 4) has completed. Preferably, the processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of the neural network unit 121. In an alternate embodiment, the neural network unit 121 generates an interrupt to indicate that it has completed the multiply-accumulate-activation function layer program. Flow proceeds to decision step 616.

In decision step 616, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, flow proceeds to step 618; otherwise, flow proceeds to step 622.

In step 618, the processor 100 updates the multiply-accumulate-activation function program so that it can perform the hidden layer computations for layer N+1. More specifically, the processor 100 updates the data random access memory 122 row value of the multiply-accumulate instruction at address 1 of Fig. 4 to the row of the data random access memory 122 to which the previous layer wrote its results (for example, to row 16) and also updates the output row (for example, to row 15). The processor 100 then starts the updated neural network unit program. In an alternate embodiment, the program of Fig. 4 specifies, in the output instruction at address 4, the same row as the one specified in the multiply-accumulate instruction at address 1 (that is, the row read from the data random access memory 122). In this embodiment, the current row of input data words is overwritten, which is acceptable as long as the row of data words is not needed for some other purpose, because the row has already been read into the multiplexed registers 208 and is being rotated among the neural processing units 126 by the N-word rotator. In this case the neural network unit program need not be updated in step 618, only restarted. Flow proceeds to step 622.

In step 622, the processor 100 reads the results of the neural network unit program for layer N from the data random access memory 122. However, if the results are only to be used by the next layer, the architectural program need not read them from the data random access memory 122 and can instead leave them there for the next hidden layer computation. Flow proceeds to decision step 624.

In decision step 624, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, flow proceeds to step 626; otherwise, flow ends.

In step 626, the architectural program increments the value of N by one. Flow returns to decision step 608.
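The control flow of steps 602 to 626 can be summarized as a short sketch. Everything below is hypothetical scaffolding for illustration: `MockNNU` and its methods only record the requests the architectural program would make (they do not correspond to MTNN/MFNN encodings), and `weight_region` assumes that layers simply alternate between rows 0 to 511 and rows 512 to 1023, which the text gives only as an example.

```python
NUM_LAYERS = 4
ROWS_PER_LAYER = 512


class MockNNU:
    """Hypothetical stand-in that only logs what the architectural program requests."""

    def __init__(self):
        self.log = []

    def write_data_ram(self, row):
        self.log.append(f"write inputs to data RAM row {row}")

    def write_weights(self, layer, first_row):
        last_row = first_row + ROWS_PER_LAYER - 1
        self.log.append(f"write layer {layer} weights to rows {first_row}-{last_row}")

    def start_program(self, layer):
        self.log.append(f"start program for layer {layer}")

    def wait_done(self, layer):
        self.log.append(f"layer {layer} done")

    def read_results(self, layer):
        self.log.append(f"read layer {layer} results")


def weight_region(layer):
    # Assumption: odd layers use rows 0-511, even layers use rows 512-1023.
    return ((layer - 1) % 2) * ROWS_PER_LAYER


def run_hidden_layers(nnu):
    nnu.write_data_ram(17)                       # step 602
    n = 1
    nnu.write_weights(1, weight_region(1))       # step 604
    nnu.start_program(1)                         # step 606
    while True:
        if n < NUM_LAYERS:                       # decision step 608
            nnu.write_weights(n + 1, weight_region(n + 1))   # step 612
        nnu.wait_done(n)                         # step 614
        if n < NUM_LAYERS:                       # decision step 616
            nnu.start_program(n + 1)             # step 618 (update, then start)
        nnu.read_results(n)                      # step 622
        if n < NUM_LAYERS:                       # decision step 624
            n += 1                               # step 626
        else:
            return nnu.log


for entry in run_hidden_layers(MockNNU()):
    print(entry)
```

The log makes the overlap visible: the weights for layer N+1 are written before the wait for layer N completes, which is the point of alternating between two weight regions.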

As the example of Fig. 6B shows, approximately every 512 clock cycles the neural processing units 126 read the data random access memory 122 once and write it once (by virtue of the operation of the neural network unit program of Fig. 4). In addition, the neural processing units 126 read the weight random access memory 124 approximately every clock cycle to read a row of weight words. The entire bandwidth of the weight random access memory 124 is therefore consumed by the hybrid manner in which the neural network unit 121 performs the hidden layer operation. Furthermore, assuming an embodiment that includes a write and read buffer, such as the buffer 1704 of Fig. 17, the processor 100 writes the weight random access memory 124 concurrently with the reads by the neural processing units 126, such that the buffer 1704 performs one write to the weight random access memory 124 approximately every 16 clock cycles to write the weight words. Thus, in an embodiment in which the weight random access memory 124 is single-ported (as described with respect to Fig. 17), approximately every 16 clock cycles the neural processing units 126 must be stalled from reading the weight random access memory 124 so that the buffer 1704 can write to it. In an embodiment in which the weight random access memory 124 is dual-ported, however, the neural processing units 126 need not be stalled.

Fig. 7 is a block diagram showing an alternate embodiment of the neural processing unit 126 of Fig. 1. The neural processing unit 126 of Fig. 7 is similar to the neural processing unit 126 of Fig. 2, but additionally includes a second two-input multiplexed register 705. The multiplexed register 705 selects one of its inputs 206 or 711 to store in its register and then provides it on its output 203 in a subsequent clock cycle. Input 206 receives the weight word from the weight random access memory 124. The other input 711 receives the output 203 of the second multiplexed register 705 of the adjacent neural processing unit 126. Preferably, the input 711 of neural processing unit J receives the output 203 of the multiplexed register 705 of the neural processing unit 126 at position J-1, and the output 203 of neural processing unit J is provided to the input 711 of the multiplexed register 705 of the neural processing unit 126 at position J+1. In this way the multiplexed registers 705 of the N neural processing units 126 operate together as an N-word rotator, in a manner similar to that described above with respect to Fig. 3, but for weight words rather than data words. The control input 213 controls which of the two inputs the multiplexed register 705 selects to store in its register and subsequently provide on the output 203.

The multiplexed registers 208 and/or the multiplexed registers 705 (as well as the multiplexed registers of other embodiments, such as those shown in Fig. 18 and Fig. 23) effectively form a large rotator that rotates a row of data/weights received from the data random access memory 122 and/or the weight random access memory 124. As a result, the neural network unit 121 does not need a very large multiplexer between the data random access memory 122 and/or the weight random access memory 124 and the processing array in order to provide the necessary data/weight words to the appropriate neural network unit.

Writing back accumulator values in addition to activation function results

For certain applications it is useful for the processor 100 to receive back (for example, into the media registers 118 through the MFNN instruction of Fig. 15) the raw accumulator 202 value 217, on which instructions executing on the other execution units 112 can then perform computations. For example, in one embodiment the activation function unit 212 is not configured to perform a softmax activation function, in order to reduce the complexity of the activation function unit 212. Instead, the neural network unit 121 can output the raw accumulator 202 value 217, or a subset of it, to the data random access memory 122 or the weight random access memory 124, from which the architectural program subsequently reads it and performs computations on the raw value. Use of the raw accumulator 202 value 217 is not limited to performing softmax, however; other uses are also contemplated by the present invention.

Fig. 8 is a block diagram showing a further embodiment of the neural processing unit 126 of Fig. 1. The neural processing unit 126 of Fig. 8 is similar to the neural processing unit 126 of Fig. 2, but includes a multiplexer 802 in the activation function unit 212, and the activation function unit 212 has a control input 803. The width (in bits) of the accumulator 202 is greater than the width of a data word. The multiplexer 802 has multiple inputs that receive data-word-wide portions of the accumulator 202 output 217. In one embodiment, the accumulator 202 is 41 bits wide and the neural processing unit 126 outputs a 16-bit result word 133; thus, for example, the multiplexer 802 (or the multiplexer 3032 and/or the multiplexer 3037 of Fig. 30) has three inputs that receive bits [15:0], bits [31:16] and bits [47:32], respectively, of the accumulator 202 output 217. Preferably, output bits that are not provided by the accumulator 202 (for example, bits [47:41]) are forced to zero.

The sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the words (for example, 16 bits) of the accumulator 202 in response to a write accumulator instruction, such as the write accumulator instructions at addresses 3 to 5 of Fig. 9 described below. Preferably, the multiplexer 802 also has one or more inputs that receive the outputs of activation function circuits (for example, elements 3022, 3024, 3026, 3018, 3014 and 3016 of Fig. 30), each of which generates an output as wide as a data word. The sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of these activation function circuit outputs, rather than one of the words of the accumulator 202, in response to an instruction such as the write activation function unit output instruction at address 4 of Fig. 4.

Fig. 9 is a table showing a program stored in the program memory 129 of the neural network unit 121 of Fig. 1 and executed by that neural network unit 121. The example program of Fig. 9 is similar to the program of Fig. 4. In particular, the instructions at addresses 0 to 2 are identical. However, the instructions at addresses 3 and 4 of Fig. 4 are replaced in Fig. 9 by write accumulator instructions, which instruct the 512 neural processing units 126 to write their accumulator 202 output 217 back, as results 133, to three rows of the data random access memory 122, which are rows 16 to 18 in this example. That is, the write accumulator instruction directs the sequencer 128 to output a data random access memory address 123 value of 16 and a write command in a first clock cycle, a data random access memory address 123 value of 17 and a write command in a second clock cycle, and a data random access memory address 123 value of 18 and a write command in a third clock cycle. Preferably, the execution of the write accumulator instruction can overlap with that of other instructions, so that the write accumulator instruction effectively executes in these three clock cycles, one row of the data random access memory 122 being written in each clock cycle. In one embodiment, the user specifies the values of the activation function 2934 and output command 2956 fields of the control register 127 (of Fig. 29A) to write the desired portions of the accumulator 202 to the data random access memory 122 or the weight random access memory 124. Alternatively, rather than writing back the entire contents of the accumulator 202, the write accumulator instruction may optionally write back a subset of the accumulator 202. In one embodiment, a canonical form of the accumulator 202 may be written back. This is described in more detail below in the sections corresponding to Fig. 29 to Fig. 31.

Fig. 10 is a timing diagram showing the neural network unit 121 executing the program of Fig. 9. The timing diagram of Fig. 10 is similar to that of Fig. 5, and clock cycles 0 to 512 are identical. However, at clock cycles 513 to 515, the activation function unit 212 of each of the 512 neural processing units 126 performs one of the write accumulator instructions at addresses 3 to 5 of Fig. 9. Specifically, at clock cycle 513, each of the 512 neural processing units 126 writes bits [15:0] of the accumulator 202 output 217 back, as its result 133, to its corresponding word of row 16 of the data random access memory 122; at clock cycle 514, each of the 512 neural processing units 126 writes bits [31:16] of the accumulator 202 output 217 back, as its result 133, to its corresponding word of row 17 of the data random access memory 122; and at clock cycle 515, each of the 512 neural processing units 126 writes bits [40:32] of the accumulator 202 output 217 back, as its result 133, to its corresponding word of row 18 of the data random access memory 122. Preferably, bits [47:41] are forced to zero.
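The three-way selection of accumulator bits can be illustrated with a few lines of bit arithmetic. This is a sketch under stated assumptions: the accumulator is treated as a raw 41-bit pattern held in a Python integer, and the function name `accumulator_words` is hypothetical rather than anything defined by the embodiment.

```python
ACC_BITS = 41     # accumulator width in the embodiment described
WORD_BITS = 16    # width of a result word


def accumulator_words(acc_value):
    """Split a 41-bit accumulator value into three 16-bit result words.

    Returns (bits [15:0], bits [31:16], bits [47:32]); bits [47:41] are not
    provided by the accumulator and therefore read as zero.
    """
    raw = acc_value & ((1 << ACC_BITS) - 1)        # keep the 41 accumulator bits
    mask = (1 << WORD_BITS) - 1
    return tuple((raw >> shift) & mask for shift in (0, 16, 32))


low, mid, high = accumulator_words(0x1AB_CDEF_0123)
print(hex(low), hex(mid), hex(high))   # 0x123 0xcdef 0x1ab
```

In the program of Fig. 9 these three words would land in rows 16, 17 and 18 respectively, one per clock cycle.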

Shared activation function unit

Fig. 11 is a block diagram showing an embodiment of the neural network unit 121 of Fig. 1. In the embodiment of Fig. 11, a neuron is split into two parts, an activation function unit part and an arithmetic logic unit part (which also includes the shift register part), and each activation function unit part is shared by multiple arithmetic logic unit parts. In Fig. 11, the arithmetic logic unit parts are referred to as neural processing units 126 and the shared activation function unit parts are referred to as activation function units 1112. By contrast, in an embodiment such as that of Fig. 2, each neuron includes its own activation function unit 212. Accordingly, in one example of the Fig. 11 embodiment, a neural processing unit 126 (arithmetic logic unit part) includes the accumulator 202, arithmetic logic unit 204, multiplexed register 208 and register 205 of Fig. 2, but not the activation function unit 212. In the embodiment of Fig. 11, the neural network unit 121 includes 512 neural processing units 126, although the present invention is not limited to this number. In the example of Fig. 11, the 512 neural processing units 126 are divided into 64 groups of eight neural processing units 126 each, denoted groups 0 to 63 in Fig. 11.

The neural network unit 121 further includes a column buffer 1104 and multiple shared run function units 1112, which are coupled between the neural processing units 126 and the column buffer 1104. The width of the column buffer 1104 (in bits) is the same as that of a column of the data random access memory 122 or the weight random access memory 124, for example 512 texts. Each group of neural processing units 126 has one run function unit 1112, that is, each run function unit 1112 corresponds to one group of neural processing units 126; thus, in the embodiment of Figure 11 there are 64 run function units 1112 corresponding to the 64 groups of neural processing units 126. The eight neural processing units 126 of a group share the run function unit 1112 corresponding to that group. The invention is also applicable to embodiments with a different number of run function units and a different number of neural processing units in each group. For example, the invention is also applicable to embodiments in which two, four or sixteen neural processing units 126 in each group share the same run function unit 1112.

Sharing the run function units 1112 helps to reduce the size of the neural network unit 121, at some cost in efficiency. That is, depending on the sharing ratio, additional time-frequency periods may be needed to generate the results 133 of the entire array of neural processing units 126; for example, with the 8:1 sharing ratio shown in Figure 12 below, seven additional time-frequency periods are needed. In general, however, the number of additional time-frequency periods (for example, 7) is quite small compared with the number of time-frequency periods needed to generate the accumulated sums (for example, 512 time-frequency periods for a layer in which each neuron has 512 connections). Therefore, sharing the run function units has a very small effect on efficiency (for example, it increases computation time by about one percent), which can be a worthwhile cost for reducing the size of the neural network unit 121.
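
The cost estimate above can be checked with a short calculation. The sketch below assumes that an n:1 sharing ratio always costs n - 1 additional time-frequency periods, which generalizes the single 8:1 case given in the text; the function names are hypothetical.

```python
def extra_periods(sharing_ratio: int) -> int:
    """Additional time-frequency periods when `sharing_ratio` neural
    processing units share one run function unit (assumed to be n - 1)."""
    if sharing_ratio < 1:
        raise ValueError("sharing ratio must be at least 1")
    return sharing_ratio - 1


def relative_overhead(sharing_ratio: int, accumulate_periods: int) -> float:
    """Extra periods as a fraction of the periods spent accumulating."""
    return extra_periods(sharing_ratio) / accumulate_periods


if __name__ == "__main__":
    # 8:1 sharing and a layer with 512 connections per neuron
    print(f"{relative_overhead(8, 512):.2%}")
```

For the 8:1 case, seven extra periods over 512 accumulation periods is 7/512, a little over one percent, which is consistent with the "about one percent" figure in the text.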

In one embodiment, each neural processing unit 126 includes a run function unit 212 that executes relatively simple run functions; these simple run function units 212 are small enough to be included in each neural processing unit 126. By contrast, the shared complex run function units 1112 execute relatively complex run functions, and their size may be significantly larger than that of the simple run function units 212. In this embodiment, the additional time-frequency periods are needed only when the specified run function is a complex one that must be executed by a shared complex run function unit 1112; when the specified run function can be executed by the simple run function unit 212, the additional time-frequency periods are not needed.

Figure 12 and Figure 13 are timing diagrams showing the neural network unit 121 of Figure 11 executing the program of Fig. 4. The timing diagram of Figure 12 is similar to the timing diagram of Fig. 5, and time-frequency periods 0 to 512 are the same in both. However, the operation at time-frequency period 513 differs, because the neural processing units 126 of Figure 11 share the run function units 1112; that is, the neural processing units 126 of a group share the run function unit 1112 associated with that group, and Figure 11 shows this sharing architecture.

Each column of the timing diagram of Figure 13 corresponds to one of the consecutive time-frequency periods shown in the first row. The other rows correspond to different ones of the 64 run function units 1112 and indicate their operation. Only the operations of run function units 0, 1 and 63 are shown, to simplify the explanation. The time-frequency periods of Figure 13 correspond to those of Figure 12, but the sharing of the run function units 1112 by the neural processing units 126 is shown in a different manner. As shown in Figure 13, in time-frequency periods 0 to 512 the 64 run function units 1112 are inactive while the neural processing units 126 execute the initialize neural processing unit instruction, the multiply-accumulate instruction and the multiply-accumulate rotate instruction.

As shown in Figure 12 and Figure 13, in time-frequency period 513, run function unit 0 (the run function unit 1112 associated with group 0) begins to execute the specified run function on the accumulator 202 value 217 of neural processing unit 0, which is the first neural processing unit 126 in group 0, and the output of the run function unit 1112 will be stored to text 0 of the column buffer 1104. Also in time-frequency period 513, each run function unit 1112 begins to execute the specified run function on the accumulator 202 value 217 of the first neural processing unit 126 in its corresponding group of neural processing units 126. Thus, as shown in Figure 13, in time-frequency period 513, run function unit 0 begins to execute the specified run function on the accumulator 202 of neural processing unit 0 to generate the result that will be stored to text 0 of the column buffer 1104; run function unit 1 begins to execute the specified run function on the accumulator 202 of neural processing unit 8 to generate the result that will be stored to text 8 of the column buffer 1104; and so on, until run function unit 63 begins to execute the specified run function on the accumulator 202 of neural processing unit 504 to generate the result that will be stored to text 504 of the column buffer 1104.

In time-frequency period 514, run function unit 0 (the run function unit 1112 associated with group 0) begins to execute the specified run function on the accumulator 202 value 217 of neural processing unit 1, which is the second neural processing unit 126 in group 0, and the output of the run function unit 1112 will be stored to text 1 of the column buffer 1104. Also in time-frequency period 514, each run function unit 1112 begins to execute the specified run function on the accumulator 202 value 217 of the second neural processing unit 126 in its corresponding group of neural processing units 126. Thus, as shown in Figure 13, in time-frequency period 514, run function unit 0 begins to execute the specified run function on the accumulator 202 of neural processing unit 1 to generate the result that will be stored to text 1 of the column buffer 1104; run function unit 1 begins to execute the specified run function on the accumulator 202 of neural processing unit 9 to generate the result that will be stored to text 9 of the column buffer 1104; and so on, until run function unit 63 begins to execute the specified run function on the accumulator 202 of neural processing unit 505 to generate the result that will be stored to text 505 of the column buffer 1104. This processing continues until time-frequency period 520, in which run function unit 0 (the run function unit 1112 associated with group 0) begins to execute the specified run function on the accumulator 202 value 217 of neural processing unit 7, which is the eighth (and last) neural processing unit 126 in group 0, and the output of the run function unit 1112 will be stored to text 7 of the column buffer 1104. Also in time-frequency period 520, each run function unit 1112 begins to execute the specified run function on the accumulator 202 value 217 of the eighth neural processing unit 126 in its corresponding group of neural processing units 126. Thus, as shown in Figure 13, in time-frequency period 520, run function unit 0 begins to execute the specified run function on the accumulator 202 of neural processing unit 7 to generate the result that will be stored to text 7 of the column buffer 1104; run function unit 1 begins to execute the specified run function on the accumulator 202 of neural processing unit 15 to generate the result that will be stored to text 15 of the column buffer 1104; and so on, until run function unit 63 begins to execute the specified run function on the accumulator 202 of neural processing unit 511 to generate the result that will be stored to text 511 of the column buffer 1104.

In time-frequency period 521, once all 512 results of the 512 neural processing units 126 have been generated and written to the column buffer 1104, the column buffer 1104 begins to write its contents to the data random access memory 122 or the weight random access memory 124. In this way, the run function unit 1112 of each group of neural processing units 126 executes a portion of the run function instruction at address 3 of Fig. 4.
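
The schedule of time-frequency periods 513 to 520 follows a simple index rule, sketched below in Python. The constants come from the Figure 11 example (64 groups of eight neural processing units); the function names are hypothetical and the code models only the ordering, not the run function itself.

```python
GROUP_SIZE = 8       # neural processing units sharing one run function unit
NUM_GROUPS = 64      # run function units 0 to 63
FIRST_PERIOD = 513   # time-frequency period in which the run functions start


def scheduled_npu(run_function_unit: int, period: int) -> int:
    """Index of the neural processing unit whose accumulator the given run
    function unit starts processing in `period`. The same index names the
    text of the column buffer 1104 that receives the result."""
    step = period - FIRST_PERIOD
    if not (0 <= run_function_unit < NUM_GROUPS and 0 <= step < GROUP_SIZE):
        raise ValueError("run function unit is not scheduled in this period")
    return run_function_unit * GROUP_SIZE + step


def fill_order() -> list:
    """Column buffer texts in the order they are started, period by period."""
    return [scheduled_npu(unit, FIRST_PERIOD + step)
            for step in range(GROUP_SIZE)
            for unit in range(NUM_GROUPS)]
```

Every text of the column buffer is produced exactly once over the eight periods, which is why the write-back to memory can begin in time-frequency period 521.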

Embodiments such as that of Figure 11, in which the run function units 1112 are shared among groups of arithmetic logic units 204, are particularly advantageous when used with integer arithmetic logic units 204, as described in later sections, for example with respect to Figure 29A to Figure 33.

MTNN and MFNN framework instructions

Figure 14 is a block schematic diagram showing a move to neural network (MTNN) framework instruction 1400 and its operation with respect to portions of the neural network unit 121 of Fig. 1. The MTNN instruction 1400 includes an execute code field 1402, a src1 field 1404, a src2 field 1406, a gpr field 1408 and an immediate field 1412. The MTNN instruction is a framework instruction, that is, it is included in the instruction set architecture of the processor 100. In a preferred embodiment, the instruction set architecture uses a predetermined value of the execute code field 1402 to distinguish the MTNN instruction 1400 from other instructions in the instruction set architecture. The execute code 1402 of the MTNN instruction 1400 may or may not include a prefix, such as is common in the x86 architecture.

The immediate field 1412 provides a value that specifies a function 1432 to the control logic 1434 of the neural network unit 121. In a preferred embodiment, the function 1432 is provided as an immediate operand of a microinstruction 105 of Fig. 1. The functions 1432 that can be executed by the neural network unit 121 include, but are not limited to, writing to the data random access memory 122, writing to the weight random access memory 124, writing to the program memory 129, writing to the control buffer 127, starting execution of a program in the program memory 129, pausing execution of a program in the program memory 129, requesting notification (such as an interrupt) of completion of execution of a program in the program memory 129, and resetting the neural network unit 121. In a preferred embodiment, the neural network unit instruction set includes an instruction whose result indicates that the neural network unit program is complete. Alternatively, the neural network unit instruction set includes an explicit generate-interrupt instruction. In a preferred embodiment, resetting the neural network unit 121 effectively forces the neural network unit 121 back to a reset state (for example, internal state machines are cleared and set to an idle state), except that the contents of the data random access memory 122, the weight random access memory 124 and the program memory 129 remain intact. In addition, internal buffers such as the accumulator 202 are not affected by the reset function and must be cleared explicitly, for example by the initialize neural processing unit instruction at address 0 of Fig. 4. In one embodiment, the function 1432 may include a direct execution function, in which the first source buffer contains a micro-operation (see, for example, the micro-operation 3418 of Figure 34). The direct execution function instructs the neural network unit 121 to directly execute the specified micro-operation. In this way, a framework program can directly control the neural network unit 121 to perform operations, rather than writing instructions to the program memory 129 and subsequently instructing the neural network unit 121 to execute the instructions in the program memory 129, or by executing an MTNN instruction 1400 (or the MFNN instruction 1500 of Figure 15). Figure 14 shows an example of the function of writing to the data random access memory 122.

The gpr field 1408 specifies a general caching device in the general caching device archives 116. In one embodiment, each general caching device is 64 bits. The general caching device archives 116 provides the value of the selected general caching device to the neural network unit 121, as shown, and the neural network unit 121 uses this value as an address 1422. The address 1422 selects a column of the memory specified in the function 1432. In the case of the data random access memory 122 or the weight random access memory 124, the address 1422 additionally selects, within the selected column, a data block whose size is twice that of a media cache (for example, 512 bits). In a preferred embodiment, this position lies on a 512-bit boundary. In one embodiment, a multiplexer selects either the address 1422 (or the address 1522 in the case of the MFNN instruction 1500 described below) or the address 123/125/131 from the sequencer 128 to provide to the data random access memory 122/weight random access memory 124/program memory 129. In one embodiment, the data random access memory 122 is dual-ported, which enables the media caches 118 to read/write the data random access memory 122 while the neural processing units 126 concurrently read/write it. In one embodiment, the weight random access memory 124 is also dual-ported for a similar purpose.
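
The block selection within a column can be illustrated as follows. The sketch assumes the 8192-bit column width of the Figure 16 embodiment and takes the block index as a separate argument, because the text does not state how the column and block are encoded inside the address 1422. The helper names are hypothetical.

```python
ROW_BITS = 8192  # width of one column in the embodiment of Figure 16


def blocks_per_row(block_bits: int) -> int:
    """Number of naturally aligned data blocks in one column."""
    if block_bits <= 0 or ROW_BITS % block_bits:
        raise ValueError("block size must divide the column width")
    return ROW_BITS // block_bits


def block_bit_range(block_index: int, block_bits: int) -> tuple:
    """Inclusive bit range covered by an aligned block within a column."""
    if not 0 <= block_index < blocks_per_row(block_bits):
        raise ValueError("block index out of range")
    low = block_index * block_bits
    return low, low + block_bits - 1
```

With 512-bit blocks (an MTNN write of two media caches) a column holds 16 blocks; with 256-bit blocks (one media cache, as used by the MFNN instruction 1500) it holds 32.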

The src1 field 1404 and the src2 field 1406 each specify a media cache in the media cache archives 118. In one embodiment, each media cache 118 is 256 bits. The media cache archives 118 provides the concatenated data (for example, 512 bits) from the selected media caches to the data random access memory 122 (or the weight random access memory 124 or the program memory 129), to be written to the selected column 1428 specified by the address 1422, at the position within the selected column 1428 specified by the address 1422, as shown. Through the execution of a series of MTNN instructions 1400 (and MFNN instructions 1500, described below), a framework program executing on the processor 100 can fill columns of the data random access memory 122 and columns of the weight random access memory 124 and write a program to the program memory 129, such as the programs described herein (for example, the programs of Fig. 4 and Fig. 9), so that the neural network unit 121 performs operations on the data and weights at very high speed to accomplish the artificial neural network. In one embodiment, the framework program directly controls the neural network unit 121 rather than writing a program to the program memory 129.

In one embodiment, rather than specifying two source buffers (as specified by fields 1404 and 1406), the MTNN instruction 1400 specifies a starting source buffer and a number of source buffers, Q. This form of the MTNN instruction 1400 instructs the processor 100 to write the media cache 118 specified as the starting source buffer, together with the next Q-1 sequential media caches 118, to the neural network unit 121, that is, to the specified data random access memory 122 or weight random access memory 124. In a preferred embodiment, the instruction translator 104 translates the MTNN instruction 1400 into as many microinstructions as are needed to write all Q specified media caches 118. For example, in one embodiment, when the MTNN instruction 1400 specifies buffer MR4 as the starting source buffer and Q is 8, the instruction translator 104 translates the MTNN instruction 1400 into four microinstructions, of which the first writes buffers MR4 and MR5, the second writes buffers MR6 and MR7, the third writes buffers MR8 and MR9, and the fourth writes buffers MR10 and MR11. In another embodiment, the data path from the media caches 118 to the neural network unit 121 is 1024 bits rather than 512 bits; in this case, the instruction translator 104 translates the MTNN instruction 1400 into two microinstructions, of which the first writes buffers MR4 to MR7 and the second writes buffers MR8 to MR11. The invention is also applicable to embodiments in which the MFNN instruction 1500 specifies a starting destination buffer and a number of destination buffers, so that each MFNN instruction 1500 can read, from a column of the data random access memory 122 or the weight random access memory 124, a data block larger than a single media cache 118.
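
The translation of this form of the MTNN instruction into microinstructions can be modeled as a simple chunking of the source buffers. The sketch below is illustrative only; `translate_mtnn` is a hypothetical name, and the real instruction translator 104 is hardware, not software.

```python
MEDIA_CACHE_BITS = 256  # width of one media cache 118 in this embodiment


def translate_mtnn(start: int, q: int, datapath_bits: int = 512) -> list:
    """Return one list of media cache names per microinstruction.

    `start` is the index of the starting source buffer (4 for MR4) and `q`
    is the number of sequential source buffers to write.
    """
    if q < 1:
        raise ValueError("Q must be at least 1")
    if datapath_bits % MEDIA_CACHE_BITS:
        raise ValueError("data path must be a multiple of the media cache width")
    per_microinstruction = datapath_bits // MEDIA_CACHE_BITS
    names = [f"MR{start + i}" for i in range(q)]
    return [names[i:i + per_microinstruction]
            for i in range(0, q, per_microinstruction)]


if __name__ == "__main__":
    for micro in translate_mtnn(4, 8):
        print(micro)
```

Calling `translate_mtnn(4, 8, datapath_bits=1024)` models the wider data path, where each microinstruction carries four media caches instead of two.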

Figure 15 is a block schematic diagram showing a move from neural network (MFNN) framework instruction 1500 and its operation with respect to portions of the neural network unit 121 of Fig. 1. The MFNN instruction 1500 includes an execute code field 1502, a dst field 1504, a gpr field 1508 and an immediate field 1512. The MFNN instruction is a framework instruction, that is, it is included in the instruction set architecture of the processor 100. In a preferred embodiment, the instruction set architecture uses a predetermined value of the execute code field 1502 to distinguish the MFNN instruction 1500 from other instructions in the instruction set architecture. The execute code 1502 of the MFNN instruction 1500 may or may not include a prefix, such as is common in the x86 architecture.

The immediate field 1512 provides a value that specifies a function 1532 to the control logic 1434 of the neural network unit 121. In a preferred embodiment, the function 1532 is provided as an immediate operand of a microinstruction 105 of Fig. 1. The functions 1532 that can be executed by the neural network unit 121 include, but are not limited to, reading the data random access memory 122, reading the weight random access memory 124, reading the program memory 129 and reading the state buffer 127. The example of Figure 15 shows the function 1532 of reading the data random access memory 122.

The gpr field 1508 specifies a general caching device in the general caching device archives 116. The general caching device archives 116 provides the value of the selected general caching device to the neural network unit 121, as shown, and the neural network unit 121 uses this value as an address 1522, which operates in a manner similar to the address 1422 of Figure 14 to select a column of the memory specified in the function 1532. In the case of the data random access memory 122 or the weight random access memory 124, the address 1522 additionally selects, within the selected column, a data block whose size is that of a media cache (for example, 256 bits). In a preferred embodiment, this position lies on a 256-bit boundary.

The dst field 1504 specifies a media cache in the media cache archives 118. As shown, the media cache archives 118 receives data (for example, 256 bits) from the data random access memory 122 (or the weight random access memory 124 or the program memory 129) into the selected media cache; the data is read from the selected column 1528 specified by the address 1522, at the position within the selected column 1528 specified by the address 1522.

Port configuration of the internal random access memories of the neural network unit

Figure 16 is a block schematic diagram showing an embodiment of the data random access memory 122 of Fig. 1. The data random access memory 122 includes a memory array 1606, a read port 1602 and a write port 1604. The memory array 1606 holds the data texts, which in a preferred embodiment are arranged as an array of D columns of N texts, as described above. In one embodiment, the memory array 1606 comprises an array of 64 horizontally arranged static random access memory cells, each of which is 128 bits wide and 64 high, which provides a 64KB data random access memory 122 that is 8192 bits wide and has 64 columns, and the die area occupied by the data random access memory 122 is approximately 0.2 square millimeters. However, the present invention is not limited thereto.

In a preferred embodiment, the read port 1602 is coupled, in a multiplexed manner, to the neural processing units 126 and to the media caches 118. (More precisely, the media caches 118 may be coupled to the read port through a result bus, which also provides data to the reorder buffer and/or to a result forwarding bus that feeds the other execution units 112.) The neural processing units 126 and the media caches 118 share the read port 1602 to read the data random access memory 122. Likewise, in a preferred embodiment, the write port 1604 is coupled, in a multiplexed manner, to the neural processing units 126 and to the media caches 118. The neural processing units 126 and the media caches 118 share the write port 1604 to write the data random access memory 122. In this way, the media caches 118 can write to the data random access memory 122 while the neural processing units 126 are reading from it, and the neural processing units 126 can write to the data random access memory 122 while the media caches 118 are reading from it. This can improve efficiency. For example, the neural processing units 126 can read the data random access memory 122 (for example, to continue performing calculations) while the media caches 118 write more data texts to the data random access memory 122. As another example, the neural processing units 126 can write calculation results to the data random access memory 122 while the media caches 118 read calculation results from the data random access memory 122. In one embodiment, the neural processing units 126 can write a column of calculation results to the data random access memory 122 while also reading a column of data texts from the data random access memory 122. In one embodiment, the memory array 1606 is configured in memory blocks (banks). When the neural processing units 126 access the data random access memory 122, all of the memory blocks are activated to access a complete column of the memory array 1606; when the media caches 118 access the data random access memory 122, however, only the specified memory blocks are activated. In one embodiment, each memory block is 128 bits wide and the media caches 118 are 256 bits wide, so that, for example, two memory blocks are activated on each media cache 118 access. In one embodiment, one of the ports 1602/1604 is a read/write port. In one embodiment, both of the ports 1602/1604 are read/write ports.
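
The selective activation of memory blocks can be sketched as follows. The model assumes the 8192-bit column and 128-bit memory blocks given above, which implies 64 memory blocks per column, and it assumes that block k covers bits 128k to 128k + 127. The function name `active_banks` is hypothetical.

```python
ROW_BITS = 8192                      # width of one column of memory array 1606
BANK_BITS = 128                      # width of one memory block (bank)
NUM_BANKS = ROW_BITS // BANK_BITS    # 64 memory blocks per column


def active_banks(bit_offset: int, access_bits: int) -> list:
    """Memory blocks that must be activated for an access that starts at
    `bit_offset` within a column and is `access_bits` wide."""
    if bit_offset < 0 or access_bits <= 0 or bit_offset + access_bits > ROW_BITS:
        raise ValueError("access does not fit within one column")
    first = bit_offset // BANK_BITS
    last = (bit_offset + access_bits - 1) // BANK_BITS
    return list(range(first, last + 1))
```

A full-column access by the neural processing units activates all 64 memory blocks, whereas an aligned 256-bit media cache access activates only two.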

An advantage of the rotator capability of the neural processing units 126 described herein is that it allows the memory array 1606 of the data random access memory 122 to have significantly fewer columns, and therefore to be relatively much smaller, than would otherwise be needed to ensure that the neural processing units 126 are fully utilized, which would require the framework program (through the media caches 118) to keep providing data to the data random access memory 122 and retrieving results from it while the neural processing units 126 are performing calculations.

Internal random access memory buffer

Figure 17 is a block schematic diagram showing an embodiment of the weight random access memory 124 of Fig. 1 and a buffer 1704. The weight random access memory 124 includes a memory array 1706 and a port 1702. The memory array 1706 holds the weight texts, which in a preferred embodiment are arranged as an array of W columns of N texts, as described above. In one embodiment, the memory array 1706 comprises an array of 128 horizontally arranged static random access memory cells, each of which is 64 bits wide and 2048 high, which provides a 2MB weight random access memory 124 that is 8192 bits wide and has 2048 columns, and the die area occupied by the weight random access memory 124 is approximately 2.4 square millimeters. However, the present invention is not limited thereto.
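
The stated capacities follow directly from the cell dimensions, as the short check below shows. It uses only the figures given for the memory arrays 1606 and 1706; the function name `array_geometry` is hypothetical.

```python
def array_geometry(cells: int, cell_width_bits: int, columns: int) -> tuple:
    """Return (column width in bits, total capacity in bytes) for an array
    of horizontally arranged memory cells."""
    width_bits = cells * cell_width_bits
    capacity_bytes = width_bits * columns // 8
    return width_bits, capacity_bytes


if __name__ == "__main__":
    # Data random access memory 122: 64 cells, each 128 bits wide, 64 columns
    print(array_geometry(64, 128, 64))
    # Weight random access memory 124: 128 cells, each 64 bits wide, 2048 columns
    print(array_geometry(128, 64, 2048))
```

Both arrays are 8192 bits wide, which matches 512 texts of 16 bits each; they differ only in the number of columns.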

In a preferred embodiment, the port 1702 is coupled, in a multiplexed manner, to the neural processing units 126 and to the buffer 1704. The neural processing units 126 and the buffer 1704 read and write the weight random access memory 124 through the port 1702. The buffer 1704 is also coupled to the media caches 118 of Fig. 1, so that the media caches 118 read and write the weight random access memory 124 through the buffer 1704. An advantage of this arrangement is that, while the neural processing units 126 are reading or writing the weight random access memory 124, the media caches 118 can concurrently write to or read from the buffer 1704 (although, if the neural processing units 126 are currently executing, they are preferably stalled so that they do not access the weight random access memory 124 while the buffer 1704 is accessing it). This can improve efficiency, particularly because the reads and writes of the weight random access memory 124 by the media caches 118 are much smaller than the reads and writes of the weight random access memory 124 by the neural processing units 126. For example, in one embodiment, the neural processing units 126 read/write 8192 bits (one column) at a time, whereas a media cache 118 is only 256 bits wide and each MTNN instruction 1400 writes only two media caches 118, that is, 512 bits. Therefore, when the framework program executes sixteen MTNN instructions 1400 to fill the buffer 1704, conflicts between the neural processing units 126 and the framework program for access to the weight random access memory 124 occur less than approximately six percent of the time. In another embodiment, the instruction translator 104 translates an MTNN instruction 1400 into two microinstructions 105, each of which writes a single media cache 118 to the buffer 1704, which further reduces the frequency of conflicts between the neural processing units 126 and the framework program when accessing the weight random access memory 124.
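
One plausible reading of the "six percent" figure is sketched below. It assumes that writes into the buffer 1704 never conflict with the neural processing units, and that only the single access that transfers the filled buffer to the weight random access memory 124 can conflict. This accounting is an interpretation, not something the text states, and the function names are hypothetical.

```python
COLUMN_BITS = 8192  # one column of the weight random access memory 124


def buffer_writes(bits_per_write: int) -> int:
    """Number of writes needed to fill the buffer 1704."""
    return COLUMN_BITS // bits_per_write


def conflict_fraction(fill_writes: int, memory_accesses: int = 1) -> float:
    """Fraction of the program's accesses that touch the memory itself and
    can therefore conflict with the neural processing units (assumed model)."""
    return memory_accesses / (fill_writes + memory_accesses)


if __name__ == "__main__":
    print(f"{conflict_fraction(buffer_writes(512)):.1%}")  # 512 bits per MTNN instruction
    print(f"{conflict_fraction(buffer_writes(256)):.1%}")  # 256 bits per microinstruction
```

Under this model, sixteen 512-bit writes followed by one memory access give 1/17, just under six percent, and writing one 256-bit media cache per microinstruction lowers the fraction further, in line with the second embodiment described above.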

In an embodiment that includes the buffer 1704, writing to the weight random access memory 124 by a framework program requires multiple MTNN instructions 1400. One or more MTNN instructions 1400 specify a function 1432 that writes specified data blocks of the buffer 1704, and a subsequent MTNN instruction 1400 specifies a function 1432 that instructs the neural network unit 121 to write the contents of the buffer 1704 to a selected column of the weight random access memory 124. The size of a single data block is twice the number of bits of a media cache 118, and the data blocks are naturally aligned within the buffer 1704. In one embodiment, each MTNN instruction 1400 whose function 1432 writes specified data blocks of the buffer 1704 includes a bitmask that has a bit corresponding to each data block of the buffer 1704. The data from the two specified source buffers 118 is written to each data block of the buffer 1704 whose corresponding bit in the bitmask is set. This embodiment is useful when a column of the weight random access memory 124 contains repeated data values. For example, in order to zero out the buffer 1704 (and subsequently a column of the weight random access memory 124), the programmer can load zero into the source buffers and set all the bits of the bitmask. In addition, the bitmask allows the programmer to write only selected data blocks of the buffer 1704 and leave the other data blocks in their previous state.
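
The bitmask behavior can be modeled in a few lines. The sketch assumes an 8192-bit buffer divided into sixteen 512-bit data blocks and assumes that bit i of the bitmask corresponds to data block i; the text does not specify the bit ordering. The function name `masked_write` is hypothetical.

```python
NUM_BLOCKS = 16  # 8192-bit buffer 1704 divided into 512-bit data blocks


def masked_write(buffer_blocks: list, data: int, bitmask: int) -> list:
    """Return the new buffer contents after one masked MTNN write.

    Every data block whose bitmask bit is set receives `data` (the
    concatenated contents of the two source buffers); all other data
    blocks keep their previous contents.
    """
    if len(buffer_blocks) != NUM_BLOCKS:
        raise ValueError("buffer must hold exactly 16 data blocks")
    return [data if (bitmask >> i) & 1 else old
            for i, old in enumerate(buffer_blocks)]
```

Zeroing the whole buffer then takes a single instruction, `masked_write(blocks, 0, 0xFFFF)`, instead of sixteen separate block writes.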

In an embodiment that includes the buffer 1704, reading the weight random access memory 124 by a framework program requires multiple MFNN instructions 1500. An initial MFNN instruction 1500 specifies a function 1532 that loads the buffer 1704 from a specified column of the weight random access memory 124, and one or more subsequent MFNN instructions 1500 specify a function 1532 that reads a specified data block of the buffer 1704 into the destination buffer. The size of a single data block is the number of bits of a media cache 118, and the data blocks are naturally aligned within the buffer 1704. The technical features of the invention are equally applicable to other embodiments in which the weight random access memory 124 has multiple buffers 1704, which increases the number of accesses the framework program can make while the neural processing units 126 are executing, thereby further reducing conflicts between the neural processing units 126 and the framework program for access to the weight random access memory 124 and increasing the likelihood that accesses by the buffers 1704 can be performed during time-frequency periods in which the neural processing units 126 do not need to access the weight random access memory 124.
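
The two-step read sequence can be sketched as follows. The model represents each column as a Python integer and assumes that data block 0 occupies the least significant 256 bits of the buffer; that ordering, the dictionary used for the memory, and the function name are illustrative assumptions.

```python
COLUMN_BITS = 8192
BLOCK_BITS = 256  # one media cache 118


def read_column_via_buffer(weight_ram: dict, column: int) -> list:
    """Model of reading one column through the buffer 1704.

    Step 1 (initial MFNN instruction): load the whole column into the
    buffer, which is the only access to the memory itself.
    Step 2 (subsequent MFNN instructions): read the buffer one 256-bit
    data block at a time into destination media caches.
    """
    buffer = weight_ram[column]
    mask = (1 << BLOCK_BITS) - 1
    return [(buffer >> (i * BLOCK_BITS)) & mask
            for i in range(COLUMN_BITS // BLOCK_BITS)]
```

Reading a full column this way takes one memory access plus 32 buffer reads, so the neural processing units compete with the framework program for the memory only once per column.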

Figure 16 describes a dual-ported data random access memory 122, but the present invention is not limited thereto. The technical features of the invention are equally applicable to other embodiments in which the weight random access memory 124 is also dual-ported. In addition, Figure 17 describes a buffer used with the weight random access memory 124, but the present invention is not limited thereto. The technical features of the invention are equally applicable to embodiments in which the data random access memory 122 has a corresponding buffer similar to the buffer 1704.

Dynamically configurable neural processing unit

Figure 18 is a block schematic diagram showing a dynamically configurable neural processing unit 126 of Fig. 1. The neural processing unit 126 of Figure 18 is similar to the neural processing unit 126 of Fig. 2. However, the neural processing unit 126 of Figure 18 is dynamically configurable to operate in one of two different configurations. In the first configuration, the neural processing unit 126 of Figure 18 operates similarly to the neural processing unit 126 of Fig. 2. That is, in the first configuration, referred to herein as the "wide" configuration or "single" configuration, the arithmetic logic unit 204 of the neural processing unit 126 performs operations on a single wide data text and a single wide weight text (for example, 16 bits) to generate a single wide result. By contrast, in the second configuration, referred to herein as the "narrow" configuration or "dual" configuration, the neural processing unit 126 performs operations on two narrow data texts and two narrow weight texts (for example, 8 bits) to generate two respective narrow results. In one embodiment, the configuration (wide or narrow) of the neural processing unit 126 is established by the initialize neural processing unit instruction (for example, the instruction at address 0 of Figure 20). Alternatively, the configuration is established by an MTNN instruction whose function 1432 specifies setting the configuration (wide or narrow) of the neural processing units. In a preferred embodiment, a configuration buffer is filled by the program memory 129 instruction or the MTNN instruction that determines the configuration (wide or narrow). For example, the output of the configuration buffer is provided to the arithmetic logic unit 204, the run function unit 212 and the logic that generates the multitask buffer control signal 213. Generally speaking, the components of the neural processing unit 126 of Figure 18 perform functions similar to those of the like-numbered components of Fig. 2, and reference may be made to the embodiment of Fig. 2 to understand Figure 18. The embodiment of Figure 18, including its differences from Fig. 2, is described below.

The NPU 126 of Figure 18 includes two registers 205A and 205B, two three-input multiplexed registers 208A and 208B, an arithmetic logic unit 204, two accumulators 202A and 202B, and two activation function units 212A and 212B. Each of the registers 205A/205B is half the width (e.g., 8 bits) of the register 205 of Figure 2. Each of the registers 205A/205B receives a respective narrow weight word 206A/206B (e.g., 8 bits) from the weight random access memory 124 and provides its output 203A/203B on a subsequent clock cycle to the operand selection logic 1898 of the arithmetic logic unit 204. When the NPU 126 is in the wide configuration, the registers 205A/205B effectively operate together to receive a wide weight word 206A/206B (e.g., 16 bits) from the weight random access memory 124, similar to the register 205 of the embodiment of Figure 2. When the NPU 126 is in the narrow configuration, the registers 205A/205B effectively operate independently, each receiving a narrow weight word 206A/206B (e.g., 8 bits) from the weight random access memory 124, so that the NPU 126 is effectively two separate narrow NPUs. Nevertheless, regardless of the configuration of the NPU 126, the same output bits of the weight random access memory 124 are coupled to and provided to the registers 205A/205B. For example, register 205A of NPU 0 receives byte 0, register 205B of NPU 0 receives byte 1, register 205A of NPU 1 receives byte 2, register 205B of NPU 1 receives byte 3, and so forth, to register 205B of NPU 511, which receives byte 1023.

Each of the multiplexed registers 208A/208B is half the width (e.g., 8 bits) of the multiplexed register 208 of Figure 2. The multiplexed register 208A selects one of its inputs 207A, 211A and 1811A to store in its register and then provides it on its output 209A on a subsequent clock cycle, and the multiplexed register 208B selects one of its inputs 207B, 211B and 1811B to store in its register and then provides it on its output 209B on a subsequent clock cycle to the operand selection logic 1898. Input 207A receives a narrow data word (e.g., 8 bits) from the data random access memory 122, and input 207B receives a narrow data word from the data random access memory 122. When the NPU 126 is in the wide configuration, the multiplexed registers 208A/208B effectively operate together to receive a wide data word 207A/207B (e.g., 16 bits) from the data random access memory 122, similar to the multiplexed register 208 of the embodiment of Figure 2. When the NPU 126 is in the narrow configuration, the multiplexed registers 208A/208B effectively operate independently, each receiving a narrow data word 207A/207B (e.g., 8 bits) from the data random access memory 122, so that the NPU 126 is effectively two separate narrow NPUs. Nevertheless, regardless of the configuration of the NPU 126, the same output bits of the data random access memory 122 are coupled to and provided to the multiplexed registers 208A/208B. For example, multiplexed register 208A of NPU 0 receives byte 0, multiplexed register 208B of NPU 0 receives byte 1, multiplexed register 208A of NPU 1 receives byte 2, multiplexed register 208B of NPU 1 receives byte 3, and so forth, to multiplexed register 208B of NPU 511, which receives byte 1023.
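The fixed byte-lane coupling described in the two preceding paragraphs can be sketched as follows. This is only an illustration of the stated mapping (byte 0 to half A of unit 0, byte 1 to half B of unit 0, and so on); the function name `lane_owner` and its parameters are hypothetical and do not appear in the source.

```python
def lane_owner(byte_index: int, npus: int = 512) -> tuple[int, str]:
    """Return (unit index, half) for one byte lane of a 2*npus-byte memory row.

    Even bytes feed the 'A' half (205A / 208A) and odd bytes feed the 'B'
    half (205B / 208B), in both the wide and the narrow configuration.
    """
    if not 0 <= byte_index < 2 * npus:
        raise ValueError("byte index lies outside the row")
    unit, half = divmod(byte_index, 2)
    return unit, "AB"[half]


# The examples enumerated in the text.
for b in (0, 1, 2, 3, 1023):
    print(b, lane_owner(b))
```

Because the coupling does not depend on the configuration, only the interpretation of the two bytes (one 16-bit word or two 8-bit words) changes between the wide and narrow cases.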

Input 211A receives the output 209A of the multiplexed register 208A of the adjacent NPU 126, and input 211B receives the output 209B of the multiplexed register 208B of the adjacent NPU 126. Input 1811A receives the output 209B of the multiplexed register 208B of the adjacent NPU 126, and input 1811B receives the output 209A of the multiplexed register 208A of the same NPU 126. The NPU 126 shown in Figure 18 is one of the N NPUs 126 of Figure 1 and is denoted NPU J; that is, NPU J is a representative instance of the N NPUs. Preferably, input 211A of multiplexed register 208A of NPU J receives output 209A of multiplexed register 208A of NPU 126 instance J-1, input 1811A of multiplexed register 208A of NPU J receives output 209B of multiplexed register 208B of NPU 126 instance J-1, and output 209A of multiplexed register 208A of NPU J is provided both to input 211A of multiplexed register 208A of NPU 126 instance J+1 and to input 1811B of multiplexed register 208B of NPU 126 instance J. Further, input 211B of multiplexed register 208B of NPU J receives output 209B of multiplexed register 208B of NPU 126 instance J-1, input 1811B of multiplexed register 208B of NPU J receives output 209A of multiplexed register 208A of NPU 126 instance J, and output 209B of multiplexed register 208B of NPU J is provided both to input 1811A of multiplexed register 208A of NPU 126 instance J+1 and to input 211B of multiplexed register 208B of NPU 126 instance J+1.

The control input 213 controls each of the multiplexed registers 208A/208B to select one of its three inputs to store in its respective register and subsequently provide on the respective output 209A/209B. When the NPU 126 is instructed to load a row from the data random access memory 122 (e.g., by the multiply-accumulate instruction at address 1 of Figure 20, described below), then regardless of whether the NPU 126 is in the wide or the narrow configuration, the control input 213 controls each of the multiplexed registers 208A/208B to select its respective narrow data word 207A/207B (e.g., 8 bits) from the corresponding narrow word of the selected row of the data random access memory 122.

When the NPU 126 is instructed to rotate the previously received data row values (e.g., by the multiply-accumulate rotate instruction at address 2 of Figure 20, described below), and the NPU 126 is in the narrow configuration, the control input 213 controls each of the multiplexed registers 208A/208B to select its respective input 1811A/1811B. In this case, the multiplexed registers 208A/208B effectively operate independently, so that the NPU 126 is effectively two separate narrow NPUs. In this manner, the multiplexed registers 208A and 208B of the N NPUs 126 collectively operate as a rotator of 2N narrow words, as described in more detail below with respect to Figure 19.

When the NPU 126 is instructed to rotate the previously received data row values and the NPU 126 is in the wide configuration, the control input 213 controls each of the multiplexed registers 208A/208B to select its respective input 211A/211B. In this case, the multiplexed registers 208A/208B effectively operate together as if the NPU 126 were a single wide NPU 126. In this manner, the multiplexed registers 208A and 208B of the N NPUs 126 collectively operate as a rotator of N wide words, similar to the manner described with respect to Figure 3.
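The two rotation behaviours can be modelled on a flat list of narrow words ordered 0-A, 0-B, 1-A, 1-B, and so on. The sketch below is a behavioural model only, not a description of the circuit; the function name `rotate` and the string values `"narrow"` and `"wide"` are assumptions made for illustration.

```python
def rotate(words, config):
    """Advance a flat list of narrow words [0-A, 0-B, 1-A, 1-B, ...] by one step.

    narrow: every multiplexed register selects input 1811A/1811B, so each
            narrow lane takes the word of the preceding narrow lane
            (a rotator of 2N narrow words).
    wide:   every multiplexed register selects input 211A/211B, so the A and
            B halves each take the word of the same half of unit J-1
            (a rotator of N wide words, i.e. two narrow lanes per step).
    """
    if config not in ("narrow", "wide"):
        raise ValueError("config must be 'narrow' or 'wide'")
    if len(words) % 2:
        raise ValueError("expected two narrow words per unit")
    step = 1 if config == "narrow" else 2
    return words[-step:] + words[:-step]


print(rotate([0, 1, 2, 3], "narrow"))
print(rotate([0, 1, 2, 3], "wide"))
```

In the narrow case the list returns to its starting order after 2N steps; in the wide case it does so after N steps.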

The arithmetic logic unit 204 includes the operand selection logic 1898, a wide multiplier 242A, a narrow multiplier 242B, a wide two-input multiplexer 1896A, a narrow two-input multiplexer 1896B, a wide adder 244A and a narrow adder 244B. Effectively, the arithmetic logic unit 204 may be regarded as comprising the operand selection logic 1898, a wide arithmetic logic unit 204A (comprising the wide multiplier 242A, the wide multiplexer 1896A and the wide adder 244A) and a narrow arithmetic logic unit 204B (comprising the narrow multiplier 242B, the narrow multiplexer 1896B and the narrow adder 244B). Preferably, the wide multiplier 242A multiplies two wide words and is similar to the multiplier 242 of Figure 2, e.g., a 16-bit by 16-bit multiplier. The narrow multiplier 242B multiplies two narrow words, e.g., an 8-bit by 8-bit multiplier that generates a 16-bit result. When the NPU 126 is in the narrow configuration, the wide multiplier 242A is used, with the help of the operand selection logic 1898, as a narrow multiplier that multiplies two narrow words, so that the NPU 126 effectively functions as two narrow NPUs. Preferably, the wide adder 244A adds the output of the wide multiplexer 1896A and the output 217A of the wide accumulator 202A to generate a sum 215A for provision to the wide accumulator 202A, and operates similarly to the adder 244 of Figure 2. The narrow adder 244B adds the output of the narrow multiplexer 1896B and the output 217B of the narrow accumulator 202B to generate a sum 215B for provision to the narrow accumulator 202B. In one embodiment, the narrow accumulator 202B is 28 bits wide to avoid loss of precision when accumulating up to 1024 16-bit products. When the NPU 126 is in the wide configuration, the narrow multiplier 242B, the narrow accumulator 202B and the narrow activation function unit 212B are preferably inactive to reduce power consumption.
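The source does not state how the 28-bit accumulator width was chosen. As a rough plausibility check, and assuming a worst case in which every one of the 1024 products has the largest unsigned 16-bit magnitude, the required width can be computed as follows (all constant names are illustrative):

```python
PRODUCT_BITS = 16       # width of one narrow-multiplier product
PRODUCTS = 1024         # maximum number of products accumulated
ACCUMULATOR_BITS = 28   # width given for narrow accumulator 202B

# Assumed worst case: every product has the largest 16-bit magnitude.
worst_case_sum = PRODUCTS * (2**PRODUCT_BITS - 1)
magnitude_bits = worst_case_sum.bit_length()
signed_bits = magnitude_bits + 1          # one extra bit for a sign

print(magnitude_bits, signed_bits, ACCUMULATOR_BITS - signed_bits)
```

Under this assumption the sum fits in the stated width with a bit to spare, which is consistent with the claim that no precision is lost.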

The operand selection logic 1898 selects operands from outputs 209A, 209B, 203A and 203B to provide to the other elements of the arithmetic logic unit 204, as described in more detail below. Preferably, the operand selection logic 1898 also performs other functions, such as sign extension of signed data words and weight words. For example, if the NPU 126 is in the narrow configuration, the operand selection logic 1898 sign-extends the narrow data word and weight word to the width of a wide word before providing them to the wide multiplier 242A. Similarly, if the arithmetic logic unit 204 is instructed to pass through a narrow data/weight word (bypassing the wide multiplier 242A via the wide multiplexer 1896A), the operand selection logic 1898 sign-extends the narrow data/weight word to the width of a wide word before providing it to the wide adder 244A. Preferably, logic to perform this sign-extension function is also present in the arithmetic logic unit 204 of the NPU 126 of Figure 2.

The wide multiplexer 1896A receives the output of the wide multiplier 242A and an operand from the operand selection logic 1898, and selects one of these inputs for provision to the wide adder 244A. The narrow multiplexer 1896B receives the output of the narrow multiplier 242B and an operand from the operand selection logic 1898, and selects one of these inputs for provision to the narrow adder 244B.

The operands provided by the operand selection logic 1898 depend on the configuration of the NPU 126 and on the arithmetic and/or logical operation to be performed by the arithmetic logic unit 204, which is determined by the function specified by the instruction being executed by the NPU 126. For example, if the instruction instructs the arithmetic logic unit 204 to perform a multiply-accumulate and the NPU 126 is in the wide configuration, the operand selection logic 1898 provides to one input of the wide multiplier 242A a wide word that is the concatenation of outputs 209A and 209B, and to the other input a wide word that is the concatenation of outputs 203A and 203B, while the narrow multiplier 242B is inactive, so that the NPU 126 functions as a single wide NPU 126 similar to the NPU 126 of Figure 2. If, however, the instruction instructs the arithmetic logic unit 204 to perform a multiply-accumulate and the NPU 126 is in the narrow configuration, the operand selection logic 1898 provides to one input of the wide multiplier 242A an extended, or widened, version of the narrow data word 209A, and to the other input an extended version of the narrow weight word 203A; in addition, the operand selection logic 1898 provides the narrow data word 209B to one input of the narrow multiplier 242B and the narrow weight word 203B to the other input. To extend, or widen, a narrow word, the operand selection logic 1898 sign-extends the narrow word if it is signed, and pads the narrow word with zero-valued upper bits if it is unsigned.
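The operand routing for a multiply-accumulate can be sketched as below. This is a minimal model under stated assumptions: the names `extend` and `mac_operands` are hypothetical, and the byte order of the concatenation (the A half taken as the upper byte) is an assumption, since the text only says that the two outputs are concatenated.

```python
def extend(word, signed, narrow_bits=8, wide_bits=16):
    """Widen a narrow word: sign-extend if signed, otherwise zero-pad."""
    word &= (1 << narrow_bits) - 1
    if signed and word >> (narrow_bits - 1):
        # Set every bit above the narrow width.
        word |= ((1 << wide_bits) - 1) ^ ((1 << narrow_bits) - 1)
    return word


def mac_operands(config, out_209a, out_209b, out_203a, out_203b, signed=True):
    """Return (wide multiplier inputs, narrow multiplier inputs or None)."""
    if config == "wide":
        # Assumption: the A half forms the upper byte of the wide word.
        data = (out_209a << 8) | out_209b
        weight = (out_203a << 8) | out_203b
        return (data, weight), None          # narrow multiplier 242B inactive
    if config != "narrow":
        raise ValueError("config must be 'wide' or 'narrow'")
    wide_inputs = (extend(out_209a, signed), extend(out_203a, signed))
    narrow_inputs = (out_209b, out_203b)
    return wide_inputs, narrow_inputs


print(mac_operands("wide", 0x12, 0x34, 0x56, 0x78))
print(mac_operands("narrow", 0x80, 0x01, 0x02, 0x03))
```

Values are treated as raw bit patterns, so a sign-extended 0x80 appears as 0xFF80 rather than as a negative Python integer.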

As another example, if the NPU 126 is in the wide configuration and the instruction instructs the arithmetic logic unit 204 to perform an accumulate of a weight word, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides the concatenation of outputs 203A and 203B to the wide multiplexer 1896A for provision to the wide adder 244A. If, however, the NPU 126 is in the narrow configuration and the instruction instructs the arithmetic logic unit 204 to perform an accumulate of a weight word, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides an extended version of output 203A to the wide multiplexer 1896A for provision to the wide adder 244A; in addition, the narrow multiplier 242B is bypassed and the operand selection logic 1898 provides an extended version of output 203B to the narrow multiplexer 1896B for provision to the narrow adder 244B.

As another example, if the NPU 126 is in the wide configuration and the instruction instructs the arithmetic logic unit 204 to perform an accumulate of a data word, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides the concatenation of outputs 209A and 209B to the wide multiplexer 1896A for provision to the wide adder 244A. If, however, the NPU 126 is in the narrow configuration and the instruction instructs the arithmetic logic unit 204 to perform an accumulate of a data word, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides an extended version of output 209A to the wide multiplexer 1896A for provision to the wide adder 244A; in addition, the narrow multiplier 242B is bypassed and the operand selection logic 1898 provides an extended version of output 209B to the narrow multiplexer 1896B for provision to the narrow adder 244B. Accumulation of weight/data words is useful for averaging operations, which are used in the pooling layers of some artificial neural network applications, such as image processing.

Preferably, the NPU 126 also includes a second wide multiplexer (not shown) for bypassing the wide adder 244A, to facilitate loading the wide accumulator 202A with a wide data/weight word in the wide configuration or with an extended narrow data/weight word in the narrow configuration, and a second narrow multiplexer (not shown) for bypassing the narrow adder 244B, to facilitate loading the narrow accumulator 202B with a narrow data/weight word in the narrow configuration. Preferably, the arithmetic logic unit 204 also includes wide and narrow comparator/multiplexer combinations (not shown) that receive the respective accumulator value 217A/217B and the respective multiplexer 1896A/1896B output, in order to select the maximum of the accumulator value 217A/217B and a data/weight word 209A/209B/203A/203B. This operation is used in the pooling layers of some artificial neural network applications, as described in more detail below, e.g., with respect to Figures 27 and 28. In addition, the operand selection logic 1898 can provide zero-valued operands (for addition of zero or for clearing the accumulators) and one-valued operands (for multiplication by one).
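The two pooling-related uses of the accumulator mentioned above (maximum selection and accumulation for averaging) can be illustrated with a small behavioural sketch. The function names are hypothetical, and the final division in `average_pool` is shown only to complete the example; it is not part of the adder and comparator paths described in this passage.

```python
def max_pool(words):
    """Keep the running maximum in the accumulator."""
    acc = words[0]                 # adder bypassed: accumulator loaded directly
    for w in words[1:]:
        acc = w if w > acc else acc   # comparator/multiplexer keeps the larger
    return acc


def average_pool(words):
    """Accumulate the words with the multiplier bypassed, then average."""
    acc = 0                        # zero-valued operand clears the accumulator
    for w in words:
        acc += w                   # word passes straight to the adder
    return acc / len(words)        # division assumed to happen elsewhere


print(max_pool([3, 9, 2, 7]))
print(average_pool([2, 4, 6, 8]))
```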

The narrow activation function unit 212B receives the output 217B of the narrow accumulator 202B and performs an activation function on it to generate a narrow result 133B, and the wide activation function unit 212A receives the output 217A of the wide accumulator 202A and performs an activation function on it to generate a wide result 133A. When the NPU 126 is in the narrow configuration, the wide activation function unit 212A interprets the output 217A of the wide accumulator 202A accordingly and performs an activation function on it to generate a narrow result, e.g., 8 bits, as described in more detail below, e.g., with respect to Figures 29A through 30.

As described above, a single NPU 126 in the narrow configuration effectively operates as two narrow NPUs and can therefore provide, for smaller words, up to approximately twice the throughput of the wide configuration. For example, assume a neural network layer having 1024 neurons, each receiving 1024 narrow inputs from the previous layer (and having narrow weight words), for a total of one million connections. Compared with the wide configuration, a neural network unit 121 having 512 NPUs 126 in the narrow configuration (equivalent to 1024 narrow NPUs) can process four times as many connections (one million versus 256K) in approximately twice the number of clock cycles (about 1026 versus 514), albeit on narrow words rather than wide words.
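The throughput comparison can be checked with a few lines of arithmetic. The clock counts are the approximate figures quoted in the text; the connection counts assume a fully connected layer with one neuron per lane, and the names used are illustrative.

```python
NPUS = 512
CLOCKS = {"narrow": 1026, "wide": 514}   # approximate figures quoted in the text


def connections(config):
    """Connections in a fully connected layer with one neuron per lane."""
    lanes = NPUS * (2 if config == "narrow" else 1)
    return lanes * lanes


conn_ratio = connections("narrow") / connections("wide")
clock_ratio = CLOCKS["narrow"] / CLOCKS["wide"]
throughput_ratio = conn_ratio / clock_ratio

print(connections("narrow"), connections("wide"))
print(conn_ratio, round(clock_ratio, 3), round(throughput_ratio, 3))
```

Four times the connections in roughly twice the clock cycles gives a throughput ratio of about two, matching the "up to twice" figure stated for the narrow configuration.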

In one embodiment, the dynamically configurable NPU 126 of Figure 18 includes three-input multiplexed registers, similar to the multiplexed registers 208A and 208B, in place of the registers 205A and 205B, forming a rotator for a row of weight words received from the weight random access memory 124. This operates in a manner similar to that described for the embodiment of Figure 7, but in the dynamically configurable fashion described with respect to Figure 18.

Figure 19 is a block diagram illustrating, according to the embodiment of Figure 18, the arrangement of the 2N multiplexed registers 208A/208B of the N NPUs 126 of the neural network unit 121 of Figure 1, and their operation as a rotator for a row of data words 207 received from the data random access memory 122 of Figure 1. In the embodiment of Figure 19, N is 512, so that the neural network unit 121 has 1024 multiplexed registers 208A/208B, denoted 0 through 511, corresponding to the 512 NPUs 126 and effectively to 1024 narrow NPUs. The two narrow NPUs within an NPU 126 are denoted A and B, and within each multiplexed register 208 the designation of the corresponding narrow NPU is shown. More specifically, the multiplexed register 208A of NPU 126 0 is denoted 0-A, the multiplexed register 208B of NPU 126 0 is denoted 0-B, the multiplexed register 208A of NPU 126 1 is denoted 1-A, the multiplexed register 208B of NPU 126 1 is denoted 1-B, the multiplexed register 208A of NPU 126 511 is denoted 511-A, and the multiplexed register 208B of NPU 126 511 is denoted 511-B. These designations also correspond to the narrow NPUs of Figure 21, described below.

Each multiplexed register 208A receives its corresponding narrow data word 207A of one of the D rows of the data random access memory 122, and each multiplexed register 208B receives its corresponding narrow data word 207B of one of the D rows of the data random access memory 122. That is, multiplexed register 0-A receives narrow data word 0 of the data random access memory 122 row, multiplexed register 0-B receives narrow data word 1, multiplexed register 1-A receives narrow data word 2, multiplexed register 1-B receives narrow data word 3, and so forth, to multiplexed register 511-A, which receives narrow data word 1022, and multiplexed register 511-B, which receives narrow data word 1023. In addition, multiplexed register 1-A receives on its input 211A the output 209A of multiplexed register 0-A, multiplexed register 1-B receives on its input 211B the output 209B of multiplexed register 0-B, and so forth, to multiplexed register 511-A, which receives on its input 211A the output 209A of multiplexed register 510-A, and multiplexed register 511-B, which receives on its input 211B the output 209B of multiplexed register 510-B; multiplexed register 0-A receives on its input 211A the output 209A of multiplexed register 511-A, and multiplexed register 0-B receives on its input 211B the output 209B of multiplexed register 511-B. Finally, multiplexed register 1-A receives on its input 1811A the output 209B of multiplexed register 0-B, multiplexed register 1-B receives on its input 1811B the output 209A of multiplexed register 1-A, and so forth, to multiplexed register 511-A, which receives on its input 1811A the output 209B of multiplexed register 510-B, and multiplexed register 511-B, which receives on its input 1811B the output 209A of multiplexed register 511-A; multiplexed register 0-A receives on its input 1811A the output 209B of multiplexed register 511-B, and multiplexed register 0-B receives on its input 1811B the output 209A of multiplexed register 0-A. Each of the multiplexed registers 208A/208B receives the control input 213, which controls whether it selects the data word 207A/207B, the rotated input 211A/211B or the rotated input 1811A/1811B. In one mode of operation, on a first clock cycle the control input 213 controls each of the multiplexed registers 208A/208B to select the data word 207A/207B for storage in its register and subsequent provision to the arithmetic logic unit 204; and on subsequent clock cycles (e.g., the M-1 clock cycles described above), the control input 213 controls each of the multiplexed registers 208A/208B to select the rotated input 1811A/1811B for storage in its register and subsequent provision to the arithmetic logic unit 204, as described in more detail below.
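The wiring enumerated for Figure 19 follows a regular pattern that can be written as a small lookup. The sketch below only restates the connections listed in the text; the function name `sources` and the dictionary layout are illustrative choices.

```python
N = 512  # number of neural processing units in the Figure 19 embodiment


def sources(j: int, half: str) -> dict:
    """Return which multiplexed register drives each rotated input of J-half."""
    prev = (j - 1) % N                      # unit 0 wraps around to unit 511
    if half == "A":
        return {"211A": f"{prev}-A", "1811A": f"{prev}-B"}
    if half == "B":
        return {"211B": f"{prev}-B", "1811B": f"{j}-A"}
    raise ValueError("half must be 'A' or 'B'")


for j, half in ((1, "A"), (1, "B"), (0, "A"), (0, "B"), (511, "B")):
    print(f"{j}-{half}", sources(j, half))
```

Inputs 211A/211B always come from the same half of the previous unit (the wide rotation path), whereas inputs 1811A/1811B come from the immediately preceding narrow lane (the narrow rotation path).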

Figure 20 is a table illustrating a program stored in the program memory 129 of, and executed by, the neural network unit 121 of Figure 1, where the neural network unit 121 has NPUs 126 according to the embodiment of Figure 18. The example program of Figure 20 is similar to the program of Figure 4; the differences are described below. The initialize NPU instruction at address 0 specifies that the NPU 126 is to be in the narrow configuration. In addition, as shown, the multiply-accumulate rotate instruction at address 2 specifies a count of 1023 and requires 1023 clock cycles. This is because the example of Figure 20 assumes a layer of effectively 1024 narrow (e.g., 8-bit) neurons (i.e., narrow NPUs), each having 1024 connection inputs from the 1024 neurons of the previous layer, for a total of 1024K connections. Each neuron receives an 8-bit data value from each connection input and multiplies that 8-bit data value by an appropriate 8-bit weight value.
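The clock budget of the Figure 20 program can be tabulated as follows. The instruction list is a paraphrase assembled from the surrounding description (the instructions at addresses 3 and 4 are named in the discussion of Figure 21), and the assumption that every instruction other than the rotate takes one clock cycle is inferred from that timing description; the names `PROGRAM` and `schedule` are illustrative.

```python
# (address, description, clock cycles); a paraphrase, not the literal table.
PROGRAM = [
    (0, "initialize NPU, narrow configuration", 1),
    (1, "multiply-accumulate", 1),
    (2, "multiply-accumulate rotate, count 1023", 1023),
    (3, "activation function", 1),
    (4, "write activation function unit output", 1),
]


def schedule(program):
    """Map each instruction address to its (first clock, last clock)."""
    clock = 0
    spans = {}
    for address, _description, clocks in program:
        spans[address] = (clock, clock + clocks - 1)
        clock += clocks
    return spans


for address, span in schedule(PROGRAM).items():
    print(address, span)
```

The last instruction lands on clock cycle 1026, which is the final clock cycle shown in the timing diagram of Figure 21.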

Figure 21 is a timing diagram illustrating execution of the program of Figure 20 by the neural network unit 121, which has NPUs 126 as shown in Figure 18 operating in the narrow configuration. The timing diagram of Figure 21 is similar to the timing diagram of Figure 5; the differences are described below.

In the timing diagram of Figure 21, the NPUs 126 are in the narrow configuration because the initialize NPU instruction at address 0 initializes them to the narrow configuration. Consequently, the 512 NPUs 126 effectively operate as 1024 narrow NPUs (or neurons), which are designated in the columns as NPU 0-A and NPU 0-B (the two narrow NPUs of NPU 126 0), NPU 1-A and NPU 1-B (the two narrow NPUs of NPU 126 1), and so forth, through NPU 511-A and NPU 511-B (the two narrow NPUs of NPU 126 511). For simplicity of illustration, only the operations of narrow NPUs 0-A, 0-B and 511-B are shown. Because the multiply-accumulate rotate instruction at address 2 specifies a count of 1023, which requires 1023 clock cycles, the rows of the timing diagram of Figure 21 extend through clock cycle 1026.

At clock cycle 0, each of the 1024 narrow NPUs performs the initialize instruction of Figure 4, which is illustrated in Figure 5 by the assignment of a zero value to the accumulator 202.

At clock cycle 1, each of the 1024 narrow NPUs performs the multiply-accumulate instruction at address 1 of Figure 20. As shown, narrow NPU 0-A accumulates the accumulator 202A value (i.e., zero) with the product of narrow word 0 of row 17 of the data random access memory 122 and narrow word 0 of row 0 of the weight random access memory 124; narrow NPU 0-B accumulates the accumulator 202B value (i.e., zero) with the product of narrow word 1 of row 17 of the data random access memory 122 and narrow word 1 of row 0 of the weight random access memory 124; and so forth, to narrow NPU 511-B, which accumulates the accumulator 202B value (i.e., zero) with the product of narrow word 1023 of row 17 of the data random access memory 122 and narrow word 1023 of row 0 of the weight random access memory 124.

At clock cycle 2, each of the 1024 narrow NPUs performs the first iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow NPU 0-A accumulates the accumulator 202A value 217A with the product of the rotated narrow data word 1811A received from the output 209B of multiplexed register 208B of narrow NPU 511-B (namely, narrow data word 1023 received from the data random access memory 122) and narrow word 0 of row 1 of the weight random access memory 124; narrow NPU 0-B accumulates the accumulator 202B value 217B with the product of the rotated narrow data word 1811B received from the output 209A of multiplexed register 208A of narrow NPU 0-A (namely, narrow data word 0 received from the data random access memory 122) and narrow word 1 of row 1 of the weight random access memory 124; and so forth, to narrow NPU 511-B, which accumulates the accumulator 202B value 217B with the product of the rotated narrow data word 1811B received from the output 209A of multiplexed register 208A of narrow NPU 511-A (namely, narrow data word 1022 received from the data random access memory 122) and narrow word 1023 of row 1 of the weight random access memory 124.

At clock cycle 3, each of the 1024 narrow NPUs performs the second iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow NPU 0-A accumulates the accumulator 202A value 217A with the product of the rotated narrow data word 1811A received from the output 209B of multiplexed register 208B of narrow NPU 511-B (namely, narrow data word 1022 received from the data random access memory 122) and narrow word 0 of row 2 of the weight random access memory 124; narrow NPU 0-B accumulates the accumulator 202B value 217B with the product of the rotated narrow data word 1811B received from the output 209A of multiplexed register 208A of narrow NPU 0-A (namely, narrow data word 1023 received from the data random access memory 122) and narrow word 1 of row 2 of the weight random access memory 124; and so forth, to narrow NPU 511-B, which accumulates the accumulator 202B value 217B with the product of the rotated narrow data word 1811B received from the output 209A of multiplexed register 208A of narrow NPU 511-A (namely, narrow data word 1021 received from the data random access memory 122) and narrow word 1023 of row 2 of the weight random access memory 124. As shown in Figure 21, this continues for each of the following 1021 clock cycles, through clock cycle 1024, described below.

At clock cycle 1024, each of the 1024 narrow NPUs performs the 1023rd iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow NPU 0-A accumulates the accumulator 202A value 217A with the product of the rotated narrow data word 1811A received from the output 209B of multiplexed register 208B of narrow NPU 511-B (namely, narrow data word 1 received from the data random access memory 122) and narrow word 0 of row 1023 of the weight random access memory 124; narrow NPU 0-B accumulates the accumulator 202B value 217B with the product of the rotated narrow data word 1811B received from the output 209A of multiplexed register 208A of narrow NPU 0-A (namely, narrow data word 2 received from the data random access memory 122) and narrow word 1 of row 1023 of the weight random access memory 124; and so forth, to narrow NPU 511-B, which accumulates the accumulator 202B value 217B with the product of the rotated narrow data word 1811B received from the output 209A of multiplexed register 208A of narrow NPU 511-A (namely, narrow data word 0 received from the data random access memory 122) and narrow word 1023 of row 1023 of the weight random access memory 124.
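The multiply-accumulate and rotate sequence of clock cycles 1 through 1024 can be reproduced with a small behavioural model. The sketch below uses a handful of lanes instead of 1024, and plain Python integers instead of 8-bit words; the function name `run_layer` and its arguments are hypothetical.

```python
def run_layer(data_row, weight_rows):
    """Model the multiply-accumulate plus rotate iterations for narrow lanes.

    data_row:    one memory row of narrow data words, one word per lane
    weight_rows: one weight row per clock cycle, each with one word per lane
    Returns the accumulators and the final contents of the lane registers.
    """
    lanes = len(data_row)
    regs = list(data_row)          # first cycle: every lane loads its own word
    acc = [0] * lanes              # accumulators cleared by the initialize step
    for k, weights in enumerate(weight_rows):
        if k:                      # later cycles: rotate by one narrow lane
            regs = regs[-1:] + regs[:-1]
        for i in range(lanes):
            acc[i] += regs[i] * weights[i]
    return acc, regs


data = [10, 11, 12, 13]
weights = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
acc, regs = run_layer(data, weights)
print(acc, regs)
```

After k rotations, lane i holds data word (i - k) modulo the lane count. With 1024 lanes and k = 1023 this gives data word 1 in lane 0-A, data word 2 in lane 0-B and data word 0 in lane 511-B, as stated for clock cycle 1024.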

In time-frequency period 1025, the run function unit 212A/212B of each of the 1024 narrow neural processing units executes the run function instruction at address 3 in Figure 20. Finally, in time-frequency period 1026, each of the 1024 narrow neural processing units writes its narrow result 133A/133B back to the corresponding narrow text of column 16 of data random access memory 122, thereby executing the write run function unit instruction at address 4 in Figure 20. That is, the narrow result 133A of neural processing unit 0-A is written to narrow text 0 of data random access memory 122, the narrow result 133B of neural processing unit 0-B is written to narrow text 1 of data random access memory 122, and so on, until the narrow result 133B of neural processing unit 511-B is written to narrow text 1023 of data random access memory 122. Figure 22 shows, in block diagram form, the operation described above with respect to Figure 21.
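The multiply-accumulate and rotation sequence described above can be sketched in software as follows. This is only an illustrative model, not the hardware: the function name and the use of Python lists are assumptions, and the number of narrow neural processing units is taken from the length of the input so that small sizes can be tried.

```python
def narrow_mac_rotate(data, weights):
    """Model the multiply-accumulate / rotate sequence over n narrow units.

    data:    the n narrow data literals of one data RAM column.
    weights: n weight RAM columns, each holding n narrow texts.
    Returns the n accumulator values after all n time-frequency periods.
    """
    n = len(data)
    acc = [0] * n          # accumulators, cleared by the initialize instruction
    regs = list(data)      # multitask buffer contents after the first load
    for col in range(n):
        if col > 0:
            # Rotation: unit j takes the literal held by unit j-1
            # (unit 0 takes it from the last unit, as 0-A does from 511-B).
            regs = [regs[(j - 1) % n] for j in range(n)]
        for j in range(n):
            acc[j] += regs[j] * weights[col][j]
    return acc
```

With n = 1024, unit 0 sees data literal 1023 on the first rotation, 1022 on the second, and 1 on the last, which matches the sequence given in the text.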

Figure 22 is a block diagram showing the neural network unit 121 of Figure 1, which has the neural processing units 126 of Figure 18, executing the program of Figure 20. The neural network unit 121 includes 512 neural processing units 126, i.e., 1024 narrow neural processing units, a data random access memory 122 that receives its address input 123, and a weight random access memory 124 that receives its address input 125. Although not shown in the figure, in time-frequency period 0 all 1024 narrow neural processing units execute the initialization instruction of Figure 20. As shown in the figure, in time-frequency period 1 the 1024 8-bit data literals of column 17 are read from data random access memory 122 and provided to the 1024 narrow neural processing units. In time-frequency periods 1 to 1024, the 1024 8-bit weight texts of columns 0 to 1023, respectively, are read from weight random access memory 124 and provided to the 1024 narrow neural processing units. Although not shown in the figure, in time-frequency period 1 the 1024 narrow neural processing units perform their respective multiply-accumulate operations on the loaded data literals and weight texts. In time-frequency periods 2 to 1024, the multitask buffers 208A/208B of the 1024 narrow neural processing units operate as a rotator of 1024 8-bit texts, rotating the previously loaded data literals of column 17 of data random access memory 122 to the adjacent narrow neural processing unit, and the narrow neural processing units perform multiply-accumulate operations on the respective rotated data literal and the respective narrow weight text loaded from weight random access memory 124. Although not shown in the figure, in time-frequency period 1025 the run function units 212A/212B of the 1024 narrow neural processing units execute the run function instruction. In time-frequency period 1026, the 1024 narrow neural processing units write their respective 1024 8-bit results 133A/133B back to column 16 of data random access memory 122.
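The schedule of Figure 22 can be summarized as a small lookup, sketched below. The function name and the returned strings are illustrative assumptions; only the period numbers and column numbers come from the description.

```python
def operation_in_period(period, n=1024):
    """Return a short description of what the narrow units do in a period."""
    if period == 0:
        return "initialize"
    if period == 1:
        return "load data column 17, multiply-accumulate with weight column 0"
    if 2 <= period <= n:
        # Weight columns 1 to n-1 are consumed in periods 2 to n.
        return f"rotate, multiply-accumulate with weight column {period - 1}"
    if period == n + 1:
        return "run function"
    if period == n + 2:
        return "write results to data column 16"
    raise ValueError("period outside the program of Figure 20")
```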

It can thus be seen that, compared with the embodiment of Figure 2, the embodiment of Figure 18 gives the programmer the flexibility to perform computations using either wide data and weight texts (e.g., 16 bits) or narrow data and weight texts (e.g., 8 bits), according to the accuracy required by the particular application. From one perspective, for narrow-data applications the embodiment of Figure 18 provides twice the performance of the embodiment of Figure 2, at the cost of the additional narrow components (e.g., multitask buffer 208B, buffer 205B, narrow arithmetic logic unit 204B, narrow accumulator 202B, narrow run function unit 212B), which increase the area of neural processing unit 126 by about 50%.

Tri-mode neural processing units

Figure 23 is a block diagram showing another embodiment of the dynamically configurable neural processing unit 126 of Figure 1. The neural processing unit 126 of Figure 23 can be used not only in the wide and narrow configurations but also in a third configuration, referred to herein as the "funnel" configuration. The neural processing unit 126 of Figure 23 is similar to the neural processing unit 126 of Figure 18. However, the wide adder 244A of Figure 18 is replaced in the neural processing unit 126 of Figure 23 by a three-input wide adder 2344A, which receives a third addend 2399 that is an extended version of the output of narrow multiplexer 1896B. The program executed by a neural network unit having the neural processing units of Figure 23 is similar to the program of Figure 20. However, the initialize neural processing unit instruction at address 0 initializes the neural processing units 126 to the funnel configuration rather than the narrow configuration. In addition, the count value of the multiply-accumulate rotation instruction at address 2 is 511 rather than 1023.

In the funnel configuration, neural processing unit 126 operates similarly to the narrow configuration. When executing a multiply-accumulate instruction such as the one at address 1 in Figure 20, neural processing unit 126 receives two narrow data literals 207A/207B and two narrow weight texts 206A/206B; wide multiplier 242A multiplies data literal 209A and weight text 203A to produce product 246A, which wide multiplexer 1896A selects; and narrow multiplier 242B multiplies data literal 209B and weight text 203B to produce product 246B, which narrow multiplexer 1896B selects. However, wide adder 2344A adds both product 246A (selected by wide multiplexer 1896A) and product 246B/2399 (selected by narrow multiplexer 1896B) to the wide accumulator 202A output 217A, whereas narrow adder 244B and narrow accumulator 202B are inactive. Furthermore, when the funnel configuration executes a multiply-accumulate rotation instruction such as the one at address 2 in Figure 20, control input 213 causes the multitask buffers 208A/208B to rotate by two narrow texts (i.e., 16 bits); that is, the multitask buffers 208A/208B select their respective inputs 211A/211B, just as in the wide configuration. However, wide multiplier 242A multiplies data literal 209A and weight text 203A to produce product 246A, which wide multiplexer 1896A selects; narrow multiplier 242B multiplies data literal 209B and weight text 203B to produce product 246B, which narrow multiplexer 1896B selects; and wide adder 2344A adds both product 246A (selected by wide multiplexer 1896A) and product 246B/2399 (selected by narrow multiplexer 1896B) to the wide accumulator 202A output 217A, whereas narrow adder 244B and narrow accumulator 202B are inactive, as described above. Finally, when the funnel configuration executes a run function instruction such as the one at address 3 in Figure 20, wide run function unit 212A performs the run function on the resulting sum 215A to produce a narrow result 133A, and narrow run function unit 212B is inactive. Thus, only the narrow neural processing units denoted A produce a narrow result 133A; the narrow results 133B produced by the narrow neural processing units denoted B are invalid. Consequently, the column of written-back results (e.g., column 16 indicated by the instruction at address 4 in Figure 20) contains holes, because only the narrow results 133A are valid and the narrow results 133B are invalid. Conceptually, therefore, in each time-frequency period each neuron (neural processing unit of Figure 23) processes two connection data inputs, i.e., it multiplies two narrow data literals by their respective weights and adds the two products, whereas the embodiments of Figure 2 and Figure 18 each process only one connection data input per time-frequency period.
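A minimal software sketch of the funnel configuration follows. It is an illustrative model under stated assumptions: the function name is hypothetical, holes in the result column are represented by `None`, and the rotation direction is taken to be the same as in the wide configuration.

```python
def funnel_mac_rotate(data, weights):
    """Model the funnel configuration for n narrow literals (n even).

    data:    n narrow data literals of one data RAM column.
    weights: n // 2 weight RAM columns of n narrow texts each.
    Returns the written-back column: valid results in the A (even)
    positions, None marking the invalid B (odd) positions.
    """
    n = len(data)
    m = n // 2                       # number of wide units; only A halves accumulate
    acc = [0] * m
    regs = list(data)
    for col in range(m):             # one multiply-accumulate plus m - 1 rotations
        if col > 0:
            # Rotate by two narrow texts, i.e., by one wide text.
            regs = [regs[(j - 2) % n] for j in range(n)]
        for k in range(m):
            # The three-input adder sums both products into the wide accumulator.
            acc[k] += (regs[2 * k] * weights[col][2 * k]
                       + regs[2 * k + 1] * weights[col][2 * k + 1])
    out = [None] * n
    out[0::2] = acc
    return out
```

With n = 1024 the loop runs 512 times, which corresponds to the multiply-accumulate instruction followed by the rotation count of 511.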

It can be seen that, in the embodiment of Figure 23, the number of result texts (neuron outputs) produced and written back to data random access memory 122 or weight random access memory 124 is half the square root of the number of data inputs (connections) received, and the written-back result column has holes, i.e., every other narrow text result is invalid; more precisely, the results of the narrow neural processing units denoted B are not meaningful. The embodiment of Figure 23 is therefore particularly efficient for neural networks with two successive layers in which, for example, the first layer has twice as many neurons as the second layer (e.g., a first layer of 1024 neurons fully connected to a second layer of 512 neurons). Furthermore, the other execution units 112 (e.g., media units, such as x86 advanced vector extension units) may, if necessary, perform a pack operation on a dispersed result column (i.e., one having holes) to make it compact (i.e., without holes). The processed data column can then be used in subsequent computations when neural network unit 121 operates on other columns of data random access memory 122 and/or weight random access memory 124.
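The pack operation mentioned above amounts to keeping every other narrow text. A sketch, assuming the holes occupy the odd (B) positions and using a hypothetical function name:

```python
def pack_results(column):
    """Keep only the valid A (even-position) results of a funnel result column."""
    return column[0::2]
```

For a column of 1024 narrow texts this yields the 512 valid neuron outputs as a dense list.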

Hybrid neural network unit operation: convolution and common source capabilities

An advantage of the neural network unit 121 described in the embodiments of the present invention is that it can operate concurrently in a manner resembling a coprocessor, in that it executes its own internal program, and in a manner resembling an execution unit of a processor, in that it executes framework instructions issued to it (or microinstructions translated from framework instructions). The framework instructions belong to a framework program executed by the processor that includes neural network unit 121. In this way, neural network unit 121 operates in a hybrid manner and can maintain high utilization of neural network unit 121. For example, Figures 24 to 26 show neural network unit 121 performing a convolution operation in which the neural network unit is fully utilized, and Figures 27 and 28 show neural network unit 121 performing a common source operation. Convolutional layers, common source layers and other numerical data computing applications, such as image processing (e.g., edge detection, sharpening, blurring, recognition/classification), need these operations. However, the hybrid operation of neural network unit 121 is not limited to performing convolution or common source operations; the hybrid feature can also be used for other operations, such as the conventional neural network multiply-accumulate and run function operations described with respect to Figures 4 to 13. That is, processor 100 (more precisely, reservation station 108) issues MTNN instructions 1400 and MFNN instructions 1500 to neural network unit 121; in response, neural network unit 121 writes data to the memories 122/124/129 and reads results from the memories 122/124 written by neural network unit 121, while at the same time neural network unit 121 reads and writes the memories 122/124/129 in order to execute the program written into program memory 129 by processor 100 (through MTNN instructions 1400).

Figure 24 is a block diagram showing an example of the data structures used by the neural network unit 121 of Figure 1 to perform a convolution operation. The block diagram includes a convolution kernel 2402, a data array 2404, and the data random access memory 122 and weight random access memory 124 of Figure 1. Preferably, the data array 2404 (e.g., of image pixels) is held in a system memory (not shown) attached to processor 100 and is loaded into the weight random access memory 124 of neural network unit 121 by processor 100 executing MTNN instructions 1400. A convolution operation convolves a first array with a second array, the second array being the convolution kernel described herein. As described herein, a convolution kernel is a matrix of coefficients, which may also be referred to as weights, parameters, elements or values. Preferably, the convolution kernel 2402 is static data of the framework program executed by processor 100.

The data array 2404 is a two-dimensional array of data values, and each data value (e.g., an image pixel value) has the size of a text of data random access memory 122 or weight random access memory 124 (e.g., 16 bits or 8 bits). In this example, the data values are 16-bit texts, and neural network unit 121 is configured as 512 neural processing units 126 in the wide configuration. In addition, in this embodiment neural processing unit 126 includes a multitask buffer for receiving weight texts 206 from weight random access memory 124, such as the multitask buffer 705 of Figure 7, in order to perform a collective rotator operation on a column of data values received from weight random access memory 124, as described in more detail in later sections. In this example, the data array 2404 is a pixel array of 2560 rows by 1600 columns. As shown in the figure, when the framework program convolves the data array 2404 with the convolution kernel 2402, the data array 2404 is divided into 20 data blocks, each of which is a 512x400 data matrix 2406.
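The division into 20 data blocks follows from the array and block dimensions (2560 / 512 = 5 and 1600 / 400 = 4). A sketch, in which the function name and the order in which blocks are enumerated are assumptions:

```python
def split_into_blocks(width=2560, height=1600, block_width=512, block_height=400):
    """Return (left, top, block_width, block_height) for each data block."""
    blocks = []
    for top in range(0, height, block_height):
        for left in range(0, width, block_width):
            blocks.append((left, top, block_width, block_height))
    return blocks
```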

In this example, the convolution kernel 2402 is a 3x3 array of coefficients (also called weights, parameters or elements). The first column of coefficients is denoted C0,0; C0,1; and C0,2; the second column is denoted C1,0; C1,1; and C1,2; and the third column is denoted C2,0; C2,1; and C2,2. For example, a convolution kernel with the following coefficients can be used to perform edge detection: 0, 1, 0, 1, -4, 1, 0, 1, 0. As another example, a convolution kernel with the following coefficients can be used to perform a Gaussian blur: 1, 2, 1, 2, 4, 2, 1, 2, 1. In this case, a division is typically also performed on the final accumulated value, where the divisor is the sum of the absolute values of the elements of the convolution kernel 2402, which is 16 in this example. In another example, the divisor may be the number of elements of the convolution kernel 2402. In yet another example, the divisor may be a value used to compress the convolution results into a target range of values, the divisor being determined from the element values of the convolution kernel 2402, the target range, and the range of the input value array on which the convolution is performed.
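The first two divisor choices mentioned above can be computed directly from the kernel coefficients. The sketch below uses hypothetical names and holds each kernel as a nested list in the order the coefficients are listed in the text.

```python
EDGE_DETECT = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
GAUSSIAN_BLUR = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]

def abs_sum_divisor(kernel):
    """Divisor equal to the sum of the absolute values of the elements."""
    return sum(abs(c) for line in kernel for c in line)

def element_count_divisor(kernel):
    """Divisor equal to the number of kernel elements."""
    return sum(len(line) for line in kernel)
```

For the Gaussian blur kernel the first function returns 16, the value given in the text.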

Referring to Figure 24, and to Figure 25 where the details are described, the framework program writes the coefficients of the convolution kernel 2402 to data random access memory 122. Preferably, all the texts of each of nine consecutive columns of data random access memory 122 (nine being the number of elements in the convolution kernel 2402) are written with a different element of the convolution kernel 2402, in the order C0,0; C0,1; C0,2; C1,0; and so on. That is, as shown in the figure, every text of one column is written with the first coefficient C0,0; the next column is written with the second coefficient C0,1; the next column is written with the third coefficient C0,2; the next column is written with the fourth coefficient C1,0; and so on, until every text of the ninth column is written with the ninth coefficient C2,2. To convolve a data matrix 2406 of a data block of the data array 2404, the neural processing units 126 repeatedly read, in sequence, the nine columns of data random access memory 122 that hold the convolution kernel 2402 coefficients, as described in more detail in later sections, particularly with respect to Figure 26A.
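The layout of the kernel in the data random access memory can be sketched as nine columns, each filled with one coefficient. The function name and the list-of-lists representation are assumptions made for illustration.

```python
def kernel_to_data_ram_columns(kernel, texts_per_column=512):
    """Return nine columns; column k holds the k-th coefficient in every text."""
    return [[coeff] * texts_per_column for line in kernel for coeff in line]
```

Broadcasting each coefficient across a whole column lets all 512 neural processing units read the same coefficient in the same time-frequency period.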

Referring to Figure 24, and to Figure 25 where the details are described, the framework program writes the values of a data matrix 2406 to weight random access memory 124. As the neural network unit program performs the convolution, it writes the result array back to weight random access memory 124. Preferably, the framework program writes a first data matrix 2406 to weight random access memory 124 and starts neural network unit 121; while neural network unit 121 is convolving the first data matrix 2406 with the convolution kernel 2402, the framework program writes a second data matrix 2406 to weight random access memory 124, so that neural network unit 121 can begin convolving the second data matrix 2406 as soon as it completes the convolution of the first data matrix 2406, as described in more detail with respect to Figure 25. In this manner, the framework program alternates between two regions of weight random access memory 124 to keep neural network unit 121 fully utilized. Accordingly, the example of Figure 24 shows a first data matrix 2406A, corresponding to a first data block occupying columns 0 to 399 of weight random access memory 124, and a second data matrix 2406B, corresponding to a second data block occupying columns 500 to 899 of weight random access memory 124. In addition, as shown in the figure, neural network unit 121 writes the results of the convolution back to columns 900-1299 and columns 1300-1699 of weight random access memory 124, from which the framework program subsequently reads them. The data values of a data matrix 2406 held in weight random access memory 124 are denoted "Dx,y", where "x" is the column number of weight random access memory 124 and "y" is the text, or row, number of weight random access memory 124. For example, data literal 511 of column 399 is denoted D399,511 in Figure 24, and it is received by the multitask buffer 705 of neural processing unit 511.
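The alternation between the two input regions and the two output regions can be sketched as below. The column numbers come from Figure 24 as described; the pairing of each input region with an output region, the alternation by block parity, and all names are assumptions.

```python
INPUT_BASES = (0, 500)       # first columns of data matrices 2406A and 2406B
OUTPUT_BASES = (900, 1300)   # first columns of the two result regions
BLOCK_COLUMNS = 400

def weight_ram_regions(block_number):
    """Return ((first_in, last_in), (first_out, last_out)) for a 1-based block."""
    slot = (block_number - 1) % 2
    first_in = INPUT_BASES[slot]
    first_out = OUTPUT_BASES[slot]
    return ((first_in, first_in + BLOCK_COLUMNS - 1),
            (first_out, first_out + BLOCK_COLUMNS - 1))
```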

Figure 25 is a flowchart showing the processor 100 of Figure 1 executing a framework program that uses neural network unit 121 to convolve the data array 2404 of Figure 24 with the convolution kernel 2402. The flow begins at step 2502.

In step 2502, processor 100, executing the framework program, writes the convolution kernel 2402 of Figure 24 to data random access memory 122 in the manner shown and described with respect to Figure 24. In addition, the framework program initializes a variable N to 1. The variable N denotes the data block of the data array 2404 that neural network unit 121 is currently processing. The framework program also initializes a variable NUM_CHUNKS to 20. The flow then proceeds to step 2504.

In step 2504, as shown in Figure 24, processor 100 writes the data matrix 2406 of data block 1 to weight random access memory 124 (e.g., data matrix 2406A of data block 1). The flow then proceeds to step 2506.

In step 2506, processor 100 writes the convolution program to the program memory 129 of neural network unit 121 using MTNN instructions 1400 that specify a function 1432 to write program memory 129. Processor 100 then starts the neural network unit convolution program using an MTNN instruction 1400 that specifies a function 1432 to start executing a program. An example of the neural network unit convolution program is described in more detail with respect to Figure 26A. The flow then proceeds to decision step 2508.

In decision step 2508, the framework program determines whether the value of variable N is less than NUM_CHUNKS. If so, the flow proceeds to step 2512; otherwise, it proceeds to step 2514.

In step 2512, as shown in Figure 24, processor 100 writes the data matrix 2406 of data block N+1 to weight random access memory 124 (e.g., data matrix 2406B of data block 2). Thus, while neural network unit 121 is convolving the current data block, the framework program writes the data matrix 2406 of the next data block to weight random access memory 124, so that once the convolution of the current data block is complete, i.e., its results have been written to weight random access memory 124, neural network unit 121 can immediately begin convolving the next data block.

In step 2514, processor 100 determines whether the currently running neural network unit program (started at step 2506 for data block 1, and at step 2518 for data blocks 2 to 20) has completed. Preferably, processor 100 makes this determination by executing an MFNN instruction 1500 to read the status buffer 127 of neural network unit 121. In another embodiment, neural network unit 121 generates an interrupt to indicate that it has completed the convolution program. The flow then proceeds to decision step 2516.

In decision step 2516, the framework program determines whether the value of variable N is less than NUM_CHUNKS. If so, the flow proceeds to step 2518; otherwise, it proceeds to step 2522.

In step 2518, processor 100 updates the convolution program so that it operates on data block N+1. More precisely, processor 100 updates the weight random access memory 124 column value of the initialize neural processing unit instruction at address 0 to the first column of the data matrix 2406 (e.g., to column 0 for data matrix 2406A or to column 500 for data matrix 2406B) and updates the output column (e.g., to column 900 or 1300). Processor 100 then starts the updated neural network unit convolution program. The flow then proceeds to step 2522.

In step 2522, processor 100 reads from weight random access memory 124 the results of the neural network unit convolution program for data block N. The flow then proceeds to decision step 2524.

In decision step 2524, the framework program determines whether the value of variable N is less than NUM_CHUNKS. If so, the flow proceeds to step 2526; otherwise, the flow ends.

In step 2526, the framework program increments N by one. The flow then returns to decision step 2508.
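The flow of Figure 25 can be traced with the sketch below, which records the order of the host-side actions instead of touching any hardware. The function name and event labels are hypothetical; the step numbers in the comments refer to the flowchart.

```python
def convolution_flow(num_chunks=20):
    """Return the ordered list of (action, data block) events of Figure 25."""
    events = [("write_kernel", None)]              # step 2502
    n = 1                                          # step 2502
    events.append(("write_block", n))              # step 2504
    events.append(("start", n))                    # step 2506
    while True:
        if n < num_chunks:                         # decision step 2508
            events.append(("write_block", n + 1))  # step 2512
        events.append(("wait", n))                 # step 2514
        if n < num_chunks:                         # decision step 2516
            events.append(("start", n + 1))        # step 2518
        events.append(("read", n))                 # step 2522
        if n < num_chunks:                         # decision step 2524
            n += 1                                 # step 2526
        else:
            return events
```

The trace makes the overlap visible: block N+1 is written while block N is being convolved, and the results of block N are read while block N+1 is being convolved.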

Figure 26A is a program listing of a neural network unit program that convolves a data matrix 2406 with the convolution kernel 2402 of Figure 24 and writes the result back to weight random access memory 124. The program loops a number of times through an instruction loop body made up of the instructions at addresses 1 to 9. The initialize neural processing unit instruction at address 0 specifies the number of times each neural processing unit 126 executes the loop body; in the example of Figure 26A the loop count is 400, corresponding to the number of columns in the data matrix 2406 of Figure 24. The loop instruction at the end of the loop (at address 10) decrements the current loop count and, if the result is non-zero, returns control to the top of the loop body (i.e., to the instruction at address 1). The initialize neural processing unit instruction also clears accumulator 202 to zero. Preferably, the loop instruction at address 10 also clears accumulator 202 to zero. Alternatively, as described above, the multiply-accumulate instruction at address 1 may clear accumulator 202 to zero.

On each pass through the loop body of the program, the 512 neural processing units 126 concurrently perform 512 convolutions of the 3x3 convolution kernel with 512 respective 3x3 sub-matrices of the data matrix 2406. A convolution is the sum of the nine products of the elements of the convolution kernel 2402 and the corresponding elements of the respective sub-matrix. In the embodiment of Figure 26A, the origin (central element) of each of the 512 respective 3x3 sub-matrices is data literal Dx+1,y+1 of Figure 24, where y (the row number) is the number of the neural processing unit 126 and x (the column number) is the current weight random access memory 124 column that is read by the multiply-accumulate instruction at address 1 of the program of Figure 26A (this column number is initialized by the initialize neural processing unit instruction at address 0, incremented when the multiply-accumulate instructions at addresses 3 and 5 are executed, and updated by the decrement instruction at address 9). Thus, on each pass through the loop, the 512 neural processing units 126 compute 512 convolutions and write the 512 results back to the specified column of weight random access memory 124. Edge handling is omitted here to simplify the description, but it should be noted that using the collective rotation feature of the neural processing units 126 causes two of the rows of the data matrix 2406 (i.e., of the image data matrix, in the case of image processing) to wrap around from one vertical edge to the other (e.g., from the left edge to the right edge, or vice versa). The loop body is now described.

Address 1 is a multiply-accumulate instruction that specifies column 0 of data random access memory 122 and implicitly uses the current column of weight random access memory 124, which is preferably held in sequencer 128 (and which is initialized to zero by the instruction at address 0 for the first pass through the loop body). That is, the instruction at address 1 causes each neural processing unit 126 to read its corresponding text from column 0 of data random access memory 122, to read its corresponding text from the current column of weight random access memory 124, and to perform a multiply-accumulate operation on the two texts. Thus, for example, neural processing unit 5 multiplies C0,0 and Dx,5 (where "x" is the current weight random access memory 124 column), adds the result to the accumulator 202 value 217, and writes the sum back to accumulator 202.

Address 2 is a multiply-accumulate instruction that specifies incrementing the data random access memory 122 column (i.e., to column 1) and then reading the column at the incremented address of data random access memory 122. The instruction also specifies rotating the value in the multitask buffer 705 of each neural processing unit 126 to the adjacent neural processing unit 126, which in this case is the column of data matrix 2406 values just read from weight random access memory 124 in response to the instruction at address 1. In the embodiment of Figures 24 to 26, the neural processing units 126 rotate the multitask buffer 705 values to the left, i.e., from neural processing unit J to neural processing unit J-1, rather than from neural processing unit J to neural processing unit J+1 as described above with respect to Figures 3, 7 and 19. It should be noted that, in an embodiment in which the neural processing units 126 rotate to the right, the framework program can write the convolution kernel 2402 coefficient values to data random access memory 122 in a different order (e.g., rotated about its central row) to achieve a similar convolution result. In addition, the framework program may perform additional pre-processing of the convolution kernel (e.g., transposition) as needed. The instruction also specifies a count value of 2. Thus, the instruction at address 2 causes each neural processing unit 126 to read its corresponding text from column 1 of data random access memory 122, to receive the rotated text into its multitask buffer 705, and to perform a multiply-accumulate operation on the two texts. Because the count value is 2, the instruction also causes each neural processing unit 126 to repeat this operation. That is, sequencer 128 increments the data random access memory 122 column address 123 (i.e., to column 2), and each neural processing unit 126 reads its corresponding text from column 2 of data random access memory 122, receives the rotated text into its multitask buffer 705, and performs a multiply-accumulate operation on the two texts. Thus, for example, assuming the current weight random access memory 124 column is 27, after executing the instruction at address 2, neural processing unit 5 has accumulated into its accumulator 202 the product of C0,1 and D27,6 and the product of C0,2 and D27,7. Hence, after completion of the instructions at addresses 1 and 2, the product of C0,0 and D27,5, the product of C0,1 and D27,6, and the product of C0,2 and D27,7 have been accumulated into accumulator 202, along with any accumulated values from previous passes through the loop body.

The instructions at addresses 3 and 4 perform operations similar to those of the instructions at addresses 1 and 2. However, by virtue of the weight random access memory 124 column increment indicator, they operate on the next column of weight random access memory 124, and they operate on the next three columns of data random access memory 122, i.e., columns 3 to 5. That is, taking neural processing unit 5 as an example, after completion of the instructions at addresses 1 to 4, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, and the product of C1,2 and D28,7 have been accumulated into accumulator 202, along with any accumulated values from previous passes through the loop body.

The instructions at addresses 5 and 6 perform operations similar to those of the instructions at addresses 3 and 4, but on the next column of weight random access memory 124 and on the next three columns of data random access memory 122, i.e., columns 6 to 8. That is, taking neural processing unit 5 as an example, after completion of the instructions at addresses 1 to 6, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, the product of C1,2 and D28,7, the product of C2,0 and D29,5, the product of C2,1 and D29,6, and the product of C2,2 and D29,7 have been accumulated into accumulator 202, along with any accumulated values from previous passes through the loop body. In other words, after completion of the instructions at addresses 1 to 6, and assuming the weight random access memory 124 column was 27 at the start of the loop body, neural processing unit 5, for example, has convolved the convolution kernel 2402 with the following 3x3 sub-matrix:

D27,5  D27,6  D27,7
D28,5  D28,6  D28,7
D29,5  D29,6  D29,7

More generally, after the instructions at addresses 1 to 6 complete, each of the 512 neural processing units 126 will have used the convolution kernel 2402 to convolve the following 3x3 submatrix:

Dr,n Dr,n+1 Dr,n+2
Dr+1,n Dr+1,n+1 Dr+1,n+2
Dr+2,n Dr+2,n+1 Dr+2,n+2

where r is the weight random access memory 124 column address value at the start of the instruction loop, and n is the number of the neural processing unit 126.
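The multiply-accumulate-and-rotate sequence described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the neural network unit program of Figure 26A: the function name `convolve_pass`, the list-of-lists layout, and the assumption that every accumulator starts at zero are all hypothetical. Each inner list of `data` stands for one column of weight random access memory 124, and the rotation moves the word of neural processing unit J to neural processing unit J-1, wrapping at the ends.

```python
def convolve_pass(data, kernel, r):
    """Model one pass of the instruction loop; returns one accumulator per NPU.

    data[c][n] is the word D(c, n); kernel[i][j] is coefficient C(i, j).
    Assumption: accumulators start at zero for this pass.
    """
    num_npus = len(data[r])
    acc = [0] * num_npus
    for i, kernel_row in enumerate(kernel):
        # Each NPU n first reads its own word D(r+i, n).
        words = list(data[r + i])
        for coeff in kernel_row:
            for n in range(num_npus):
                acc[n] += coeff * words[n]
            # Rotate left: the word held by NPU J moves to NPU J-1.
            words = words[1:] + words[:1]
    return acc


# Small illustrative input: 5 columns of 8 words, Gaussian-blur-style kernel.
data = [[(c * 7 + w * 3) % 11 for w in range(8)] for c in range(5)]
kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
accumulators = convolve_pass(data, kernel, r=1)
```

After the call, `accumulators[n]` holds the sum of `kernel[i][j] * data[r + i][n + j]` over the 3x3 window, with the word index wrapping modulo the number of neural processing units.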

The instruction at address 7 passes the accumulator 202 value 217 through the activation function unit 212. The pass-through function passes through a word whose size (in bits) equals that of the words read from data random access memory 122 and weight random access memory 124 (16 bits in this example). Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail in the following sections. Alternatively, rather than specifying a pass-through activation function, the instruction may specify a division activation function that divides the accumulator 202 value 217 by a divisor, as described herein with respect to Figure 29A and Figure 30, e.g., using one of the "dividers" 3014/3016 of Figure 30. For example, for a convolution kernel 2402 with fractional coefficients, such as the one-sixteenth coefficients of the Gaussian blur kernel described above, the instruction at address 7 may specify a division activation function (e.g., divide by 16) rather than a pass-through function. Alternatively, the architectural program may perform the division by 16 on the convolution kernel 2402 coefficients before writing them to data random access memory 122, and adjust the position of the binary point of the convolution kernel 2402 values accordingly, e.g., using the data binary point 2922 of Figure 29 described below.

The instruction at address 8 writes the output of the activation function unit 212 to the column of weight random access memory 124 specified by the current value of the output column register. This value is initialized by the instruction at address 0 and is incremented on each pass through the loop by virtue of the increment indicator in the instruction.

As may be determined from the example of Figures 24 to 26 with a 3x3 convolution kernel 2402, the neural processing units 126 read weight random access memory 124 approximately every three clock cycles to read a column of the data matrix 2406, and write weight random access memory 124 approximately every 12 clock cycles to write the convolution result matrix. Additionally, assuming an embodiment that includes a write and read buffer such as the buffer 1704 of Figure 17, the processor 100 reads and writes weight random access memory 124 concurrently with the reads and writes of the neural processing units 126, such that the buffer 1704 performs one write and one read of weight random access memory 124 approximately every 16 clock cycles, to write the data matrices and to read the convolution result matrices, respectively. Thus, approximately half the bandwidth of weight random access memory 124 is consumed by the hybrid manner in which neural network unit 121 performs the convolution operation. Although this example includes a 3x3 convolution kernel 2402, the invention is not limited thereto; convolution kernels of other sizes, such as 2x2, 4x4, 5x5, 6x6, 7x7 and 8x8, may be employed with different neural network unit programs. With a larger convolution kernel, the rotating versions of the multiply-accumulate instruction (such as the instructions at addresses 2, 4 and 6 of Figure 26A, which a larger convolution kernel would also require) have a larger count value, so the neural processing units 126 spend a smaller percentage of the time reading weight random access memory 124, and a smaller percentage of the weight random access memory 124 bandwidth is consumed.
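The "approximately half" figure can be checked with simple arithmetic. The sketch below is only an estimate under stated assumptions: that weight random access memory 124 can service one access per clock cycle and that each read or write occupies exactly one cycle. The function name `weight_ram_utilization` is hypothetical.

```python
from fractions import Fraction


def weight_ram_utilization(npu_read_period, npu_write_period, buffer_period):
    """Fraction of clock cycles in which the weight RAM is accessed.

    Assumption: one access per clock cycle, each access takes one cycle.
    """
    npu_reads = Fraction(1, npu_read_period)      # one read every 3 cycles
    npu_writes = Fraction(1, npu_write_period)    # one write every 12 cycles
    buffer_accesses = Fraction(2, buffer_period)  # one read plus one write every 16 cycles
    return npu_reads + npu_writes + buffer_accesses


utilization = weight_ram_utilization(3, 12, 16)
print(utilization, float(utilization))  # 13/24, a little over one half
```

With the periods quoted for the 3x3 example, the total is 1/3 + 1/12 + 1/8 = 13/24 of the available cycles.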

Alternatively, rather than writing the convolution results back to different columns of weight random access memory 124 (e.g., columns 900-1299 and 1300-1699), the architectural program may configure the neural network unit program to overwrite columns of the input data matrix 2406 that are no longer needed. For example, in the case of a 3x3 convolution kernel, the architectural program may write the data matrix 2406 into columns 2-401 of weight random access memory 124 rather than columns 0-399, and the neural network unit program may write the convolution results beginning at column 0 of weight random access memory 124, incrementing the column number on each pass through the instruction loop. In this way, the neural network unit program overwrites only columns that are no longer needed. For example, after the first pass through the instruction loop (or, more precisely, after execution of the instruction at address 1, which loads column 0 of weight random access memory 124), the data in column 0 may be overwritten, whereas the data in columns 1-3 is needed for the second pass through the instruction loop and must not be overwritten; similarly, after the second pass through the instruction loop, the data in column 1 may be overwritten, whereas the data in columns 2-4 is needed for the third pass through the instruction loop and must not be overwritten; and so on. In such an embodiment, the height of each data matrix 2406 (data block) may be larger (e.g., 800 columns), so that fewer data blocks are needed.

Alternatively, rather than writing the convolution results back to weight random access memory 124, the architectural program may configure the neural network unit program to write them back to columns of data random access memory 122 above the convolution kernel 2402 (e.g., above column 8). As neural network unit 121 writes the results, the architectural program reads them from data random access memory 122 (e.g., using the address of the most recently written data random access memory 122 column 2606 of Figure 26B). This configuration is suited to an embodiment with a single-ported weight random access memory 124 and a dual-ported data random access memory 122.

As may be observed from the operation of neural network unit 121 in the embodiment of Figures 24 to 26A, each execution of the program of Figure 26A takes approximately 5,000 clock cycles. Consequently, convolving the entire 2560x1600 data array 2404 of Figure 24 takes approximately 100,000 clock cycles, considerably fewer than the number of clock cycles required to perform the same task by conventional means.
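The 100,000-cycle figure follows from the block decomposition of the data array. The following sketch reproduces the arithmetic, assuming the 2560x1600 data array 2404 is processed as 512x400 data blocks and that each block costs one program execution of roughly 5,000 clock cycles; the function name `estimate_cycles` is hypothetical.

```python
def estimate_cycles(array_width, array_height, block_width, block_height, cycles_per_block):
    """Return (number of data blocks, estimated total clock cycles)."""
    blocks = (array_width // block_width) * (array_height // block_height)
    return blocks, blocks * cycles_per_block


blocks, total = estimate_cycles(2560, 1600, 512, 400, 5000)
print(blocks, total)  # 20 blocks, 100000 clock cycles
```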

Figure 26B is a block diagram illustrating certain fields of the status register 127 of neural network unit 121 of Figure 1 according to one embodiment. The status register 127 includes a field 2602 that indicates the address of the column of weight random access memory 124 most recently written by the neural processing units 126; a field 2606 that indicates the address of the column of data random access memory 122 most recently written by the neural processing units 126; a field 2604 that indicates the address of the column of weight random access memory 124 most recently read by the neural processing units 126; and a field 2608 that indicates the address of the column of data random access memory 122 most recently read by the neural processing units 126. This enables the architectural program executing on processor 100 to determine the progress of neural network unit 121 as it reads and/or writes data random access memory 122 and/or weight random access memory 124. Using this capability, together with the option described above of overwriting the input data matrix (or of writing the results to data random access memory 122), the data array 2404 of Figure 24 may be processed as five 512x1600 data blocks rather than twenty 512x400 data blocks, as in the following example. Processor 100 writes the first 512x1600 data block into weight random access memory 124 starting at column 2 and starts the neural network unit program (which has a loop count of 1600 and initializes the weight random access memory 124 output column to 0). As neural network unit 121 executes the neural network unit program, processor 100 monitors the weight random access memory 124 output location/address in order to (1) read (using MFNN instructions 1500) the columns of weight random access memory 124 that hold valid convolution results written by neural network unit 121 (beginning at column 0); and (2) overwrite the valid convolution results, once they have been read, with the second 512x1600 data matrix 2406 (beginning at column 2), so that when neural network unit 121 completes the neural network unit program on the first 512x1600 data block, processor 100 can immediately update the neural network unit program as needed and restart it to process the second 512x1600 data block. This process is repeated three more times for the remaining three 512x1600 data blocks, so that neural network unit 121 is kept fully utilized.

In one embodiment, the activation function unit 212 is capable of efficiently performing an effective division of the accumulator 202 value 217, as described in more detail in the following sections, particularly with respect to Figures 29A, 29B and 30. For example, an activation function neural network unit instruction that divides the accumulator 202 value by 16 may be used for the Gaussian blur matrix described herein.

Although the convolution kernel 2402 used in the example of Figure 24 is a small static convolution kernel applied to the entire data array 2404, the invention is not limited thereto. The convolution kernel may instead be a large matrix with specific weights corresponding to the different data values of the data array 2404, as is common in convolutional neural networks. When neural network unit 121 is used in this manner, the architectural program may swap the locations of the data matrix and the convolution kernel, i.e., place the data matrix in data random access memory 122 and the convolution kernel in weight random access memory 124, and the number of columns processed by a given execution of the neural network unit program may be relatively smaller.

Figure 27 is a block diagram illustrating an example of weight random access memory 124 of Figure 1 populated with input data on which neural network unit 121 of Figure 1 performs a pooling operation. A pooling operation, performed by a pooling layer of an artificial neural network, reduces the dimensions of a matrix of input data (e.g., an image or a convolved image) by taking subregions, or submatrices, of the input matrix and computing either the maximum or the average value of each submatrix; these maximum or average values form a result matrix, or pooled matrix. In the example of Figures 27 and 28, the pooling operation computes the maximum value of each submatrix. Pooling operations are particularly useful in artificial neural networks that perform, e.g., object classification or detection. In general, a pooling operation effectively reduces the size of the input matrix by a factor equal to the number of elements in the examined submatrix; in particular, it reduces the input matrix in each dimension by the number of elements in the corresponding dimension of the submatrix. In the example of Figure 27, the input data is a 512x1600 matrix of wide words (e.g., 16 bits) stored in columns 0 to 1599 of weight random access memory 124. In Figure 27, the words are denoted by their column and row location: the word at column 0, row 0 is denoted D0,0; the word at column 0, row 1 is denoted D0,1; the word at column 0, row 2 is denoted D0,2; and so on, to the word at column 0, row 511, which is denoted D0,511. Similarly, the word at column 1, row 0 is denoted D1,0; the word at column 1, row 1 is denoted D1,1; the word at column 1, row 2 is denoted D1,2; and so on, to the word at column 1, row 511, which is denoted D1,511; and so on, to the word at column 1599, row 0, which is denoted D1599,0; the word at column 1599, row 1 is denoted D1599,1; the word at column 1599, row 2 is denoted D1599,2; and so on, to the word at column 1599, row 511, which is denoted D1599,511.

Figure 28 is a program listing of a neural network unit program that performs a pooling operation on the input data matrix of Figure 27 and writes the result back to weight random access memory 124. In the example of Figure 28, the pooling operation computes the maximum value of each 4x4 submatrix of the input data matrix. The program executes multiple times an instruction loop made up of the instructions at addresses 1 to 10. The initialize neural processing unit instruction at address 0 specifies the number of times each neural processing unit 126 executes the instruction loop, which is 400 in the example of Figure 28, and the loop instruction at the end of the loop (at address 11) decrements the current loop count and, if the result is non-zero, returns control to the top of the instruction loop (i.e., to the instruction at address 1). The neural network unit program effectively treats the input data matrix in weight random access memory 124 as 400 mutually exclusive groups of four adjacent columns, namely columns 0-3, columns 4-7, columns 8-11, and so on to columns 1596-1599. Each group of four adjacent columns contains 128 4x4 submatrices, namely the 4x4 submatrices formed by the intersection of the four columns of the group with four adjacent rows, namely rows 0-3, rows 4-7, rows 8-11, and so on to rows 508-511. Of the 512 neural processing units 126, every fourth one (i.e., 128 in total) performs the pooling operation on a respective 4x4 submatrix, and the other three neural processing units 126 of each group of four are unused. More precisely, neural processing units 0, 4, 8, and so on to neural processing unit 508 each perform the pooling operation on their respective 4x4 submatrix, whose leftmost row number corresponds to the neural processing unit number and whose lower column corresponds to the current weight random access memory 124 column value, which is initialized to zero by the initialize instruction at address 0 and is incremented by four on each iteration of the instruction loop, as described in more detail below. The 400 iterations of the instruction loop correspond to the number of groups of 4x4 submatrices in the input data matrix of Figure 27 (i.e., the 1600 columns of the input data matrix divided by 4). The initialize neural processing unit instruction also clears accumulator 202 to zero. Preferably, the loop instruction at address 11 also clears accumulator 202 to zero. Alternatively, the maxwacc instruction at address 1 specifies that accumulator 202 be cleared to zero.

Each time the instruction loop of the program executes, the 128 neural processing units 126 in use concurrently perform 128 pooling operations on the 128 respective 4x4 submatrices of the current four-column group of the input data matrix. More specifically, the pooling operation determines the maximum-valued element of the sixteen elements of the 4x4 submatrix. In the embodiment of Figure 28, for each neural processing unit y of the 128 neural processing units 126 in use, the lower-left element of its 4x4 submatrix is element Dx,y of Figure 27, where x is the current weight random access memory 124 column number at the start of the instruction loop. This column of data is read by the maxwacc instruction at address 1 of the program of Figure 28 (the column number is initialized by the initialize neural processing unit instruction at address 0 and is incremented each time the maxwacc instructions at addresses 3, 5 and 7 execute). Thus, on each loop of the program, the 128 neural processing units 126 in use write back to the specified column of weight random access memory 124 the maximum-valued elements of the 128 respective 4x4 submatrices of the current column group. The instruction loop is described below.

The maxwacc instruction at address 1 implicitly uses the current weight random access memory 124 column, which is preferably held in the sequencer 128 (and which is initialized to zero by the instruction at address 0 for the first pass through the instruction loop). The instruction at address 1 causes each neural processing unit 126 to read its corresponding word from the current column of weight random access memory 124, compare the word with the accumulator 202 value 217, and store the larger of the two values in accumulator 202. Thus, for example, neural processing unit 8 determines the maximum of the accumulator 202 value 217 and data word Dx,8 (where "x" is the current weight random access memory 124 column) and writes it back to accumulator 202.

The instruction at address 2 is a maxwacc instruction that specifies that the value in the multiplexed register 705 of each neural processing unit 126 be rotated to the adjacent neural processing unit 126; in this case, that value is the column of input data array values just read from weight random access memory 124 in response to the instruction at address 1. In the embodiment of Figures 27 and 28, the neural processing units 126 rotate the multiplexed register 705 values to the left, i.e., from neural processing unit J to neural processing unit J-1, as described above in the sections corresponding to Figures 24 to 26. Additionally, the instruction specifies a count value of 3. Thus, the instruction at address 2 causes each neural processing unit 126 to receive the rotated word into its multiplexed register 705 and determine the maximum of the rotated word and the accumulator 202 value, and then to repeat this operation two more times. That is, each neural processing unit 126 three times receives the rotated word into its multiplexed register 705 and determines the maximum of the rotated word and the accumulator 202 value. Thus, for example, assuming the current weight random access memory 124 column at the start of the instruction loop is 36, after the instructions at addresses 1 and 2 execute, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 value at the start of the loop and the four weight random access memory 124 words D36,8, D36,9, D36,10 and D36,11.

The maxwacc instructions at addresses 3 and 4 perform operations similar to those of the instructions at addresses 1 and 2, but, by virtue of the weight random access memory 124 column increment indicator, on the next column of weight random access memory 124. That is, assuming the current weight random access memory 124 column at the start of the instruction loop is 36, after the instructions at addresses 1 to 4 complete, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 value at the start of the loop and the eight weight random access memory 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10 and D37,11.

The maxwacc instructions at addresses 5 to 8 perform operations similar to those of the instructions at addresses 1 to 4, but on the next two columns of weight random access memory 124. That is, assuming the current weight random access memory 124 column at the start of the instruction loop is 36, after the instructions at addresses 1 to 8 complete, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 value at the start of the loop and the sixteen weight random access memory 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10, D37,11, D38,8, D38,9, D38,10, D38,11, D39,8, D39,9, D39,10 and D39,11. In other words, neural processing unit 8 will have determined the maximum value of the following 4x4 submatrix:

D36,8 D36,9 D36,10 D36,11
D37,8 D37,9 D37,10 D37,11
D38,8 D38,9 D38,10 D38,11
D39,8 D39,9 D39,10 D39,11

More generally, after the instructions at addresses 1 to 8 complete, each of the 128 neural processing units 126 in use will have determined the maximum value of the following 4x4 submatrix:

Dr,n Dr,n+1 Dr,n+2 Dr,n+3
Dr+1,n Dr+1,n+1 Dr+1,n+2 Dr+1,n+3
Dr+2,n Dr+2,n+1 Dr+2,n+2 Dr+2,n+3
Dr+3,n Dr+3,n+1 Dr+3,n+2 Dr+3,n+3

where r is the current weight random access memory 124 column address value at the start of the instruction loop, and n is the number of the neural processing unit 126.
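The maximum-pooling loop described above can be sketched in Python as follows. This is a behavioral model only, not the program of Figure 28: the name `pool_max_pass` and the data layout are hypothetical, and the model assumes the accumulator is cleared before the first comparison (represented here by loading the first word directly) so that negative inputs are handled. Rotation moves the word of neural processing unit J to neural processing unit J-1.

```python
def pool_max_pass(data, r, size=4):
    """Model one pass of the pooling loop over columns r .. r+size-1.

    data[c][n] is the word D(c, n). Returns the results held by every
    size-th NPU (NPU 0, size, 2*size, ...), one per submatrix.
    """
    num_npus = len(data[r])
    acc = [None] * num_npus  # None models a freshly cleared accumulator
    for i in range(size):
        words = list(data[r + i])  # each NPU n reads D(r+i, n)
        for _ in range(size):
            for n in range(num_npus):
                acc[n] = words[n] if acc[n] is None else max(acc[n], words[n])
            # Rotate left: the word held by NPU J moves to NPU J-1.
            words = words[1:] + words[:1]
    # Only every size-th NPU holds the maximum of a distinct submatrix.
    return [acc[n] for n in range(0, num_npus, size)]


# Small illustrative input: 4 columns of 8 words, giving two 4x4 submatrices.
data = [[(c * 5 + w * 3) % 17 for w in range(8)] for c in range(4)]
maxima = pool_max_pass(data, r=0)
```

The three neural processing units between each pair of used units also end up with a value, but it spans two neighboring submatrices, which is why the text treats those result words as invalid.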

The instruction at address 9 passes the accumulator 202 value 217 through the activation function unit 212. The pass-through function passes through a word whose size (in bits) equals that of the words read from weight random access memory 124 (16 bits in this example). Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail in the following sections.

The instruction at address 10 writes the accumulator 202 value 217 to the column of weight random access memory 124 specified by the current value of the output column register. This value is initialized by the instruction at address 0 and is incremented on each pass through the loop by virtue of the increment indicator in the instruction. More specifically, the instruction at address 10 writes a wide word (e.g., 16 bits) of accumulator 202 to weight random access memory 124. Preferably, the instruction writes the 16 bits as specified by the output binary point 2916, as described in more detail below with respect to Figures 29A and 29B.

As described above, each column written to weight random access memory 124 by an iteration of the instruction loop contains holes that hold invalid values. That is, wide words 1 to 3, 5 to 7, 9 to 11, and so on to wide words 509 to 511 of the result 133 are invalid, or unused. In one embodiment, the activation function unit 212 includes a multiplexer that enables packing of the results into adjacent words of a column buffer, such as the column buffer 1104 of Figure 11, for writing back to the output weight random access memory 124 column. Preferably, the activation function instruction specifies the number of words in each hole, and the number of words in the hole is used to control the multiplexer that packs the results. In one embodiment, the number of holes may be specified as a value from 2 to 6, in order to pack the output of pooling 3x3, 4x4, 5x5, 6x6 or 7x7 submatrices. Alternatively, the architectural program executing on processor 100 reads the resulting sparse (i.e., containing holes) result columns from weight random access memory 124 and performs the packing function using other execution units 112, such as a media unit using architectural pack instructions, e.g., x86 Streaming SIMD Extensions (SSE) instructions. In a manner similar to the concurrent operation described above, and exploiting the hybrid nature of neural network unit 121, the architectural program executing on processor 100 may read the status register 127 to monitor the most recently written column of weight random access memory 124 (e.g., field 2602 of Figure 26B), read a sparse result column as it is produced, pack it, and write it back to the same column of weight random access memory 124, so that it is ready to be used as an input data matrix for the next layer of the neural network, such as a convolution layer or a classic neural network layer (i.e., a multiply-accumulate layer). Furthermore, although the embodiment described here performs the pooling operation on 4x4 submatrices, the invention is not limited thereto; the neural network unit program of Figure 28 may be modified to perform the pooling operation on submatrices of other sizes, such as 3x3, 5x5, 6x6 or 7x7.
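The packing step can be illustrated with a short Python sketch. It is a software model only, with the hypothetical name `pack_results`; it assumes that valid results sit at word 0 and at every (hole size + 1)-th word thereafter, as in the 4x4 example where words 0, 4, 8, and so on are valid and each hole is three words wide.

```python
def pack_results(sparse_column, hole_words):
    """Gather the valid words of a sparse result column into adjacent words.

    hole_words is the number of invalid words following each valid word
    (3 when pooling 4x4 submatrices).
    """
    stride = hole_words + 1
    return sparse_column[::stride]


# Illustrative sparse column: valid results at words 0, 4 and 8.
sparse = [17, None, None, None, 42, None, None, None, 8, None, None, None]
packed = pack_results(sparse, hole_words=3)
print(packed)  # [17, 42, 8]
```

An architectural program that performs the packing itself would apply the same gather to each sparse column it reads before writing the packed column back.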

As may be observed from the foregoing, the number of result columns written to weight random access memory 124 is one quarter of the number of columns of the input data matrix. Finally, data random access memory 122 is not used in this example. However, data random access memory 122, rather than weight random access memory 124, may alternatively be used to perform the pooling operation.

In the embodiment of Figures 27 and 28, the pooling operation computes the maximum value of each subregion. However, the program of Figure 28 may be modified to compute the average value of each subregion by replacing the maxwacc instructions with sumwacc instructions (which add the weight word to the accumulator 202 value 217) and by changing the activation function instruction at address 9 to divide the accumulated result by the number of elements in each subregion (preferably by reciprocal multiplication, as described below), which is sixteen in this example.
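A minimal sketch of average pooling by reciprocal multiplication is given below. It is an illustration under stated assumptions rather than the hardware's actual scheme: the reciprocal is held as an unsigned fixed-point value with 16 fractional bits, the final shift truncates rather than rounds, and the names `reciprocal_fixed`, `divide_by_reciprocal` and `pool_average` are hypothetical.

```python
def reciprocal_fixed(divisor, frac_bits=16):
    """Fixed-point representation of 1/divisor with frac_bits fractional bits."""
    return round((1 << frac_bits) / divisor)


def divide_by_reciprocal(acc, recip, frac_bits=16):
    """Approximate acc / divisor as (acc * recip) shifted right by frac_bits."""
    return (acc * recip) >> frac_bits


def pool_average(values, frac_bits=16):
    """Sum the values (sumwacc-style), then divide by their count via the reciprocal."""
    acc = sum(values)
    recip = reciprocal_fixed(len(values), frac_bits)
    return divide_by_reciprocal(acc, recip, frac_bits)


print(pool_average([32] * 16))        # 32
print(pool_average(list(range(16))))  # 7 (120 / 16 = 7.5, truncated)
```

For a divisor of 16 the reciprocal is exactly representable (4096 with 16 fractional bits), so the only error in this sketch comes from truncating the quotient.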

As may be observed from the operation of neural network unit 121 according to Figures 27 and 28, each execution of the program of Figure 28 takes approximately 6,000 clock cycles to perform a pooling operation on the entire 512x1600 data matrix of Figure 27, considerably fewer than the number of clock cycles required to perform a similar task by conventional means.

Alternatively, rather than writing the results back to weight random access memory 124, the architectural program may configure the neural network unit program to write the pooling results back to columns of data random access memory 122. As neural network unit 121 writes the results to data random access memory 122, the architectural program reads them from data random access memory 122 (e.g., using the address of the most recently written data random access memory 122 column 2606 of Figure 26B). This configuration is suited to an embodiment with a single-ported weight random access memory 124 and a dual-ported data random access memory 122.

Fixed-point arithmetic with user-supplied binary points, full-precision fixed-point accumulation, user-specified reciprocal value, stochastic rounding of the accumulator value, and selectable activation/output functions

Generally speaking, hardware units that perform arithmetic in digital computing devices are commonly divided into "integer" units and "floating-point" units, according to whether they perform arithmetic operations on integers or on floating-point numbers. A floating-point number has a magnitude (or mantissa) and an exponent, and typically a sign. The exponent indicates the location of the radix point (typically the binary point) relative to the magnitude. In contrast, an integer has no exponent, but only a magnitude, and typically a sign. A floating-point unit enables a programmer to work with numbers that take on values across an enormously large range, and the hardware takes care of adjusting the exponent values of the numbers as needed, without the programmer having to do so. For example, assume the two floating-point numbers 0.111 x 10²⁹ and 0.81 x 10³¹ are multiplied. (A decimal, or base-10, example is used here, although floating-point units most commonly work with base-2 floating-point numbers.) The floating-point unit automatically takes care of multiplying the mantissas, adding the exponents, and then normalizing the result back to a value of 0.8991 x 10⁵⁹. As another example, assume the same two floating-point numbers are added. The floating-point unit automatically takes care of aligning the radix points of the mantissas before adding them, to generate a sum with a value of 0.81111 x 10³¹.
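The two decimal examples above can be verified with exact rational arithmetic. The sketch below uses Python's standard `fractions` module; the helper `normalize` is a hypothetical name, handles positive values only, and mimics the normalization step by scaling the mantissa into the range from 0.1 (inclusive) to 1 (exclusive).

```python
from fractions import Fraction


def normalize(value):
    """Return (mantissa, exponent) with 0.1 <= mantissa < 1 in base 10.

    Assumption: value is a positive Fraction.
    """
    exponent = 0
    while value >= 1:
        value /= 10
        exponent += 1
    while value < Fraction(1, 10):
        value *= 10
        exponent -= 1
    return value, exponent


a = Fraction(111, 1000) * 10**29  # 0.111 x 10^29
b = Fraction(81, 100) * 10**31    # 0.81 x 10^31

product = normalize(a * b)  # mantissa 8991/10000, exponent 59
total = normalize(a + b)    # mantissa 81111/100000, exponent 31
```

The product works out to 0.8991 x 10⁵⁹ because 0.111 x 0.81 = 0.08991, which normalization shifts by one decimal place; the sum is 0.81111 x 10³¹ because 0.111 x 10²⁹ equals 0.00111 x 10³¹ once the radix points are aligned.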

However, as is well known, this complexity leads to floating-point units that are larger, consume more power, require more clock cycles per instruction and/or have longer cycle times. For this reason, many devices (e.g., embedded processors, microcontrollers, and relatively low-cost and/or low-power microprocessors) do not include a floating-point unit. As the examples above illustrate, the complex structure of a floating-point unit includes logic that performs the exponent calculations associated with floating-point addition and multiplication/division (i.e., adders that add or subtract the operand exponents to produce the resulting exponent of a floating-point multiplication/division, and subtracters that subtract the operand exponents to determine the binary point alignment shift amount for a floating-point addition), shifters that accomplish the binary point alignment of the mantissas for floating-point addition, and shifters that normalize floating-point results. Additionally, a floating-point unit typically requires logic to round floating-point results, logic to convert between integer and floating-point formats and between different floating-point formats (e.g., extended precision, double precision, single precision, half precision), leading zero and leading one detectors, and logic to handle special floating-point numbers, such as denormal numbers, NaNs (not-a-number values) and infinity.

Furthermore, correctness verification of a floating-point unit becomes significantly more complex because of the larger number space over which the design must be verified, which may lengthen the product development cycle and time to market. Additionally, as described above, floating-point arithmetic requires storing and using separate mantissa and exponent fields for each floating-point number involved in the computation, which may increase the amount of storage required and/or reduce precision relative to storing integers in the same amount of storage. Many of these disadvantages are avoided by using integer units to perform arithmetic operations.

Programmers often need to write programs that process fractional numbers, that is, numbers that are not whole numbers. Such a program may need to run on a processor that has no floating-point unit, or on a processor that has one but whose integer units execute integer instructions faster. To exploit the performance advantage of integer units, programmers apply the well-known technique of fixed-point arithmetic to fixed-point numbers. Such a program consists of instructions that execute on integer units and process integers, or integer data. The software knows that the data is fractional, and it includes instructions that operate on the integer data to account for that fact, such as alignment shifts. In essence, the fixed-point software manually performs some or all of the functions that a floating-point unit would perform.

As used herein, a "fixed-point" number (or value, operand, input, or output) is a number whose storage bits are understood to include bits that represent a fractional part of the number; these bits are referred to herein as "decimal places" (fractional bits). The storage bits of a fixed-point number are held in a memory or buffer, for example as an 8-bit or 16-bit word. All of the storage bits of the fixed-point number are used to express a magnitude, and in some cases one bit is used to express a sign, but none of the storage bits is used to express an exponent. In addition, the number of fractional bits, or binary point position, of the fixed-point number is specified in storage that is separate from the storage bits of the number itself. That storage indicates, in a shared or global manner, the number of fractional bits, or binary point position, for the whole set of fixed-point numbers to which the number belongs, such as the set of input operands, the set of accumulated values, or the set of output results of an array of processing units.
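The definition above can be illustrated with a minimal Python sketch. It is only an illustration: the names `FRACTIONAL_BITS`, `to_fixed`, and `to_real` are hypothetical and do not appear in this specification. The stored integers carry no exponent; a single shared count of fractional bits gives the whole set its meaning.

```python
# Hypothetical names for illustration only.
FRACTIONAL_BITS = 4  # one shared binary point position for the whole set


def to_fixed(value, fractional_bits=FRACTIONAL_BITS):
    """Encode a real value as a plain integer with an implied binary point."""
    return round(value * (1 << fractional_bits))


def to_real(stored, fractional_bits=FRACTIONAL_BITS):
    """Interpret a stored integer using the shared fractional-bit count."""
    return stored / (1 << fractional_bits)


# The stored words are ordinary integers; none of their bits is an exponent.
stored = [to_fixed(v) for v in (1.5, -0.25, 3.0625)]
```

Changing `FRACTIONAL_BITS` reinterprets every number in the set at once, which is the sense in which the binary point is specified globally rather than per number.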

In the embodiments described here, the arithmetic logic units are integer units, while the run function units include fixed-point arithmetic hardware assist, or acceleration. This allows the arithmetic logic unit portion to be smaller and faster, so that more arithmetic logic units fit in a given chip area. That in turn means more neurons per unit of chip area, which is particularly advantageous for a neural network unit.

In addition, in contrast to floating-point numbers, each of which requires exponent storage bits, the fixed-point numbers in the embodiments described herein use one indication of the number of storage bits that are fractional bits for an entire set of numbers. This indication resides in a single, shared storage location and globally gives the number of fractional bits for all the numbers of the set, such as the set of inputs to a series of operations, the set of accumulated values of the series, or the set of outputs. Preferably, the user of the neural network unit can specify the number of fractional storage bits for the set of numbers. It should therefore be understood that, although in many contexts (such as ordinary mathematics) the term "integer" refers to a signed whole number, that is, a number without a fractional part, in the present context the term "integer" can refer to a number that has a fractional part. Furthermore, in the present context the term "integer" is intended to distinguish these numbers from floating-point numbers, for which some of the bits of each number's storage are used to express an exponent. Similarly, an integer arithmetic operation, such as an integer multiply, add, or compare performed by an integer unit, assumes that the operands have no exponent. Consequently, the integer elements of the integer unit, such as the integer multiplier, integer adder, and integer comparator, need no logic to handle exponents: they do not shift mantissas to align binary points for addition or comparison, and they do not add exponents for multiplication.

In addition, the embodiments described herein include a large hardware integer accumulator that accumulates a long series of integer operations (for example, on the order of 1000 multiply-accumulate operations) without loss of precision. This lets the neural network unit avoid handling floating-point numbers while still keeping the accumulated values at full precision, without having to saturate them or produce inaccurate results because of overflow. Once the series of integer operations has accumulated a result into the full-precision accumulator, the fixed-point hardware assist performs the necessary scaling and saturation to convert the full-precision accumulated value into an output value, using the user-specified indication of the number of fractional bits of the accumulated value and the desired number of fractional bits of the output value, as described in more detail in later sections.

When the accumulated value must be compressed from its full-precision form, either for use as an input to a run function or to be passed through, the run function unit preferably can selectively perform random (stochastic) rounding on it, as described in more detail in later sections. Finally, according to the differing needs of a given layer of the neural network, the neural processing units can be selectively instructed to apply different run functions and/or to output the accumulated value in a variety of forms.

Figure 29A is a block diagram showing an embodiment of the control buffer 127 of Fig. 1. The control buffer 127 may comprise multiple control buffers 127. As shown, the control buffer 127 includes the following fields: configuration 2902, signed data 2912, signed weight 2914, data binary point 2922, weight binary point 2924, arithmetic logic unit function 2926, rounding control 2932, run function 2934, reciprocal 2942, offset 2944, output random access memory 2952, output binary point 2954, and output order 2956. The control buffer 127 values can be written both by an MTNN instruction 1400 and by an instruction of an NNU program, such as an initialization instruction.

The configuration 2902 value specifies whether neural network unit 121 is in the narrow configuration, the wide configuration, or the funnel configuration, as described above. Configuration 2902 also determines the size of the input words received from data random access memory 122 and weight random access memory 124. In the narrow and funnel configurations the input words are narrow (for example, 8 or 9 bits), whereas in the wide configuration the input words are wide (for example, 12 or 16 bits). In addition, configuration 2902 determines the size of the output result 133, which is the same as the input word size.

When the signed data 2912 value is true, the data words received from data random access memory 122 are signed values; when it is false, the data words are unsigned values. When the signed weight 2914 value is true, the weight words received from weight random access memory 124 are signed values; when it is false, the weight words are unsigned values.

The data binary point 2922 value indicates the position of the binary point of the data words received from data random access memory 122. Preferably, the data binary point 2922 value gives this position as a number of bit positions counted from the right. In other words, data binary point 2922 indicates how many of the least significant bits of a data word are fractional bits, that is, bits to the right of the binary point. Similarly, the weight binary point 2924 value indicates the position of the binary point of the weight words received from weight random access memory 124. Preferably, when arithmetic logic unit function 2926 is multiply-accumulate or output accumulator, neural processing unit 126 determines the number of bits to the right of the binary point of the value held in accumulator 202 to be the sum of data binary point 2922 and weight binary point 2924. Thus, for example, if data binary point 2922 is 5 and weight binary point 2924 is 3, the value in accumulator 202 has 8 bits to the right of the binary point. When arithmetic logic unit function 2926 is a sum or maximum of the accumulator and a data/weight word, or a pass-through of a data/weight word, neural processing unit 126 determines the number of bits to the right of the binary point of the value held in accumulator 202 to be data/weight binary point 2922/2924, respectively. In another embodiment, a single accumulator binary point 2923 is specified instead of a separate data binary point 2922 and weight binary point 2924. That embodiment is described in more detail below with respect to Figure 29B.
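The rule that the fractional-bit counts add under multiplication can be checked with a short Python sketch. The function name `multiply_accumulate` and the sample operands are assumptions made for illustration; only the 5 + 3 = 8 relationship comes from the text.

```python
def multiply_accumulate(data_words, weight_words, data_bp, weight_bp):
    """Accumulate integer products; return the sum and its binary point.

    data_bp and weight_bp are the fractional-bit counts of the operands.
    A product of two fixed-point integers has data_bp + weight_bp
    fractional bits, so the accumulator inherits that count.
    """
    acc = 0
    for d, w in zip(data_words, weight_words):
        acc += d * w  # plain integer multiply and add, no exponent handling
    return acc, data_bp + weight_bp


# Data 0.5 and 0.25 with 5 fractional bits -> 16 and 8.
# Weights 1.5 and -2.0 with 3 fractional bits -> 12 and -16.
acc, acc_bp = multiply_accumulate([16, 8], [12, -16], data_bp=5, weight_bp=3)
real_value = acc / (1 << acc_bp)  # 0.5 * 1.5 + 0.25 * -2.0
```

Here the accumulator holds 64 with 8 fractional bits, which reads as 0.25, the same value the real-number arithmetic gives.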

Arithmetic logic unit function 2926 specifies the function performed by the arithmetic logic unit 204 of neural processing unit 126. As noted above, arithmetic logic unit function 2926 may include, but is not limited to, the following operations: multiply data word 209 by weight word 203 and add the product to accumulator 202; add accumulator 202 and weight word 203; add accumulator 202 and data word 209; take the maximum of accumulator 202 and data word 209; take the maximum of accumulator 202 and weight word 203; output accumulator 202; pass through data word 209; pass through weight word 203; output zero. In one embodiment, arithmetic logic unit function 2926 is specified by a neural network unit initialization instruction and is used by arithmetic logic unit 204 in response to an execute instruction (not shown). In another embodiment, arithmetic logic unit function 2926 is specified by individual neural network unit instructions, such as the multiply-accumulate and maxwacc instructions described above.

Rounding control 2932 specifies the form of rounding used by rounder 3004 (of Figure 30). In one embodiment, the rounding modes that can be specified include, but are not limited to: no rounding, round to nearest, and random rounding. Preferably, processor 100 includes a random bit source 3003 (see Figure 30) that generates random bits 3005, which are sampled and used to perform random rounding so as to reduce the likelihood of a rounding bias. In one embodiment, when the rounding bit 3005 is one and the sticky bit is zero, neural processing unit 126 rounds up if the sampled random bit 3005 is true and does not round up if the sampled random bit 3005 is false. In one embodiment, random bit source 3003 generates random bits 3005 by sampling random electrical characteristics of processor 100, such as the thermal noise of a semiconductor diode or resistor, although the invention is not limited to this.
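The tie-breaking rule in this paragraph can be sketched in Python as follows. The text only states what happens when the rounding bit is one and the sticky bit is zero; the handling of the other cases below (round up when both bits are set, never round up when the rounding bit is zero) is an assumption, and the function name is hypothetical. The software `random` module merely stands in for a hardware random bit source.

```python
import random


def shift_right_rounded(value, shift, random_bit):
    """Right-shift a non-negative integer by `shift` bits with rounding.

    The rounding bit is the most significant discarded bit; the sticky
    bit is the OR of all lower discarded bits.
    """
    if shift == 0:
        return value
    kept = value >> shift
    round_bit = (value >> (shift - 1)) & 1
    sticky = (value & ((1 << (shift - 1)) - 1)) != 0
    if round_bit and sticky:
        kept += 1  # assumed: clearly above the midpoint, always round up
    elif round_bit and random_bit:
        kept += 1  # exact tie: the sampled random bit decides
    return kept


# An exact tie (binary 101.10) rounds up or down depending on the random bit.
result = shift_right_rounded(0b10110, 2, bool(random.getrandbits(1)))
```

Because ties go up about half the time and down the other half, the rounding error averages toward zero over many operations instead of drifting in one direction.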

Run function 2934 specifies the function applied to the accumulator 202 value 217 to generate the output 133 of neural processing unit 126. As described herein, the run functions 2934 include, but are not limited to: the S-type (sigmoid) function; the hyperbolic tangent function; the soft plus function; the correction (rectify) function; division by a specified power of two; multiplication by a user-specified reciprocal value to accomplish an equivalent division; pass-through of the entire accumulator; and pass-through of the accumulator at the standard size, which is described in more detail in later sections. In one embodiment, the run function is specified by a neural network unit run function instruction. Alternatively, the run function is specified by the initialization instruction and applied in response to an output instruction, such as the run function unit output instruction at address 4 of Fig. 4; in that embodiment, the run function instruction at address 3 of Fig. 4 is subsumed by the output instruction.

The reciprocal 2942 value specifies a value that is multiplied by the accumulator 202 value 217 in order to divide the accumulator 202 value 217. That is, the reciprocal 2942 value specified by the user is the reciprocal of the divisor actually desired. This is useful in conjunction with the convolution or common source (pooling) operations described herein. Preferably, the user specifies the reciprocal 2942 value in two parts, as described in more detail below with respect to Figure 29C. In one embodiment, control buffer 127 includes a field (not shown) that lets the user select one of several built-in divisor values for the division; these built-in divisors correspond to the sizes of commonly used convolution kernels, such as 9, 25, 36, or 49. In this embodiment, run function unit 212 stores the reciprocals of the built-in divisors for multiplication by the accumulator 202 value 217.

Offset 2944 specifies the number of bits by which the shifter of run function unit 212 right-shifts the accumulator 202 value 217 in order to divide it by a power of two. This is useful in conjunction with convolution kernels whose size is a power of two.

The output random access memory 2952 value specifies which of data random access memory 122 and weight random access memory 124 receives the output result 133.

The output binary point 2954 value indicates the position of the binary point of the output result 133. Preferably, the output binary point 2954 value gives this position as a number of bit positions counted from the right. In other words, output binary point 2954 indicates how many of the least significant bits of the output result 133 are fractional bits, that is, bits to the right of the binary point. Run function unit 212 performs rounding, compression, saturation, and size conversion based on the value of output binary point 2954 (and, in most cases, also on the values of data binary point 2922, weight binary point 2924, run function 2934, and/or configuration 2902).

Output order 2956 controls several aspects of the output result 133. In one embodiment, run function unit 212 uses the notion of a standard size, which is twice the width (in bits) specified by configuration 2902. Thus, for example, if configuration 2902 sets the size of the input words received from data random access memory 122 and weight random access memory 124 to 8 bits, the standard size is 16 bits; as another example, if configuration 2902 sets the size of the input words to 16 bits, the standard size is 32 bits. As described herein, accumulator 202 is large (for example, narrow accumulator 202B is 28 bits and wide accumulator 202A is 41 bits) in order to keep intermediate computations, such as 1024 and 512 neural network unit multiply-accumulate instructions respectively, at full precision. Consequently, the accumulator 202 value 217 is larger (in bits) than the standard size, and for most values of run function 2934 (all except pass-through of the entire accumulator), run function unit 212 (for example, the standard size compressor 3008 described below with respect to Figure 30) compresses the accumulator 202 value 217 down to the standard size. A first default value of output order 2956 instructs run function unit 212 to perform the specified run function 2934 to generate an internal result whose size equals the size of the original input words, that is, half the standard size, and to output that internal result as the output result 133. A second default value of output order 2956 instructs run function unit 212 to perform the specified run function 2934 to generate an internal result whose size is twice the size of the original input words, that is, the standard size, and to output the lower half of that internal result as the output result 133; a third default value of output order 2956 instructs run function unit 212 to output the upper half of the standard-size internal result as the output result 133. A fourth default value of output order 2956 instructs run function unit 212 to output the unprocessed least significant word of accumulator 202 as the output result 133; a fifth default value instructs run function unit 212 to output the unprocessed middle word of accumulator 202 as the output result 133; and a sixth default value instructs run function unit 212 to output the unprocessed most significant word of accumulator 202 as the output result 133, where the word width is as specified by configuration 2902, as described in more detail above in the sections corresponding to Fig. 8 through Figure 10. As noted above, outputting the entire accumulator 202 or a standard-size internal result allows other execution units 112 of processor 100 to perform run functions, such as the softmax run function.
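As a rough illustration of the unprocessed least significant, middle, and most significant word outputs, the following Python sketch slices an accumulator bit pattern into words of the configured width. The function name is hypothetical, and treating the accumulator as an unsigned bit pattern is an assumption made for this sketch, not a statement about the hardware.

```python
def raw_accumulator_word(acc_bits, word_width, index):
    """Return word `index` of the accumulator bit pattern.

    index 0 is the least significant word, 1 the middle word, and 2 the
    most significant word; word_width is the configured word size in bits.
    """
    mask = (1 << word_width) - 1
    return (acc_bits >> (index * word_width)) & mask


# A 41-bit pattern in a wide configuration with 16-bit words.
acc_bits = 0x1ABCDEF0123
low_word = raw_accumulator_word(acc_bits, 16, 0)
middle_word = raw_accumulator_word(acc_bits, 16, 1)
high_word = raw_accumulator_word(acc_bits, 16, 2)
```

Three such outputs together move a full 41-bit value out of the unit, with the top word carrying only the remaining 9 bits.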

Although the fields of Figure 29A (and of Figure 29B and Figure 29C) are described as residing in control buffer 127, the invention is not limited to this; one or more of the fields may reside in other parts of neural network unit 121. Preferably, many of the fields are included in the neural network unit instructions themselves and are decoded by sequencer 128 to generate a microinstruction 3416 (see Figure 34) that controls arithmetic logic unit 204 and/or run function unit 212. Alternatively, the fields may be included in a micro-operation 3414 (see Figure 34) stored in media cache 118 that controls arithmetic logic unit 204 and/or run function unit 212. Such embodiments reduce the use of the neural network unit initialization instruction, and other embodiments eliminate the initialization instruction altogether.

As noted above, a neural network unit instruction can specify that an arithmetic logic operation be performed on memory operands (such as words from data random access memory 122 and/or weight random access memory 124) or on a rotated operand (such as one from multitask buffer 208/705). In one embodiment, a neural network unit instruction can also specify an operand as the buffered output of a run function (such as the output of buffer 3038 of Figure 30). In addition, as noted above, a neural network unit instruction can specify that the current row address of data random access memory 122 or weight random access memory 124 be incremented. In one embodiment, the neural network unit instruction can specify an immediate signed integer delta that is added to the current row, so that the row is incremented or decremented by a value other than one.

Figure 29B is a block diagram showing another embodiment of the control buffer 127 of Fig. 1. The control buffer 127 of Figure 29B is similar to the control buffer 127 of Figure 29A, except that it includes an accumulator binary point 2923. Accumulator binary point 2923 indicates the binary point position of accumulator 202. Preferably, the accumulator binary point 2923 value gives this position as a number of bit positions counted from the right. In other words, accumulator binary point 2923 indicates how many of the least significant bits of accumulator 202 are fractional bits, that is, bits to the right of the binary point. In this embodiment, accumulator binary point 2923 is specified explicitly, rather than being determined implicitly as in the embodiment of Figure 29A.

Figure 29C is a block diagram showing an embodiment in which the reciprocal 2942 of Figure 29A is stored in two parts. The first part 2962 is a shift value that gives the number of suppressed leading zeros 2962 in the true reciprocal value by which the user wants the accumulator 202 value 217 to be multiplied. The number of leading zeros is the number of consecutive zeros immediately to the right of the binary point. The second part 2964 is the leading-zero-suppressed reciprocal value, that is, the true reciprocal value with all leading zeros removed. In one embodiment, the number of suppressed leading zeros 2962 is stored in 4 bits, and the leading-zero-suppressed reciprocal value 2964 is stored as an 8-bit unsigned value.

As an example, assume that the user wants to multiply the accumulator 202 value 217 by the reciprocal of 49. The reciprocal of 49, expressed in binary with 13 fractional bits, is 0.0000010100111, which has five leading zeros. The user therefore writes the value 5 into the number of suppressed leading zeros 2962 and the value 10100111 into the leading-zero-suppressed reciprocal value 2964. After reciprocal multiplier "divider A" 3014 (see Figure 30) multiplies the accumulator 202 value 217 by the leading-zero-suppressed reciprocal value 2964, it right-shifts the resulting product by the number of suppressed leading zeros 2962. Such an embodiment expresses the reciprocal 2942 value with high accuracy using relatively few bits.
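The two-part encoding can be reproduced with a minimal Python sketch. The function names are hypothetical, the sketch assumes a divisor greater than 1 and a truncated (not rounded) 8-bit value, and it folds the 8 fractional bits of the suppressed value into the final shift so that the result is an integer quotient; the hardware described here instead tracks the binary point separately.

```python
def encode_reciprocal(divisor, value_bits=8):
    """Split 1/divisor into (suppressed leading zeros, 8-bit value).

    Assumes divisor > 1. The value is 1/divisor truncated so that its
    most significant kept bit is the first 1 after the binary point.
    """
    zeros = 0
    while ((1 << (zeros + value_bits)) // divisor).bit_length() < value_bits:
        zeros += 1
    return zeros, (1 << (zeros + value_bits)) // divisor


def divide_by_reciprocal(acc, zeros, value, value_bits=8):
    """Approximate acc // divisor with one multiply and one right shift."""
    return (acc * value) >> (zeros + value_bits)


zeros, value = encode_reciprocal(49)
quotient = divide_by_reciprocal(4900, zeros, value)
```

For a divisor of 49 this yields 5 suppressed zeros and the value 10100111, matching the example above. Because the 8-bit reciprocal is truncated, the quotient for 4900 comes out as 99 rather than 100, which shows the kind of small error that a limited-width reciprocal introduces.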

Figure 30 is a block diagram showing an embodiment of the run function unit 212 of Fig. 2. Run function unit 212 includes the control buffer 127 of Fig. 1; a positive type converter (PFC) and output binary point aligner (OBPA) 3002 that receives the accumulator 202 value 217; a rounder 3004 that receives the accumulator 202 value 217 and an indication of the number of bits shifted out by output binary point aligner 3002; a random bit source 3003 that generates random bits 3005, as described above; a first multiplexer 3006 that receives the output of positive type converter and output binary point aligner 3002 and the output of rounder 3004; a standard size compressor (CCS) and saturator 3008 that receives the output of first multiplexer 3006; a digit selector and saturator 3012 that receives the output of standard size compressor and saturator 3008; a corrector 3018 that receives the output of standard size compressor and saturator 3008; a reciprocal multiplier 3014 that receives the output of standard size compressor and saturator 3008; a right shifter 3016 that receives the output of standard size compressor and saturator 3008; a hyperbolic tangent (tanh) module 3022 that receives the output of digit selector and saturator 3012; an S-type module 3024 that receives the output of digit selector and saturator 3012; a soft plus module 3026 that receives the output of digit selector and saturator 3012; a second multiplexer 3032 that receives the outputs of tanh module 3022, S-type module 3024, soft plus module 3026, corrector 3018, reciprocal multiplier 3014, and right shifter 3016, together with the standard size output 3028 passed through from standard size compressor and saturator 3008; a sign restorer 3034 that receives the output of second multiplexer 3032; a size converter and saturator 3036 that receives the output of sign restorer 3034; a third multiplexer 3037 that receives the output of size converter and saturator 3036 and the accumulator output 217; and an output buffer 3038 that receives the output of multiplexer 3037 and whose output is the result 133 of Fig. 1.

Positive type converter and output binary point aligner 3002 receives the accumulator 202 value 217. Preferably, as noted above, the accumulator 202 value 217 is a full-precision value. That is, accumulator 202 has enough storage bits to hold an accumulated value that is the sum, generated by integer adder 244, of a series of products generated by integer multiplier 242, without discarding any bit of the individual products of multiplier 242 or of the sums of the adder, so that no precision is lost. Preferably, accumulator 202 has at least enough bits to hold the maximum number of product accumulations that neural network unit 121 can be programmed to perform. For example, for the program of Fig. 4, in the wide configuration the maximum number of product accumulations that neural network unit 121 can be programmed to perform is 512, and accumulator 202 is 41 bits wide. As another example, for the program of Figure 20, in the narrow configuration the maximum number of product accumulations that neural network unit 121 can be programmed to perform is 1024, and accumulator 202 is 28 bits wide. In general, full-precision accumulator 202 has at least Q bits, where Q is the sum of M and log₂P, M is the bit width of the integer product of multiplier 242 (for example, 16 bits for narrow multiplier 242 and 32 bits for wide multiplier 242), and P is the maximum permitted number of products that can be accumulated into accumulator 202. Preferably, the maximum number of product accumulations is specified in the programming specification given to programmers of neural network unit 121. In one embodiment, the maximum count that sequencer 128 enforces for a multiply-accumulate neural network unit instruction (such as the instruction at address 2 of Fig. 4) is, for example, 511, on the assumption that a previous multiply-accumulate instruction (such as the instruction at address 1 of Fig. 4) loads the row of data/weight words 206/207 from data/weight random access memory 122/124.
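The width relationship Q = M + log₂P can be checked against the two configurations mentioned above with a few lines of Python. The function name is hypothetical, and rounding log₂P up for values of P that are not powers of two is an assumption of this sketch.

```python
def min_accumulator_bits(product_bits, max_products):
    """Minimum accumulator width Q = M + log2(P) for full precision.

    (max_products - 1).bit_length() equals ceil(log2(max_products)) for
    max_products >= 1, which avoids floating-point logarithms.
    """
    return product_bits + (max_products - 1).bit_length()


wide_q = min_accumulator_bits(product_bits=32, max_products=512)
narrow_q = min_accumulator_bits(product_bits=16, max_products=1024)
```

The wide case gives 41 bits, which matches the 41-bit wide accumulator, and the narrow case gives 26 bits, which fits within the 28-bit narrow accumulator.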

Using an accumulator 202 that is wide enough to accumulate a full-precision value for the maximum permitted number of accumulations simplifies the design of the arithmetic logic unit 204 of neural processing unit 126. In particular, it removes the need for logic that saturates sums generated by integer adder 244 that would overflow a smaller accumulator, and for logic that keeps track of the accumulator's binary point position to determine whether an overflow has occurred and therefore whether saturation is needed. To illustrate the problem with a design that has a non-full-precision accumulator and instead relies on saturation logic to handle its overflows, assume the following conditions.

(1) The data word values range between 0 and 1, and all of their storage bits are used to store fractional bits. The weight word values range between -8 and +8, and all but three of their storage bits are used to store fractional bits. The accumulated values that serve as input to a hyperbolic tangent run function range between -8 and +8, and all but three of their storage bits are used to store fractional bits.

(2) The bit width of the accumulator is not full precision (for example, it is only as wide as a product).

(3) If the accumulator were full precision, the final accumulated value would also lie between -8 and +8 (for example, +4.2); however, the products before a "point A" in the sequence tend to be positive much more often, whereas the products after point A tend to be negative much more often.

In this situation, an incorrect result (that is, a result other than +4.2) may be obtained. At some point before point A, the accumulator would need to reach a value above its saturation maximum of +8, for example +8.2, and the excess 0.2 is lost. The accumulator may even remain at the saturated value through further product accumulations, losing still more positive value. As a result, the final accumulator value may be smaller than the value that a full-precision accumulator would have produced (that is, less than +4.2).
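The scenario can be reproduced numerically with a small Python sketch. The product sequence is invented for illustration and is expressed in tenths so that all arithmetic is exact integer arithmetic; the function name and the `limit` parameter are hypothetical.

```python
def accumulate(products, limit=None):
    """Sum products, optionally saturating the running total at +/-limit."""
    total = 0
    for p in products:
        total += p
        if limit is not None:
            total = max(-limit, min(limit, total))  # clamp on overflow
    return total


# Values in tenths: +3.0, +3.0, +2.2 before point A, then -2.0, -2.0 after.
products = [30, 30, 22, -20, -20]
full_precision = accumulate(products)       # wide accumulator, no clamping
saturating = accumulate(products, limit=80) # clamps at +8.0
```

The full-precision total is 42 tenths (+4.2), whereas the saturating accumulator reaches +8.2, is clamped to +8.0, and ends at 40 tenths (+4.0): the 0.2 lost at the clamp is never recovered.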

When the accumulator 202 value 217 is negative, positive type converter 3002 converts it to positive form and generates an additional bit that indicates whether the original value was positive or negative; this bit is passed down the run function unit 212 pipeline along with the value. Converting negative values to positive form simplifies the subsequent operations of run function unit 212. For example, after this conversion only positive values are input to tanh module 3022 and S-type module 3024, which simplifies the design of these modules. It also simplifies rounder 3004 and saturator 3008.

Output binary point aligner 3002 right-shifts, or scales, the positive-form value to align it with the output binary point 2954 specified in control buffer 127. Preferably, output binary point aligner 3002 computes the shift amount as the difference between the number of fractional bits of the accumulator 202 value 217 (for example, as specified by accumulator binary point 2923 or by the sum of data binary point 2922 and weight binary point 2924) and the number of fractional bits of the output (for example, as specified by output binary point 2954). Thus, for example, if accumulator 202 binary point 2923 is 8 (as in the example above) and output binary point 2954 is 3, output binary point aligner 3002 right-shifts the positive-form value by 5 bits to generate the result provided to multiplexer 3006 and rounder 3004.
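A minimal Python sketch of these two steps, positive-form conversion followed by binary point alignment, is given below. The function names are hypothetical, and the left shift used when the output has more fractional bits than the accumulator is an assumption; the text only describes the right-shift case.

```python
def to_positive_form(value):
    """Return the magnitude and a flag recording whether value was negative."""
    return abs(value), value < 0


def align_binary_point(magnitude, acc_fraction_bits, out_fraction_bits):
    """Shift so the value has out_fraction_bits bits right of the point."""
    shift = acc_fraction_bits - out_fraction_bits
    if shift >= 0:
        return magnitude >> shift  # discards the low-order fractional bits
    return magnitude << -shift     # assumed behaviour, not stated in the text


# -1.375 with 8 fractional bits is stored as -352.
magnitude, was_negative = to_positive_form(-352)
aligned = align_binary_point(magnitude, acc_fraction_bits=8, out_fraction_bits=3)
```

With 8 accumulator fractional bits and 3 output fractional bits the shift is 5, so 352 becomes 11, which reads as 1.375 with three fractional bits; the sign flag travels alongside so that the sign can be restored later.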

Rounder 3004 rounds the accumulator 202 value 217. Preferably, rounder 3004 generates a rounded version of the positive-form value produced by positive type converter and output binary point aligner 3002 and provides the rounded version to multiplexer 3006. Rounder 3004 rounds according to the rounding control 2932 described above, which, as described herein, may include random rounding using random bits 3005. Based on rounding control 2932 (which, as described herein, may include random rounding), multiplexer 3006 selects one of its inputs, namely either the positive-form value from positive type converter and output binary point aligner 3002 or the rounded version from rounder 3004, and provides the selected value to standard size compressor and saturator 3008. Preferably, if rounding control 2932 specifies no rounding, multiplexer 3006 selects the output of positive type converter and output binary point aligner 3002; otherwise it selects the output of rounder 3004. In other embodiments, run function unit 212 performs additional rounding. For example, in one embodiment, when digit selector 3012 compresses the bits of the output of standard size compressor and saturator 3008 (as described below), it rounds based on the low-order bits that are lost. As another example, the product of reciprocal multiplier 3014 (described below) may be rounded. As yet another example, when size converter 3036 converts to the appropriate output size (as described below), which may involve losing low-order bits that are used to determine rounding, it performs a rounding operation.

The standard size compressor 3008 compresses the multiplexer 3006 output value to the standard size. Thus, for example, if the neural processing unit 126 is in a narrow or funnel configuration 2902, the standard size compressor 3008 compresses the 28-bit multiplexer 3006 output value to 16 bits; and if the neural processing unit 126 is in a wide configuration 2902, the standard size compressor 3008 compresses the 41-bit multiplexer 3006 output value to 32 bits. However, before compressing to the standard size, if the pre-compression value is greater than the maximum value expressible in the standard form, the saturator 3008 saturates the pre-compression value to the maximum value expressible in the standard form. For example, if any bit of the pre-compression value to the left of the most significant standard-form bit has a value of 1, the saturator 3008 saturates to the maximum value (e.g., to all 1s).
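The saturate-then-compress rule can be sketched as follows. This is an illustrative model only: the function name is hypothetical, and the value is assumed to be in positive form (sign already removed), as the surrounding text describes.

```python
def saturate_and_compress(positive_value, standard_bits):
    """Compress a positive-form value to standard_bits bits.

    If any bit to the left of the most significant standard-form bit is 1,
    the value cannot be expressed, so it saturates to all 1s.
    """
    max_value = (1 << standard_bits) - 1
    if positive_value >> standard_bits:
        return max_value
    return positive_value
```

For a narrow configuration the call would use `standard_bits=16`, and for a wide configuration `standard_bits=32`.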

Preferably, the tanh module 3022, the sigmoid module 3024 and the softplus module 3026 each comprise a lookup table, e.g., a programmable logic array (PLA), a read-only memory (ROM), or combinational logic gates. In one embodiment, in order to simplify and reduce the size of the modules 3022/3024/3026, the input value provided to them has a 3.4 form, i.e., three integer bits and four fractional bits; that is, the input value has four bits to the right of the binary point and three bits to the left of the binary point. These values are chosen because at the extremes of the input value range (-8, +8) of the 3.4 form, the output values asymptotically approach their minimum/maximum values. However, the invention is not limited thereto, and other embodiments place the binary point at a different location, e.g., in a 4.3 form or a 2.5 form. The bit selector 3012 selects, from the bits output by the standard size compressor and saturator 3008, the bits that satisfy the 3.4 form. This involves compression, i.e., some bits are lost, because the standard form has a larger number of bits. However, before selecting/compressing the output value of the standard size compressor and saturator 3008, if the pre-compression value is greater than the maximum value expressible in the 3.4 form, the saturator 3012 saturates the pre-compression value to the maximum value expressible in the 3.4 form. For example, if any bit of the pre-compression value to the left of the most significant 3.4-form bit has a value of 1, the saturator 3012 saturates to the maximum value (e.g., to all 1s).

The tanh module 3022, the sigmoid module 3024 and the softplus module 3026 perform their respective activation functions (described above) on the 3.4-form value output by the standard size compressor and saturator 3008 to generate a result. Preferably, the result of the tanh module 3022 and of the sigmoid module 3024 is a 7-bit result in a 0.7 form, i.e., zero integer bits and seven fractional bits; that is, the result has seven bits to the right of the binary point. Preferably, the result of the softplus module 3026 is a 7-bit result in a 3.4 form, i.e., in the same form as the input to the module 3026. Preferably, the outputs of the tanh module 3022, the sigmoid module 3024 and the softplus module 3026 are extended to the standard form (e.g., with leading zeroes added as necessary) and aligned so that the binary point is as specified by the output binary point 2954 value.

The rectifier 3018 generates a rectified version of the output value of the standard size compressor and saturator 3008. That is, if the output value of the standard size compressor and saturator 3008 (whose sign is piped down the pipeline as described above) is negative, the rectifier 3018 outputs zero; otherwise, the rectifier 3018 outputs its input value. Preferably, the output of the rectifier 3018 is in the standard form and has the binary point specified by the output binary point 2954 value.

The reciprocal multiplier 3014 multiplies the output of the standard size compressor and saturator 3008 by the user-specified reciprocal value specified in the reciprocal value 2942 to generate its standard-size product. This product is effectively the quotient of the output value of the standard size compressor and saturator 3008 and a divisor that is the reciprocal of the reciprocal value 2942. Preferably, the output of the reciprocal multiplier 3014 is in the standard form and has the binary point specified by the output binary point 2954 value.
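The idea of dividing by multiplying with a stored reciprocal can be illustrated with a short sketch. The encoding below (a 16-bit fractional fixed-point reciprocal) and the round-to-nearest step are assumptions made for illustration; this passage does not fix the representation of the reciprocal value 2942, and rounding of the product is described only as one optional embodiment.

```python
RECIP_FRAC_BITS = 16  # assumed number of fractional bits in the stored reciprocal


def encode_reciprocal(divisor):
    """Hypothetical encoding: 1/divisor as an integer scaled by 2**RECIP_FRAC_BITS."""
    return round((1 << RECIP_FRAC_BITS) / divisor)


def multiply_by_reciprocal(value, reciprocal):
    """Approximate value / divisor using one multiply and one shift.

    Half an output unit is added before the shift to round to nearest
    (an assumed rounding choice).
    """
    product = value * reciprocal
    return (product + (1 << (RECIP_FRAC_BITS - 1))) >> RECIP_FRAC_BITS
```

For example, `multiply_by_reciprocal(1000, encode_reciprocal(5))` computes 1000 divided by 5 without a hardware divider. Because the stored reciprocal has finite precision, the result is an approximation of the true quotient in general.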

The right shifter 3016 shifts the output of the standard size compressor and saturator 3008 by the user-specified number of bits specified in the shift amount value 2944 to generate its standard-size quotient. Preferably, the output of the right shifter 3016 is in the standard form and has the binary point specified by the output binary point 2954 value.

The multiplexer 3032 selects the appropriate input as specified by the activation function 2934 value and provides the selection to the sign restorer 3034. If the original accumulator 202 value 217 was negative, the sign restorer 3034 converts the positive-form value output by the multiplexer 3032 to a negative form, e.g., to two's complement form.

The size converter 3036 converts the output of the sign restorer 3034 to the proper size based on the value of the output command 2956 described with respect to Figure 29A. Preferably, the output of the sign restorer 3034 has a binary point specified by the output binary point 2954 value. Preferably, for the first predetermined value of the output command, the size converter 3036 discards the upper half of the bits of the sign restorer 3034 output. Furthermore, if the output of the sign restorer 3034 is positive and exceeds the maximum value expressible in the word size specified by the configuration 2902, or is negative and is less than the minimum value expressible in the word size, the saturator 3036 saturates its output to the respective maximum/minimum value expressible in the word size. For the second and third predetermined values, the size converter 3036 passes through the output of the sign restorer 3034.
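The saturation applied for the first predetermined value of the output command can be sketched as a signed clamp. The function name is illustrative, and two's complement word limits are assumed.

```python
def saturate_signed(value, word_bits):
    """Clamp a signed value to the range expressible in word_bits bits
    (two's complement limits assumed)."""
    max_value = (1 << (word_bits - 1)) - 1
    min_value = -(1 << (word_bits - 1))
    return max(min_value, min(max_value, value))
```

With a narrow (8-bit) word, values above 127 saturate to 127 and values below -128 saturate to -128, while in-range values pass through unchanged.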

Based on the output command 2956, the multiplexer 3037 selects either the output of the size converter and saturator 3036 or the accumulator 202 output 217 for provision to the output register 3038. More specifically, for the first and second predetermined values of the output command 2956, the multiplexer 3037 selects the lower word (whose size is specified by the configuration 2902) of the output of the size converter and saturator 3036. For the third predetermined value, the multiplexer 3037 selects the upper word of the output of the size converter and saturator 3036. For the fourth predetermined value, the multiplexer 3037 selects the lower word of the raw accumulator 202 value 217; for the fifth predetermined value, it selects the middle word of the raw accumulator 202 value 217; and for the sixth predetermined value, it selects the upper word of the raw accumulator 202 value 217. As described above, preferably the activation function unit 212 pads the upper bits of the upper word of the raw accumulator 202 value 217 with zeroes.

Figure 31 shows an example of the operation of the activation function unit 212 of Figure 30. As shown, the configuration 2902 of the neural processing units 126 is set to the narrow configuration. Additionally, the signed data 2912 and signed weight 2914 values are true. Additionally, the data binary point 2922 value indicates that the binary point for the data random access memory 122 words is located such that there are 7 bits to its right, and an example value of the first data word received by a neural processing unit 126 is shown as 0.1001110. Further, the weight binary point 2924 value indicates that the binary point for the weight random access memory 124 words is located such that there are 3 bits to its right, and an example value of the first weight word received by the neural processing unit 126 is shown as 00001.010.

The 16-bit product of the first data word and the first weight word (which is accumulated with the initial zero value of the accumulator 202) is shown as 000000.1100001100. Because the data binary point 2922 is 7 and the weight binary point 2924 is 3, the implied accumulator 202 binary point is located such that there are 10 bits to its right. In the narrow configuration, as in this embodiment, the accumulator 202 is 28 bits wide. In the example, the accumulator 202 value 217 after all the arithmetic logic operations are performed (e.g., all 1024 multiply-accumulate operations of Figure 20) is shown as 000000000000000001.1101010100.
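The product in this example can be checked with ordinary integer arithmetic: the raw words are multiplied as integers, and the fractional bit counts add. The helper name below is illustrative and not from the patent.

```python
def to_fixed_string(value, total_bits, frac_bits):
    """Format a non-negative integer as a binary string with a binary point."""
    bits = format(value, "0{}b".format(total_bits))
    split = total_bits - frac_bits
    return bits[:split] + "." + bits[split:]


data_word = 0b01001110      # 0.1001110 with 7 fractional bits
weight_word = 0b00001010    # 00001.010 with 3 fractional bits

product = data_word * weight_word   # plain integer multiply
product_frac_bits = 7 + 3           # binary points add: 10 fractional bits

print(to_fixed_string(product, 16, product_frac_bits))
```

No exponent handling or binary point alignment is needed for the multiply itself; the binary point of the product is implied by the two input binary points.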

The output binary point 2954 value indicates that the binary point of the output is located such that there are 7 bits to its right. Therefore, after passing through the output binary point aligner 3002 and the standard size compressor 3008, the accumulator 202 value 217 is scaled, rounded and compressed to the standard-form value 000000001.1101011. In this example, the output binary point location indicates 7 fractional bits and the accumulator 202 binary point location indicates 10 fractional bits. Therefore, the output binary point aligner 3002 calculates a difference of 3 and scales the accumulator 202 value 217 by shifting it right 3 bits. This is shown in Figure 31 as the loss of the 3 least significant bits (binary 100) of the accumulator 202 value 217. Further in this example, the round control 2932 value indicates stochastic rounding, and it is assumed that the sampled random bit 3005 is true. Consequently, as described above, the least significant bit is rounded up, because the round bit of the accumulator 202 value 217 (the most significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) is one and the sticky bit (the Boolean OR of the 2 least significant of the 3 bits shifted out by the scaling) is zero.
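The scaling and rounding step of this example can be reproduced with the sketch below. The rounding rule is an assumption inferred from the example: when the round bit is one and the sticky bit is zero (an exact tie), the random bit decides; when the round bit is one and the sticky bit is nonzero, the value is rounded up. The function name is illustrative.

```python
def scale_and_round(positive_value, shift, random_bit):
    """Shift right by `shift` bits and apply the assumed stochastic rounding rule."""
    if shift <= 0:
        return positive_value << -shift
    kept = positive_value >> shift
    shifted_out = positive_value & ((1 << shift) - 1)
    # Round bit: most significant of the bits shifted out.
    round_bit = (shifted_out >> (shift - 1)) & 1
    # Sticky bit: Boolean OR of the remaining shifted-out bits.
    sticky_bit = int((shifted_out & ((1 << (shift - 1)) - 1)) != 0)
    if round_bit and (sticky_bit or random_bit):
        kept += 1
    return kept


# Accumulator value from Figure 31: 18 integer bits, 10 fractional bits.
acc = 0b000000000000000001_1101010100
rounded = scale_and_round(acc, 10 - 7, random_bit=True)
```

Here the three shifted-out bits are 100, so the round bit is one and the sticky bit is zero; with the random bit true, 1.1101010 becomes 1.1101011, matching the value in the text.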

In this example, the activation function 2934 indicates that a sigmoid function is to be used. Consequently, the bit selector 3012 selects the bits of the standard-form value such that the input to the sigmoid module 3024 has three integer bits and four fractional bits, as described above, i.e., the value 001.1101 as shown. The output value of the sigmoid module 3024 is put in the standard form, i.e., the value 000000000.1101110 as shown.
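The sigmoid step of this example can be approximated in software as follows. The actual contents of the lookup table are not given in the text, so the sketch assumes each entry is the true sigmoid rounded to the nearest 0.7-form value; the function names are illustrative.

```python
import math


def select_3_4(standard_value, frac_bits):
    """Keep four fractional bits and saturate to the 7-bit 3.4 form."""
    shift = frac_bits - 4
    value = standard_value >> shift if shift >= 0 else standard_value << -shift
    return min(value, 0b1111111)


def sigmoid_0_7(x_3_4):
    """Assumed table entry: sigmoid of a 3.4-form input as a 0.7-form output."""
    x = x_3_4 / 16.0                     # four fractional bits
    y = 1.0 / (1.0 + math.exp(-x))
    return min(round(y * 128), 0b1111111)  # seven fractional bits, saturated


x = select_3_4(0b11101011, 7)   # 000000001.1101011 -> 001.1101
y = sigmoid_0_7(x)
```

Under this assumption the input 001.1101 (1.8125) yields 0.1101110 (0.859375), which agrees with the value shown in Figure 31.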

The output command 2956 in this example specifies the first predetermined value, i.e., to output a word of the size indicated by the configuration 2902, which in this case is a narrow word (8 bits). Consequently, the size converter 3036 converts the standard-form sigmoid output value to an 8-bit quantity having an implied binary point with 7 bits to its right, yielding the output value 01101110, as shown.

Figure 32 shows a second example of the operation of the activation function unit 212 of Figure 30. The example of Figure 32 illustrates the operation of the activation function unit 212 when the activation function 2934 indicates that the accumulator 202 value 217 is to be passed through in the standard size. As shown, the configuration 2902 is set to the narrow configuration of the neural processing units 126.

In this example, the accumulator 202 is 28 bits wide, and the accumulator 202 binary point is located such that there are 10 bits to its right (either because the sum of the data binary point 2922 and the weight binary point 2924 is 10 in one embodiment, or because the accumulator binary point 2923 is explicitly specified as having a value of 10 in another embodiment). In the example, Figure 32 shows an accumulator 202 value 217 of 000001100000011011.1101111010 after all the arithmetic logic operations are performed.

In this example, the output binary point 2954 value indicates that the binary point of the output is located such that there are 4 bits to its right. Therefore, after passing through the output binary point aligner 3002 and the standard size compressor 3008, the accumulator 202 value 217 is saturated and compressed to the standard-form value 111111111111.1111 as shown, which is received by the multiplexer 3032 as the standard-size pass-through value 3028.

Two output commands 2956 are shown in this example. The first output command 2956 specifies the second predetermined value, i.e., to output the lower word of the standard-form size. Because the size indicated by the configuration 2902 is a narrow word (8 bits), which implies a standard size of 16 bits, the size converter 3036 selects the lower 8 bits of the standard-size pass-through value 3028 to yield the 8-bit value 11111111, as shown. The second output command 2956 specifies the third predetermined value, i.e., to output the upper word of the standard-form size. Consequently, the size converter 3036 selects the upper 8 bits of the standard-size pass-through value 3028 to yield the 8-bit value 11111111, as shown.
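The Figure 32 example can be traced end to end with a short sketch. It models only the pass-through path for a positive value, assumes a non-negative shift amount, and omits rounding (which does not affect the outcome here because the value saturates); the function name is illustrative.

```python
def passthrough_standard(acc_positive, acc_frac_bits, out_frac_bits, standard_bits):
    """Align to the output binary point, then saturate to the standard size."""
    aligned = acc_positive >> (acc_frac_bits - out_frac_bits)  # assumes shift >= 0
    max_value = (1 << standard_bits) - 1
    return max_value if aligned >> standard_bits else aligned


# Accumulator value from Figure 32: 18 integer bits, 10 fractional bits.
acc = 0b000001100000011011_1101111010

standard_value = passthrough_standard(acc, 10, 4, 16)
lower_word = standard_value & 0xFF          # second predetermined value
upper_word = (standard_value >> 8) & 0xFF   # third predetermined value
```

After the 6-bit right shift the value still needs more than 16 bits, so it saturates to all 1s, and both the lower and the upper 8-bit words are 11111111.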

Figure 33 shows a third example of the operation of the activation function unit 212 of Figure 30. The example of Figure 33 illustrates the operation of the activation function unit 212 when the activation function 2934 indicates that the entire raw accumulator 202 value 217 is to be passed through. As shown, the configuration 2902 is set to the wide configuration of the neural processing units 126 (e.g., 16-bit input words).

In this example, the accumulator 202 is 41 bits wide, and the accumulator 202 binary point is located such that there are 8 bits to its right (either because the sum of the data binary point 2922 and the weight binary point 2924 is 8 in one embodiment, or because the accumulator binary point 2923 is explicitly specified as having a value of 8 in another embodiment). In the example, Figure 33 shows an accumulator 202 value 217 of 001000000000000000001100000011011.11011110 after all the arithmetic logic operations are performed.

Three output commands 2956 are shown in this example. The first output command specifies the fourth predetermined value, i.e., to output the lower word of the raw accumulator 202 value; the second output command specifies the fifth predetermined value, i.e., to output the middle word of the raw accumulator 202 value; and the third output command specifies the sixth predetermined value, i.e., to output the upper word of the raw accumulator 202 value. Because the size indicated by the configuration 2902 is a wide word (16 bits), as shown in Figure 33, in response to the first output command 2956 the multiplexer 3037 selects the 16-bit value 0001101111011110; in response to the second output command 2956 the multiplexer 3037 selects the 16-bit value 0000000000011000; and in response to the third output command 2956 the multiplexer 3037 selects the 16-bit value 0000000001000000.
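The word selection in this example amounts to slicing the 41-bit accumulator into 16-bit words, with the upper word zero-padded. In the sketch below the Figure 33 accumulator value is written as a literal grouped into its upper 9 bits, middle 16 bits and lower 16 bits; the function name is illustrative.

```python
WORD_BITS = 16


def accumulator_word(acc, index, word_bits=WORD_BITS):
    """Return word `index` of the raw accumulator (0 = lower, 1 = middle, 2 = upper).

    Bits above the accumulator width are zero, so the upper word is
    implicitly zero-padded.
    """
    return (acc >> (index * word_bits)) & ((1 << word_bits) - 1)


# 41-bit accumulator value of Figure 33, grouped as 9 + 16 + 16 bits.
acc = 0b001000000_0000000000011000_0001101111011110

lower = accumulator_word(acc, 0)    # fourth predetermined value
middle = accumulator_word(acc, 1)   # fifth predetermined value
upper = accumulator_word(acc, 2)    # sixth predetermined value
```

Reading the raw accumulator therefore takes three output commands in the wide configuration, one per word.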

As described above, the neural network unit 121 operates on integer data rather than floating-point data. This helps simplify each neural processing unit 126, or at least its arithmetic logic unit 204 portion. For example, the arithmetic logic unit 204 need not include the adders that a floating-point implementation would need in order to add the exponents of the multiplicands for the multiplier 242. Similarly, the arithmetic logic unit 204 need not include the shifters that a floating-point implementation would need in order to align the binary points of the addends for the adder 234. As one skilled in the art will appreciate, floating-point units are generally very complex; thus, these are only examples of simplifications to the arithmetic logic unit 204, and the integer embodiments described here, with hardware fixed-point assist that enables the user to specify the relevant binary points, enjoy other simplifications as well. Compared with a floating-point embodiment, using integer units as the arithmetic logic units 204 can yield a smaller (and faster) neural processing unit 126, which in turn facilitates the integration of a large array of neural processing units 126 into the neural network unit 121. The activation function unit 212 portion handles the scaling and saturation of the accumulator 202 value 217 based on the number of fractional bits desired in the accumulated value and the number of fractional bits desired in the output value, both of which are preferably user-specified. Any additional complexity, and the accompanying increase in size, power consumption and/or time, in the fixed-point hardware assist of the activation function units 212 can be amortized by sharing the activation function units 212 among the arithmetic logic units 204, as in the embodiment of Figure 11, because a shared embodiment can reduce the number of activation function units 1112.

The embodiments described herein enjoy many of the benefits of the reduced hardware complexity of integer arithmetic units (compared with floating-point arithmetic units), while still providing arithmetic operations on fractional numbers, i.e., numbers with a binary point. An advantage of floating-point arithmetic is that it accommodates arithmetic operations on data whose individual values may fall anywhere within a very wide range (which is effectively limited only by the size of the exponent range and may therefore be very large). That is, each floating-point number has its own, potentially unique, exponent value. However, the embodiments described herein recognize and take advantage of the fact that in certain applications the input data is highly parallel and its values fall within a relatively narrow range, such that all the parallel values can share the same "exponent". Therefore, these embodiments enable the user to specify the binary point location once for all the input values and/or accumulated values. Similarly, recognizing and taking advantage of the similar range characteristics of the parallel outputs, these embodiments enable the user to specify the binary point location once for all the output values. An artificial neural network is one example of such an application, although the embodiments may also be employed to perform computations for other applications. By specifying the binary point location once for multiple inputs rather than for each individual input number, the embodiments can use memory space more efficiently than a floating-point implementation (e.g., require less memory) and/or provide greater precision for a similar amount of memory, because the bits that would be used for an exponent in a floating-point implementation can instead be used to increase the precision of the magnitude.

In addition, the embodiments recognize that precision may be lost when accumulating a large series of integer operations (e.g., through overflow or through loss of less significant fractional bits), and they provide a solution, primarily in the form of an accumulator that is large enough to avoid the loss of precision.

Direct execution of neural network unit micro-operations

Figure 34 is a block diagram showing the processor 100 of Figure 1 and portions of the neural network unit 121 in more detail. The neural network unit 121 includes the pipeline stages 3401 of the neural processing units 126. The pipeline stages 3401 are separated by staging registers and include combinational logic that accomplishes the operation of the neural processing units 126 described herein, such as Boolean logic gates, multiplexers, adders, multipliers, comparators and so forth. The pipeline stages 3401 receive a micro-operation 3418 from a multiplexer 3402. The micro-operation 3418 flows down the pipeline stages 3401 and controls their combinational logic. The micro-operation 3418 is a collection of bits. Preferably, the micro-operation 3418 includes the bits of the data random access memory 122 memory address 123, the bits of the weight random access memory 124 memory address 125, the bits of the program memory 129 memory address 131, the control signals 213/713 of the multiplexed registers 208/705, and many of the fields of the control register 127 (e.g., of Figures 29A through 29C). In one embodiment, the micro-operation 3418 comprises approximately 120 bits. The multiplexer 3402 receives micro-operations from three different sources and selects one of them as the micro-operation 3418 provided to the pipeline stages 3401.

One micro-operation source of the multiplexer 3402 is the sequencer 128 of Figure 1. The sequencer 128 decodes the neural network unit instructions received from the program memory 129 and in response generates a micro-operation 3416 that is provided to a first input of the multiplexer 3402.

A second micro-operation source of the multiplexer 3402 is a decoder 3404 that receives microinstructions 105 from the reservation stations 108 of Figure 1, along with operands from the general purpose registers 116 and the media registers 118. Preferably, as described above, the microinstructions 105 are generated by the instruction translator 104 in response to translating MTNN instructions 1400 and MFNN instructions 1500. A microinstruction 105 may include an immediate field that specifies a particular function (as specified by an MTNN instruction 1400 or an MFNN instruction 1500), such as starting and stopping execution of a program in the program memory 129, directly executing a micro-operation from the media registers 118, or reading/writing a memory of the neural network unit as described above. The decoder 3404 decodes the microinstruction 105 and in response generates a micro-operation 3412 that is provided to a second input of the multiplexer 3402. Preferably, for some functions 1432/1532 of the MTNN instruction 1400/MFNN instruction 1500, the decoder 3404 need not generate a micro-operation 3412 to send down the pipeline 3401, for example: writing to the control register 127, starting execution of a program in the program memory 129, pausing execution of a program in the program memory 129, waiting for completion of execution of a program in the program memory 129, reading from the status register 127, and resetting the neural network unit 121.

A third micro-operation source of the multiplexer 3402 is the media registers 118 themselves. Preferably, as described above with respect to Figure 14, an MTNN instruction 1400 may specify a function that instructs the neural network unit 121 to directly execute a micro-operation 3414 provided by the media registers 118 to a third input of the multiplexer 3402. Direct execution of a micro-operation 3414 provided by the architectural media registers 118 is particularly useful for testing the neural network unit 121, e.g., built-in self test (BIST), and for debugging it.

Preferably, the decoder 3404 generates a mode indicator 3422 that controls the selection made by the multiplexer 3402. When an MTNN instruction 1400 specifies a function to start running a program from the program memory 129, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3416 from the sequencer 128, until either an error occurs or the decoder 3404 encounters an MTNN instruction 1400 that specifies a function to stop running the program from the program memory 129. When an MTNN instruction 1400 specifies a function that instructs the neural network unit 121 to directly execute a micro-operation 3414 provided by a media register 118, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3414 from the specified media register 118. Otherwise, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3412 from the decoder 3404.
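The three-way selection controlled by the mode indicator can be summarized in a small behavioral sketch. The enumeration names and the function signature are illustrative assumptions; only the selection rule itself is taken from the text.

```python
from enum import Enum


class Mode(Enum):
    RUN_PROGRAM = "sequencer"          # program running from program memory 129
    DIRECT_EXECUTE = "media_register"  # micro-operation supplied by a media register 118
    DECODED = "decoder"                # every other case


def select_micro_operation(mode, op_3416, op_3414, op_3412):
    """Model of multiplexer 3402: pick the micro-operation sent down the pipeline."""
    if mode is Mode.RUN_PROGRAM:
        return op_3416   # from the sequencer 128
    if mode is Mode.DIRECT_EXECUTE:
        return op_3414   # from the media registers 118
    return op_3412       # from the decoder 3404
```

In this model the mode stays at `Mode.RUN_PROGRAM` from the start-program function until an error or a stop-program function, mirroring the behavior described above.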

Variable rate neural network unit

In many situations, the neural network unit 121 runs a program and then sits idle, waiting for the processor 100 to do something it needs to do before the next program can run. For example, assume a situation similar to that described with respect to Figures 3 through 6A, in which the neural network unit 121 runs a multiply-accumulate-activation function program (which may also be referred to as a feed-forward neural network layer program) two or more times in succession. The processor 100 may take significantly longer to write the 512KB of weight values into the weight random access memory 124 for the next run of the neural network unit program than the neural network unit 121 takes to run the program. In other words, the neural network unit 121 may run the program in a short time and then sit idle until the processor 100 finishes writing the next weight values into the weight random access memory 124 for the next run of the program. This situation is illustrated in Figure 36A, which is described in more detail below. In such a situation, the neural network unit 121 may be run at a lower clock rate so that the program takes longer to execute, which spreads the energy consumption required to run the program over a longer period and keeps the neural network unit 121, and possibly the entire processor 100, at a lower temperature. This is referred to as relaxed mode and is illustrated in Figure 36B, which is described in more detail below.

Figure 35 is a block diagram showing a processor 100 with a variable rate neural network unit 121. The processor 100 is similar to the processor 100 of Figure 1, and like-numbered elements are similar. The processor 100 of Figure 35 also includes clock generation logic 3502 coupled to the functional units of the processor 100, namely the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106, the reservation stations 108, the neural network unit 121, the other execution units 112, the memory subsystem 114, the general purpose registers 116 and the media registers 118. The clock generation logic 3502 includes a clock generator, such as a phase-locked loop (PLL), that generates a clock signal having a primary clock rate, or clock frequency. For example, the primary clock rate may be 1 GHz, 1.5 GHz, 2 GHz and so forth. The clock rate indicates the number of cycles per second, i.e., the number of oscillations of the clock signal between a high and a low state. Preferably, the clock signal has a balanced duty cycle, i.e., it is high for half the cycle and low for the other half; alternatively, the clock signal has an unbalanced duty cycle, in which it is in the high state longer than it is in the low state, or vice versa. Preferably, the PLL is configurable to generate the primary clock signal at multiple clock rates. Preferably, the processor 100 includes a power management module that automatically adjusts the primary clock rate based on various factors, including the dynamically detected operating temperature of the processor 100, its utilization, and commands from system software (e.g., the operating system or the basic input/output system (BIOS)) that indicate the desired performance and/or power savings. In one embodiment, the power management module includes microcode of the processor 100.

The clock generation logic 3502 also includes a clock distribution network, or clock tree. The clock tree distributes the primary clock signal to the functional units of the processor 100, shown in Figure 35 as clock signal 3506-1 to the instruction fetch unit 101, clock signal 3506-2 to the instruction cache 102, clock signal 3506-10 to the instruction translator 104, clock signal 3506-9 to the rename unit 106, clock signal 3506-8 to the reservation stations 108, clock signal 3506-7 to the neural network unit 121, clock signal 3506-4 to the other execution units 112, clock signal 3506-3 to the memory subsystem 114, clock signal 3506-5 to the general purpose registers 116, and clock signal 3506-6 to the media registers 118; these signals are referred to collectively as the clock signals 3506. The clock tree includes nodes, or wires, that transmit the primary clock signals 3506 to their respective functional units. Additionally, the clock generation logic 3502 preferably includes clock buffers that regenerate the primary clock signal where a cleaner clock signal is needed and/or where the voltage level of the primary clock signal needs to be boosted, particularly for distant nodes. Additionally, each functional unit may have its own sub-clock tree that, as needed, regenerates and/or boosts the voltage level of the respective primary clock signal 3506 it receives.

The neural network unit 121 includes clock reduction logic 3504, which receives a relax indicator 3512 and the primary clock signal 3506-7 and in response generates a secondary clock signal. The secondary clock signal has a clock rate that is either the same as the primary clock rate or, in relaxed mode, reduced relative to the primary clock rate by an amount programmed into the relax indicator 3512, which reduces heat generation. The clock reduction logic 3504 is similar to the clock generation logic 3502 in that it has a clock distribution network, or clock tree, that distributes the secondary clock signal to various blocks of the neural network unit 121, shown as clock signal 3508-1 to the array of neural processing units 126, clock signal 3508-2 to the sequencer 128, and clock signal 3508-3 to the interface logic 3514; these signals are referred to collectively as the secondary clock signal 3508. Preferably, the neural processing units 126 include a plurality of pipeline stages 3401, as shown in Figure 34, and the pipeline stages 3401 include pipeline staging registers that receive the secondary clock signal 3508-1 from the clock reduction logic 3504.

The neural network unit 121 also includes interface logic 3514 that receives the primary clock signal 3506-7 and the secondary clock signal 3508-3. The interface logic 3514 is coupled between the lower portions of the front end of the processor 100 (e.g., the reservation stations 108, the media registers 118 and the general purpose registers 116) and various blocks of the neural network unit 121, namely the clock reduction logic 3504, the data random access memory 122, the weight random access memory 124, the program memory 129 and the sequencer 128. The interface logic 3514 includes a data random access memory buffer 3522, a weight random access memory buffer 3524, the decoder 3404 of Figure 34, and the relax indicator 3512. The relax indicator 3512 holds a value that specifies how much more slowly, if at all, the array of neural processing units 126 executes the instructions of a neural network unit program. Preferably, the relax indicator 3512 specifies a divisor value N by which the clock reduction logic 3504 divides the primary clock signal 3506-7 to generate the secondary clock signal 3508, so that the clock rate of the secondary clock signal is 1/N of the primary clock rate. Preferably, the value of N is programmable to any one of a plurality of different predetermined values, which cause the clock reduction logic 3504 to generate the secondary clock signal 3508 at a corresponding plurality of different clock rates, each less than the primary clock rate.

In one embodiment, clock reduction logic 3504 includes a clock divider circuit that divides the primary clock signal 3506-7 by the value of mitigation pointer 3512. In one embodiment, clock reduction logic 3504 includes a clock gate (for example an AND gate) that gates the primary clock signal 3506-7 with an enable signal that is true only once in every N cycles of the primary clock signal. One example of a circuit that generates the enable signal includes a counter that counts up to N. When accompanying logic detects that the counter output matches N, the logic generates a true pulse on the second clock signal 3508 and resets the counter. Preferably, the value of mitigation pointer 3512 is programmable by an architectural instruction, such as the MTNN instruction 1400 of Figure 14. Preferably, the architectural program running on processor 100 programs the mitigation value into mitigation pointer 3512 before it instructs neural network unit 121 to begin executing the neural network unit program, as described in more detail below with respect to Figure 37.
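The counter-based clock gate described above can be illustrated with a small software model. This is only a sketch: the function name `gated_clock` and the cycle-by-cycle list representation are assumptions made for illustration and do not come from the embodiments.

```python
def gated_clock(primary_cycles, n):
    """Model the enable signal of the clock gate, one entry per primary clock cycle.

    An entry is 1 when the gate passes a pulse to the second clock signal and 0
    otherwise. The divisor n plays the role of the value in the mitigation pointer.
    """
    if n < 1:
        raise ValueError("divisor N must be at least 1")
    pulses = []
    counter = 0
    for _ in range(primary_cycles):
        counter += 1               # the counter counts up toward N
        enable = counter == n      # compare logic detects the match with N
        pulses.append(1 if enable else 0)
        if enable:
            counter = 0            # the counter is reset after the pulse
    return pulses


normal = gated_clock(8, 1)         # every primary cycle passes through
mitigated = gated_clock(8, 4)      # one pulse in every four primary cycles
```

With N = 1 the second clock signal follows the primary clock signal; with N = 4 only every fourth primary cycle produces a pulse, which corresponds to a second clock rate of one quarter of the primary clock rate.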

Weight random access memory buffer 3524 is coupled between weight random access memory 124 and media registers 118 to buffer data transferred between them. Preferably, weight random access memory buffer 3524 is similar to one or more embodiments of buffer 1704 of Figure 17. Preferably, the portion of weight random access memory buffer 3524 that receives data from media registers 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion that receives data from weight random access memory 124 is clocked by the second clock signal 3508-3 at the second clock rate. The second clock rate may or may not be reduced from the primary clock rate, depending on the value programmed into mitigation pointer 3512, that is, depending on whether neural network unit 121 is operating in mitigation mode or normal mode. In one embodiment, weight random access memory 124 is single-ported, as described above with respect to Figure 17, and is accessible in an arbitrated fashion both by media registers 118 via weight random access memory buffer 3524 and by the neural processing units 126 or the column buffer 1104 of Figure 11. In another embodiment, weight random access memory 124 is dual-ported, as described above with respect to Figure 16, and each port is accessible concurrently both by media registers 118 via weight random access memory buffer 3524 and by the neural processing units 126 or column buffer 1104.

Like weight random access memory buffer 3524, data random access memory buffer 3522 is coupled between data random access memory 122 and media registers 118 to buffer data transferred between them. Preferably, data random access memory buffer 3522 is similar to one or more embodiments of buffer 1704 of Figure 17. Preferably, the portion of data random access memory buffer 3522 that receives data from media registers 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion that receives data from data random access memory 122 is clocked by the second clock signal 3508-3 at the second clock rate. The second clock rate may or may not be reduced from the primary clock rate, depending on the value programmed into mitigation pointer 3512, that is, depending on whether neural network unit 121 is operating in mitigation mode or normal mode. In one embodiment, data random access memory 122 is single-ported, as described above with respect to Figure 17, and is accessible in an arbitrated fashion both by media registers 118 via data random access memory buffer 3522 and by the neural processing units 126 or the column buffer 1104 of Figure 11. In another embodiment, data random access memory 122 is dual-ported, as described above with respect to Figure 16, and each port is accessible concurrently both by media registers 118 via data random access memory buffer 3522 and by the neural processing units 126 or column buffer 1104.

Preferably, regardless of whether data random access memory 122 and/or weight random access memory 124 are single-ported or dual-ported, interface logic 3514 includes data random access memory buffer 3522 and weight random access memory buffer 3524 to provide synchronization between the primary clock domain and the second clock domain. Preferably, data random access memory 122, weight random access memory 124 and program memory 129 each comprise a static random access memory (SRAM) with its own read enable signal, write enable signal and memory select signal.

As described above, neural network unit 121 is an execution unit of processor 100. An execution unit is a functional unit of a processor that executes the microinstructions into which architectural instructions are translated, or that executes the architectural instructions themselves, such as the microinstructions 105 into which the architectural instructions 103 of Figure 1 are translated, or the architectural instructions 103 themselves. An execution unit receives operands from the registers of the processor, for example from general-purpose registers 116 and media registers 118. After executing a microinstruction or architectural instruction, an execution unit generates a result that may be written to the registers. The MTNN instruction 1400 and MFNN instruction 1500 described with respect to Figures 14 and 15 are examples of architectural instructions 103. Microinstructions implement architectural instructions. More precisely, the collective execution, by the execution units, of the one or more microinstructions into which an architectural instruction is translated performs the operation specified by the architectural instruction on the inputs specified by the architectural instruction, to produce the result defined by the architectural instruction.

Figure 36A is a timing diagram showing an example of the operation of processor 100 with neural network unit 121 operating in normal mode, that is, at the primary clock rate. In the timing diagram, time progresses from left to right. Processor 100 executes the architectural program at the primary clock rate. More precisely, the front end of processor 100 (for example instruction fetch unit 101, instruction cache 102, instruction translator 104, rename unit 106 and reservation stations 108) fetches, decodes and issues architectural instructions to neural network unit 121 and the other execution units 112 at the primary clock rate.

Initially, the architectural program executes an architectural instruction (for example MTNN instruction 1400) that the front end of processor 100 issues to neural network unit 121 and that instructs neural network unit 121 to begin executing the neural network unit program in its program memory 129. Before that, the architectural program executed an architectural instruction to write mitigation pointer 3512 with a value that specifies the primary clock rate, that is, to place the neural network unit in normal mode. More precisely, the value programmed into mitigation pointer 3512 causes clock reduction logic 3504 to generate the second clock signal 3508 at the primary clock rate of the primary clock signal 3506. Preferably, in this case, the clock buffers of clock reduction logic 3504 simply boost the voltage level of the primary clock signal 3506. Also before that, the architectural program executed architectural instructions to write data random access memory 122 and weight random access memory 124 and to write the neural network unit program into program memory 129. In response to the MTNN instruction 1400 that starts the neural network unit program, neural network unit 121 begins executing the neural network unit program at the primary clock rate, because mitigation pointer 3512 was programmed with the primary clock rate value. After neural network unit 121 starts executing, the architectural program continues to execute architectural instructions at the primary clock rate, predominantly MTNN instructions 1400 that write and/or read data random access memory 122 and weight random access memory 124, in preparation for the next instance, invocation or run of a neural network unit program.

In the example of Figure 36A, neural network unit 121 finishes executing the neural network unit program in significantly less time (for example one quarter of the time) than the architectural program takes to finish writing/reading data random access memory 122 and weight random access memory 124. For example, running at the primary clock rate, neural network unit 121 might take approximately 1000 clock cycles to execute the neural network unit program, whereas the architectural program takes approximately 4000 clock cycles. Consequently, neural network unit 121 sits idle for the remaining time, which in this example is considerable, approximately 3000 primary clock cycles. As shown in the example of Figure 36A, this pattern repeats, possibly many times in succession, depending on the size and configuration of the neural network. Because neural network unit 121 is a relatively large and transistor-dense functional unit of processor 100, its operation generates a significant amount of heat, particularly when it runs at the primary clock rate.

Figure 36B is a timing diagram showing an example of the operation of processor 100 with neural network unit 121 operating in mitigation mode, in which the clock rate is lower than the primary clock rate. The timing diagram of Figure 36B is similar to that of Figure 36A in that processor 100 executes the architectural program at the primary clock rate. The example assumes that the architectural program and neural network unit program of Figure 36B are the same as those of Figure 36A. However, before starting the neural network unit program, the architectural program executes an MTNN instruction 1400 that programs mitigation pointer 3512 with a value that causes clock reduction logic 3504 to generate the second clock signal 3508 at a second clock rate that is lower than the primary clock rate. That is, the architectural program places neural network unit 121 in the mitigation mode of Figure 36B rather than the normal mode of Figure 36A. Consequently, the neural processing units 126 execute the neural network unit program at the second clock rate, which in mitigation mode is lower than the primary clock rate. In this example, mitigation pointer 3512 is assumed to be programmed with a value that specifies a second clock rate of one quarter of the primary clock rate. As a result, neural network unit 121 takes approximately four times longer to execute the neural network unit program in mitigation mode than in normal mode, and a comparison of Figures 36A and 36B shows that the time neural network unit 121 sits idle is significantly shortened. Consequently, the period over which neural network unit 121 consumes energy executing the neural network unit program in Figure 36B is approximately four times as long as the period over which it executes the program in normal mode in Figure 36A. Therefore, the heat that neural network unit 121 generates per unit time while executing the neural network unit program in Figure 36B is approximately one quarter of that in Figure 36A, with the benefits described herein.
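The figures quoted for Figures 36A and 36B can be checked with a short calculation. The sketch below is a first-order model only: the function name `mitigation_timing` is hypothetical, and the heat rate is assumed to scale as 1/N, which is the approximation the text itself uses.

```python
def mitigation_timing(nnu_cycles, arch_cycles, n):
    """Return (busy, idle, relative_heat_rate) measured in primary clock cycles.

    nnu_cycles  -- cycles the neural network unit program needs at the primary rate
    arch_cycles -- cycles the architectural program needs for its memory accesses
    n           -- divisor programmed into the mitigation pointer
    """
    busy = nnu_cycles * n                 # execution stretches by a factor of N
    idle = max(arch_cycles - busy, 0)     # time spent waiting for the architectural program
    relative_heat_rate = 1 / n            # first-order assumption: heat per unit time ~ 1/N
    return busy, idle, relative_heat_rate


normal_mode = mitigation_timing(1000, 4000, 1)
mitigation_mode = mitigation_timing(1000, 4000, 4)
```

With the example values, normal mode gives 1000 busy cycles and 3000 idle cycles, while a divisor of 4 gives 4000 busy cycles, no idle time, and a relative heat rate of one quarter.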

Figure 37 is a flowchart showing the operation of processor 100 of Figure 35. The operation described by the flowchart is similar to that described above with respect to Figures 35, 36A and 36B. Flow begins at step 3702.

In step 3702, processor 100 executes MTNN instructions 1400 to write weights to weight random access memory 124 and to write data to data random access memory 122. Flow proceeds to step 3704.

In step 3704, processor 100 executes an MTNN instruction 1400 to program mitigation pointer 3512 with a value that specifies a clock rate lower than the primary clock rate, that is, to place neural network unit 121 in mitigation mode. Flow proceeds to step 3706.

In step 3706, processor 100 executes an MTNN instruction 1400 that instructs neural network unit 121 to begin executing the neural network unit program, in a manner similar to that shown in Figure 36B. Flow proceeds to step 3708.

In step 3708, neural network unit 121 begins executing the neural network unit program. Meanwhile, processor 100 executes MTNN instructions 1400 to write new weights to weight random access memory 124 (and possibly new data to data random access memory 122), and/or executes MFNN instructions 1500 to read results from data random access memory 122 (and possibly from weight random access memory 124). Flow proceeds to step 3712.

In step 3712, processor 100 executes an MFNN instruction 1500 (for example to read status register 127) to detect that neural network unit 121 has finished executing its program. Assuming the architectural program selected a good value for mitigation pointer 3512, the time neural network unit 121 takes to execute the neural network unit program will be the same as the time processor 100 takes to execute the portion of the architectural program that accesses weight random access memory 124 and/or data random access memory 122, as shown in Figure 36B. Flow proceeds to step 3714.

In step 3714, processor 100 executes an MTNN instruction 1400 to program mitigation pointer 3512 with a value that specifies the primary clock rate, that is, to place neural network unit 121 in normal mode. Flow proceeds to step 3716.

In step 3716, processor 100 executes an MTNN instruction 1400 that instructs neural network unit 121 to begin executing the neural network unit program, in a manner similar to that shown in Figure 36A. Flow proceeds to step 3718.

In step 3718, neural network unit 121 begins executing the neural network unit program in normal mode. Flow ends at step 3718.
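Step 3712 assumes that the architectural program picked a good mitigation value. One plausible way to make that choice is sketched below; it is not taken from the embodiments. The function name `choose_divisor` and the set of allowed divisors `(1, 2, 4, 8)` are hypothetical, and the cycle estimates are assumed to be known in advance.

```python
def choose_divisor(nnu_cycles, arch_cycles, allowed=(1, 2, 4, 8)):
    """Pick the largest allowed divisor N that does not delay the architectural program.

    nnu_cycles  -- estimated cycles for the neural network unit program at the primary rate
    arch_cycles -- estimated cycles the architectural program spends on memory accesses
    allowed     -- hypothetical set of predetermined divisor values
    """
    fitting = [n for n in allowed if nnu_cycles * n <= arch_cycles]
    # Fall back to the smallest divisor (fastest clock) when nothing fits.
    return max(fitting) if fitting else min(allowed)


example = choose_divisor(1000, 4000)
```

For the 1000-cycle and 4000-cycle example of Figures 36A and 36B, this selects a divisor of 4, which matches the quarter-rate case discussed above.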

As described above, executing the neural network unit program in mitigation mode spreads out the execution time relative to executing it in normal mode (that is, at the primary clock rate of the processor), and can avoid high temperatures. More specifically, when the neural network unit executes the program in mitigation mode, it generates heat at a lower rate, and that heat can be dissipated more readily through the neural network unit (for example the semiconductor devices, metal layers and underlying substrate), the surrounding package and the cooling solution (for example heat sink, fan). The devices in the neural network unit (for example transistors, capacitors, wires) are therefore more likely to operate at lower temperatures. Overall, operating in mitigation mode also helps reduce device temperatures in other portions of the processor die. Lower operating temperatures, in particular lower junction temperatures of these devices, reduce leakage current. In addition, because the current drawn per unit time is lower, inductive noise and IR-drop noise are also reduced. Furthermore, lower temperatures have a positive effect on the negative-bias temperature instability (NBTI) and positive-bias temperature instability (PBTI) of the metal-oxide-semiconductor field-effect transistors (MOSFETs) in the processor, which improves the reliability and/or lifetime of the devices and of the processor part. Lower temperatures also reduce Joule heating and electromigration in the metal layers of the processor.

Communication mechanism between architectural and non-architectural programs regarding shared resources of the neural network unit

As described above, in the examples of Figures 24 through 28 and Figures 35 through 37, data random access memory 122 and weight random access memory 124 are shared resources. The neural processing units 126 and the front end of processor 100 share data random access memory 122 and weight random access memory 124. More precisely, both the neural processing units 126 and the front end of processor 100, for example media registers 118, read and write data random access memory 122 and weight random access memory 124. In other words, the architectural program running on processor 100 and the neural network unit program running on neural network unit 121 share data random access memory 122 and weight random access memory 124, and in some cases, as described above, this requires flow control between the architectural program and the neural network unit program. Program memory 129 is also a shared resource to some degree, because the architectural program writes it and sequencer 128 reads it. The embodiments described herein provide a dynamic solution for controlling the flow of accesses to the shared resources between the architectural program and the neural network unit program.

In the embodiments described herein, neural network unit programs are also referred to as non-architectural programs, neural network unit instructions are also referred to as non-architectural instructions, and the neural network unit instruction set (also referred to above as the neural processing unit instruction set) is also referred to as the non-architectural instruction set. The non-architectural instruction set is distinct from the architectural instruction set. In embodiments in which processor 100 includes an instruction translator 104 that translates architectural instructions into microinstructions, the non-architectural instruction set is also distinct from the microinstruction set.

Figure 38 is a block diagram showing sequencer 128 of neural network unit 121 in more detail. Sequencer 128 provides a memory address to program memory 129 to select the non-architectural instruction that is supplied to sequencer 128, as described above. As shown in Figure 38, the memory address is held in a program counter 3802 of sequencer 128. Sequencer 128 normally increments through sequential addresses of program memory 129 unless it encounters a non-architectural control instruction, such as a loop or branch instruction, in which case sequencer 128 updates program counter 3802 to the target address of the control instruction, that is, to the address of the non-architectural instruction at the target of the control instruction. Thus, the address 131 held in program counter 3802 specifies the address in program memory 129 of the non-architectural instruction of the non-architectural program that is currently being fetched for execution by the neural processing units 126. The value of program counter 3802 can be obtained by the architectural program through the neural network unit program counter field 3912 of status register 127, as described below with respect to Figure 39. This enables the architectural program to decide, based on the progress of the non-architectural program, where to read/write data from/to data random access memory 122 and/or weight random access memory 124.

Sequencer 128 also includes a loop counter 3804 that operates in conjunction with non-architectural loop instructions, such as the loop-to-1 instruction at address 10 of Figure 26A and the loop-to-1 instruction at address 11 of Figure 28. In the examples of Figures 26A and 28, loop counter 3804 is loaded with the value specified in the non-architectural initialize instruction at address 0, for example the value 400. Each time sequencer 128 encounters the loop instruction and jumps to the target instruction (for example the multiply-accumulate instruction at address 1 of Figure 26A or the maxwacc instruction at address 1 of Figure 28), sequencer 128 decrements loop counter 3804. Once loop counter 3804 reaches zero, sequencer 128 proceeds to the next sequential non-architectural instruction. In another embodiment, loop counter 3804 is loaded with a loop count value specified in the loop instruction when the loop instruction is first encountered, which removes the need to initialize loop counter 3804 with a non-architectural initialize instruction. Thus, the value of loop counter 3804 indicates how many more times the loop body of the non-architectural program remains to be executed. The value of loop counter 3804 can be obtained by the architectural program through the loop count field 3914 of status register 127, as described below with respect to Figure 39. This enables the architectural program to decide, based on the progress of the non-architectural program, where to read/write data from/to data random access memory 122 and/or weight random access memory 124. In one embodiment, the sequencer includes three additional loop counters to accommodate nested loops in the non-architectural program, and the values of these three loop counters are also readable through status register 127. A bit in the loop instruction indicates which of the four loop counters the current loop instruction uses.
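The interaction between the program counter and the loop counter can be traced with a toy model. This is a sketch under stated assumptions: the tuple-based program encoding, the opcode names and the function `trace_sequencer` are hypothetical, and the loop instruction is assumed to decrement the counter first and then jump only while the counter is still greater than zero.

```python
def trace_sequencer(program):
    """Return the sequence of program-memory addresses fetched by a toy sequencer.

    program is a list of (opcode, argument) tuples:
      ("initialize", count) loads the loop counter,
      ("loop", target)      decrements the loop counter and jumps while it is non-zero,
      any other opcode simply advances to the next sequential address.
    """
    program_counter = 0
    loop_counter = 0
    fetched = []
    while program_counter < len(program):
        opcode, argument = program[program_counter]
        fetched.append(program_counter)
        if opcode == "initialize":
            loop_counter = argument
            program_counter += 1
        elif opcode == "loop":
            loop_counter -= 1
            if loop_counter > 0:
                program_counter = argument    # jump to the loop target
            else:
                program_counter += 1          # fall through to the next instruction
        else:
            program_counter += 1
    return fetched


toy_program = [("initialize", 3), ("mult_accum", None), ("loop", 1), ("write", None)]
trace = trace_sequencer(toy_program)
```

With a loop count of 3, the instruction at address 1 is fetched three times before the sequencer falls through to address 3. At any point in such a run, the live loop counter value is what the loop count field 3914 would expose to the architectural program.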

Sequencer 128 also includes an iteration counter 3806. Iteration counter 3806 operates in conjunction with non-architectural instructions such as the multiply-accumulate instructions at address 2 of Figures 4, 9, 20 and 26A and the maxwacc instruction at address 2 of Figure 28, which are referred to hereafter as "execute" instructions. In those examples, the execute instructions specify iteration counts of 511, 511, 1023, 2 and 3, respectively. When sequencer 128 encounters an execute instruction that specifies a non-zero iteration count, sequencer 128 loads iteration counter 3806 with the specified value. In addition, sequencer 128 generates the appropriate micro-operation 3418 to control the logic in the pipeline stages 3401 of the neural processing units 126 of Figure 34 for execution, and decrements iteration counter 3806. If iteration counter 3806 is greater than zero, sequencer 128 again generates an appropriate micro-operation 3418 to control the logic in the neural processing units 126 and decrements iteration counter 3806. Sequencer 128 continues in this fashion until iteration counter 3806 reaches zero. Thus, the value of iteration counter 3806 indicates how many of the operations specified in the non-architectural execute instruction remain to be performed (for example multiply-accumulate, maximum or sum operations on the accumulator value and a data/weight word). The value of iteration counter 3806 can be obtained by the architectural program through the iteration count field 3916 of status register 127, as described below with respect to Figure 39. This enables the architectural program to decide, based on the progress of the non-architectural program, where to read/write data from/to data random access memory 122 and/or weight random access memory 124.
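The iteration counter behaviour can be modelled in the same spirit. In this sketch the function name `issue_micro_operations` is hypothetical; it records the counter value left after each micro-operation, which is the value an architectural program would observe if it sampled the iteration count field at that moment.

```python
def issue_micro_operations(iteration_count):
    """Return the iteration counter value observed after each issued micro-operation."""
    if iteration_count < 1:
        raise ValueError("this sketch models only non-zero iteration counts")
    counter = iteration_count           # the counter is loaded from the execute instruction
    remaining_after_each = []
    while counter > 0:
        counter -= 1                    # one micro-operation is issued, then the counter decrements
        remaining_after_each.append(counter)
    return remaining_after_each


remaining = issue_micro_operations(3)
```

An iteration count of 3 therefore yields three micro-operations, with the counter reading 2, 1 and 0 after each in turn.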

Figure 39 is a block diagram showing several fields of the control and status register 127 of neural network unit 121. The fields include the address 2602 of the weight random access memory column most recently written by the neural processing units 126 executing the non-architectural program, the address 2604 of the weight random access memory column most recently read by the neural processing units 126 executing the non-architectural program, the address 2606 of the data random access memory column most recently written by the neural processing units 126 executing the non-architectural program, and the address 2608 of the data random access memory column most recently read by the neural processing units 126 executing the non-architectural program, as shown in Figure 26B above. The fields also include a neural network unit program counter field 3912, a loop counter field 3914 and an iteration counter field 3916. As described above, the architectural program can read the contents of status register 127 into media registers 118 and/or general-purpose registers 116, for example by means of MFNN instructions 1500, including the values of the neural network unit program counter field 3912, loop counter field 3914 and iteration counter field 3916. The value of program counter field 3912 reflects the value of program counter 3802 of Figure 38. The value of loop counter field 3914 reflects the value of loop counter 3804. The value of iteration counter field 3916 reflects the value of iteration counter 3806. In one embodiment, sequencer 128 updates program counter field 3912, loop counter field 3914 and iteration counter field 3916 each time it modifies program counter 3802, loop counter 3804 or iteration counter 3806, so that the field values are current when the architectural program reads them. In another embodiment, when neural network unit 121 executes an architectural instruction that reads status register 127, neural network unit 121 simply obtains the values of program counter 3802, loop counter 3804 and iteration counter 3806 and provides them back to the architectural instruction (for example to media registers 118 or general-purpose registers 116).

As may be observed from the foregoing, the values of the fields of status register 127 of Figure 39 can be characterized as information that indicates the progress made by the non-architectural program while it is being executed by the neural network unit. Specific aspects of the progress of the non-architectural program have been described above, such as the value of program counter 3802, the value of loop counter 3804, the value of iteration counter 3806, the fields 2602/2604 of the most recently written/read weight random access memory 124 address 125, and the fields 2606/2608 of the most recently written/read data random access memory 122 address 123. The architectural program running on processor 100 can read the non-architectural program progress values of Figure 39 from status register 127 and use the information to make decisions, for example by means of architectural instructions such as compare and branch instructions. For example, the architectural program decides which columns of data random access memory 122 and/or weight random access memory 124 to write/read data/weights to/from, in order to control the flow of data into and out of data random access memory 122 or weight random access memory 124, particularly for large data sets and/or for overlapping execution of different non-architectural programs. Examples of such decisions made by the architectural program are described in the sections above and below.

For example, as described above with respect to Figure 26A, the architectural program configures the non-architectural program to write the results of the convolutions back to columns of data random access memory 122 above convolution kernel 2402 (for example above column 8), and the architectural program reads the results from data random access memory 122 as neural network unit 121 writes them, using the address of the most recently written data random access memory 122 column 2606.

In another example, as described above with respect to Figure 26B, the architectural program uses the information from the status register 127 fields of Figure 38 to determine the progress of the non-architectural program in performing a convolution of data array 2404 of Figure 24 in 5 blocks of 512 x 1600. The architectural program writes the first 512 x 1600 block of the 2560 x 1600 data array into weight random access memory 124 and starts the non-architectural program, which has a loop count of 1600 and an initialized weight random access memory 124 output column of 0. As neural network unit 121 executes the non-architectural program, the architectural program reads status register 127 to determine the most recently written weight random access memory 124 column 2602, so that it can read the valid convolution results written by the non-architectural program and, after reading them, overwrite them with the next 512 x 1600 block. In this way, when neural network unit 121 completes the non-architectural program on the first 512 x 1600 block, processor 100 can immediately update the non-architectural program as needed and start it again to process the next 512 x 1600 block.

In another example, assume the architectural program has neural network unit 121 perform a series of classic neural network multiply-accumulate-activation function operations in which the weights are stored in weight random access memory 124 and the results are written back to data random access memory 122. In this case, once the non-architectural program has read a column of weight random access memory 124, it will not read that column again. Therefore, once the current weights have been read/used by the non-architectural program, the architectural program can begin overwriting the weights in weight random access memory 124 with new weights for the next instance of a non-architectural program (for example for the next neural network layer). In this case, the architectural program reads status register 127 to obtain the address of the most recently read weight random access memory column 2604 in order to decide where it may write the new set of weights into weight random access memory 124.
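The overwrite decision described above can be sketched as a simple window calculation. This is an illustration only: the function name `overwrite_window` and its bookkeeping argument are hypothetical, and the non-architectural program is assumed to read weight columns once each, in ascending address order starting at column 0.

```python
def overwrite_window(last_read_column, next_unwritten_column):
    """Return the weight columns that can be overwritten with new weights right now.

    last_read_column      -- value read from field 2604 of the status register
    next_unwritten_column -- hypothetical bookkeeping kept by the architectural program:
                             the lowest column it has not yet overwritten
    Assumes columns are consumed once each in ascending order, so every column up to
    and including last_read_column is no longer needed by the running program.
    """
    return list(range(next_unwritten_column, last_read_column + 1))


window = overwrite_window(last_read_column=5, next_unwritten_column=2)
```

Here columns 2 through 5 have been consumed but not yet replaced, so they are safe targets for the new weights. An empty list means the architectural program has caught up and should poll the status register again later.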

In another example, assume the architectural program knows that the non-architectural program includes an execute instruction with a large iteration count, such as the non-architectural multiply-accumulate instruction at address 2 of Figure 20. In this case, the architectural program may need to know the iteration count 3916 in order to know approximately how many more clock cycles it will take to complete the non-architectural instruction, so that it can decide which of two or more actions to take next. For example, if the time to completion is long, the architectural program may relinquish control to another architectural program, such as the operating system. Similarly, assume the architectural program knows that the non-architectural program includes a loop body with a relatively large loop count, such as the non-architectural program of Figure 28. In this case, the architectural program may need to know the loop count 3914 in order to know approximately how many more clock cycles it will take to complete this non-architectural instruction, so that it can decide which of two or more actions to take next.
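A decision of this kind could be sketched as follows. All names and parameters here are hypothetical; in particular, the cost of one iteration in second-clock cycles and the threshold at which yielding becomes worthwhile are assumptions that a real architectural program would have to supply.

```python
def should_yield(remaining_iterations, cycles_per_iteration, clock_divisor, threshold_cycles):
    """Decide whether the architectural program should give up control while it waits.

    remaining_iterations -- value read from iteration count field 3916 (or loop count 3914)
    cycles_per_iteration -- assumed cost of one iteration in second-clock cycles
    clock_divisor        -- divisor N currently programmed into the mitigation pointer
    threshold_cycles     -- assumed break-even point, in primary clock cycles
    """
    remaining_primary_cycles = remaining_iterations * cycles_per_iteration * clock_divisor
    return remaining_primary_cycles >= threshold_cycles


long_wait = should_yield(1023, 1, 4, 1000)   # about 4092 primary cycles remain
short_wait = should_yield(2, 1, 1, 1000)     # about 2 primary cycles remain
```

In the first case the estimated wait exceeds the threshold, so handing control to another program such as the operating system would be reasonable; in the second case the architectural program would simply keep waiting.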

In another example, assume the architectural program causes the neural network unit 121 to perform a pooling operation similar to that described with respect to Figures 27 and 28, in which the data to be pooled is stored in the weight random access memory 124 and the results are written back to the weight random access memory 124. However, unlike the example of Figures 27 and 28, assume the results of this example are written back to the top 400 columns of the weight random access memory 124, e.g., columns 1600 to 1999. In this case, once the non-architectural program has read the four columns of weight random access memory 124 data that it pools, it does not read them again. Therefore, once the current four columns of data have been read/used by the non-architectural program, the architectural program may begin overwriting the data in the weight random access memory 124 with new data (e.g., the weights for the next execution instance of a non-architectural program, such as a non-architectural program that performs typical multiply-accumulate-activation-function operations on the acquired data). In this case, the architectural program reads the status register 127 to obtain the address of the most recently read column 2604 of the weight random access memory in order to determine where in the weight random access memory 124 to write the new set of weights.
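As a minimal sketch of how the address of the most recently read column 2604 could be used, the helper below lists the weight random access memory columns that are already safe to overwrite. The function name is hypothetical, and the sketch assumes the program on the neural network unit reads its columns in ascending order starting at `first_col`, which this description does not state explicitly.

```python
def overwritable_columns(first_col, most_recently_read):
    """Columns the neural network unit has already consumed.

    Assumption: columns are read once each, in ascending order, so every
    column from first_col up to and including the most recently read one
    will not be read again and may receive new weights.
    """
    return range(first_col, most_recently_read + 1)
```

For instance, if reading started at column 0 and the status register reports column 3, columns 0 through 3 could be refilled while the remaining columns are still in use.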

Recurrent neural network acceleration

A conventional feedforward neural network has no memory of previous inputs to the network. Feedforward neural networks are generally used for tasks in which the multiple inputs presented to the network over time are independent of one another, as are the multiple outputs. In contrast, recurrent neural networks are generally helpful for tasks in which the order of the inputs presented to the neural network over time is significant. (The elements of the sequence are commonly referred to as time steps.) Accordingly, a recurrent neural network includes a notional memory, or internal state, that holds information generated by the computations the network performed in response to previous inputs in the sequence, and the output of the recurrent neural network depends on this internal state as well as on the input of the next time step. Tasks such as speech recognition, language modeling, text generation, language translation, image description generation, and certain forms of handwriting recognition are examples of tasks that recurrent neural networks tend to perform well.

Three well-known examples of recurrent neural networks are the Elman recurrent neural network, the Jordan recurrent neural network, and the long short-term memory (LSTM) neural network. An Elman recurrent neural network includes context nodes that remember the state of the hidden layer of the recurrent neural network for the current time step, and this state is provided as an input to the hidden layer in the next time step. A Jordan recurrent neural network is similar to an Elman recurrent neural network, except that its context nodes remember the state of the output layer of the recurrent neural network rather than the state of the hidden layer. A long short-term memory neural network includes a long short-term memory layer made up of long short-term memory cells. Each long short-term memory cell has a current state and a current output of the current time step and a new state and a new output of a new, or subsequent, time step. A long short-term memory cell includes an input gate, an output gate, and a forget gate, the forget gate enabling the cell to forget its remembered state. These three types of recurrent neural networks are described in more detail in the following sections.

As described herein, for a recurrent neural network such as an Elman or Jordan recurrent neural network, each execution instance of the neural network unit takes a time step, obtains a set of input layer node values, and performs the computations necessary to propagate them through the recurrent neural network to generate the output layer node values as well as the hidden layer and context layer node values. Thus, input layer node values are associated with the time step in which they are used to compute the hidden, output and context layer node values; and the hidden, output and context layer node values are associated with the time step in which they are generated. Input layer node values are sampled values of the system being modeled by the recurrent neural network, e.g., an image, a speech sample, or a snapshot of commercial market data. For a long short-term memory neural network, each execution instance of the neural network unit takes a time step, obtains a set of memory cell input values, and performs the computations necessary to generate the memory cell output values (as well as the memory cell state and the input gate, forget gate and output gate values), which may also be understood as propagating the memory cell input values through the memory cells of the long short-term memory layer. Thus, memory cell input values are associated with the time step in which they are used to compute the memory cell state and the input gate, forget gate and output gate values; and the memory cell state and the input gate, forget gate and output gate values are associated with the time step in which they are generated.

Context layer node values, also referred to as state nodes, are the state of the neural network, and this state is based on the input layer node values associated with previous time steps, not only on the input layer node values associated with the current time step. The computations performed by the neural network unit for a time step (e.g., the hidden layer node value computations for an Elman or Jordan recurrent neural network) are a function of the context layer node values generated in the previous time step. Therefore, the state of the network at the beginning of a time step (the context node values) influences the output layer node values generated during that time step. Furthermore, the state of the network at the end of the time step is affected by both the input node values of that time step and the state of the network at the beginning of the time step. Similarly, for long short-term memory cells, the memory cell state values are based on the memory cell input values associated with previous time steps, not only on the memory cell input values associated with the current time step. Because the computations performed by the neural network unit for a time step (e.g., the computation of the next memory cell state) are a function of the memory cell state values generated in the previous time step, the state of the network at the beginning of a time step (the memory cell state values) influences the memory cell output values generated during that time step, and the state of the network at the end of the time step is affected by both the memory cell input values of that time step and the previous state of the network.

Figure 40 is a block diagram illustrating an example of an Elman recurrent neural network. The Elman recurrent neural network of Figure 40 includes input layer nodes, or neurons, denoted D0, D1 through Dn, referred to collectively as input layer nodes D and individually, generically, as input layer node D; hidden layer nodes/neurons, denoted Z0, Z1 through Zn, referred to collectively as hidden layer nodes Z and individually, generically, as hidden layer node Z; output layer nodes/neurons, denoted Y0, Y1 through Yn, referred to collectively as output layer nodes Y and individually, generically, as output layer node Y; and context layer nodes/neurons, denoted C0, C1 through Cn, referred to collectively as context layer nodes C and individually, generically, as context layer node C. In the example Elman recurrent neural network of Figure 40, each hidden layer node Z has an input connection to the output of each input layer node D and an input connection to the output of each context layer node C; each output layer node Y has an input connection to the output of each hidden layer node Z; and each context layer node C has an input connection to the output of the corresponding hidden layer node Z.

In many respects, the Elman recurrent neural network operates similarly to a conventional feedforward neural network. That is, for a given node, each input connection to the node has an associated weight; the value received by the node on an input connection is multiplied by its associated weight to generate a product; the node adds the products associated with all of its input connections to generate a sum (a bias term may also be included in the sum); and, typically, an activation function is performed on the sum to generate the output value of the node, sometimes referred to as the activation of the node. In a conventional feedforward network, data always flows in one direction, from the input layer to the output layer. That is, the input layer provides values to the hidden layer (typically there are multiple hidden layers), the hidden layer generates its output values and provides them to the output layer, and the output layer generates outputs that may be captured.
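The per-node computation just described (multiply each input by its weight, sum the products, optionally add a bias, then apply an activation function) can be written as a short sketch. The function name and the choice of `math.tanh` as the default are illustrative assumptions only.

```python
import math


def node_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Output value of a single node.

    inputs     -- values received on the node's input connections
    weights    -- one weight per input connection, in the same order
    bias       -- optional bias term included in the sum
    activation -- function applied to the sum (tanh chosen for illustration)
    """
    total = bias + sum(x * w for x, w in zip(inputs, weights))
    return activation(total)
```

Passing `activation=lambda s: s` returns the bare sum, which corresponds to a node that forwards its accumulated value without applying an activation function.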

However, unlike a conventional feedforward network, the Elman recurrent neural network also includes some feedback connections, namely the connections from the hidden layer nodes Z to the context layer nodes C in Figure 40. The Elman recurrent neural network operates as follows: when the input layer nodes D provide an input value to the hidden layer nodes Z in a new time step, the context nodes C provide to the hidden layer Z a value that is the output value the hidden layer nodes Z generated in response to the previous input, that is, their output value as of the current time step. In this sense, the context nodes C of the Elman recurrent neural network are a memory based on the input values of previous time steps. Figures 41 and 42 illustrate an embodiment of the operation of the neural network unit 121 to perform the computations associated with the Elman recurrent neural network of Figure 40.

For purposes of the present disclosure, an Elman recurrent neural network is a recurrent neural network comprising at least an input node layer, a hidden node layer, an output node layer, and a context node layer. For a given time step, the context node layer stores the results that the hidden node layer generated in the previous time step and fed back to the context node layer. The results fed back to the context layer may be the results of an activation function, or they may be the results of accumulations performed by the hidden node layer without performance of an activation function.
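One time step of such a network can be sketched in plain Python as below. This is a software illustration of the data flow only, not of the hardware: the function and parameter names are hypothetical, the weight matrices are indexed as `[source][destination]`, and by default the hidden layer applies no activation function while the output layer uses `math.tanh`, both of which are choices made for the example.

```python
import math


def elman_step(d, c, w_dz, w_cz, w_zy,
               act_z=lambda s: s, act_y=math.tanh):
    """Compute one time step of a fully connected Elman network.

    d, c  -- input node D values and node C values for this time step
    w_dz  -- weights D -> Z, w_cz -- weights C -> Z, w_zy -- weights Z -> Y
    Returns (z, y, next_c), where next_c is what the C nodes hold next.
    """
    n_hidden = len(w_dz[0])
    z = [act_z(sum(d[i] * w_dz[i][j] for i in range(len(d)))
               + sum(c[k] * w_cz[k][j] for k in range(len(c))))
         for j in range(n_hidden)]
    y = [act_y(sum(z[j] * w_zy[j][m] for j in range(n_hidden)))
         for m in range(len(w_zy[0]))]
    # Feedback connection: each C node stores its corresponding Z value.
    return z, y, list(z)
```

Calling the function repeatedly, feeding each returned `next_c` back in as `c`, shows how the state at the start of a time step carries information from earlier inputs.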

Figure 41 is a block diagram illustrating an example of the layout of data within the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 as it performs the computations associated with the Elman recurrent neural network of Figure 40. In the example of Figure 41, the Elman recurrent neural network of Figure 40 is assumed to have 512 input nodes D, 512 hidden nodes Z, 512 context nodes C, and 512 output nodes Y. It is also assumed that the Elman recurrent neural network is fully connected, i.e., all 512 input nodes D are connected as inputs to each hidden node Z, all 512 context nodes C are connected as inputs to each hidden node Z, and all 512 hidden nodes Z are connected as inputs to each output node Y. In addition, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. Finally, the example assumes that the weights associated with the connections from the context nodes C to the hidden nodes Z all have a value of one, so that there is no need to store these unity weight values.

As shown, the lower 512 columns of the weight random access memory 124 (columns 0 to 511) hold the weight values associated with the connections between the input nodes D and the hidden nodes Z. More specifically, as shown, column 0 holds the weights associated with the input connections from input node D0 to the hidden nodes Z; that is, word 0 holds the weight associated with the connection between input node D0 and hidden node Z0, word 1 holds the weight associated with the connection between input node D0 and hidden node Z1, word 2 holds the weight associated with the connection between input node D0 and hidden node Z2, and so on, and word 511 holds the weight associated with the connection between input node D0 and hidden node Z511; column 1 holds the weights associated with the input connections from input node D1 to the hidden nodes Z; that is, word 0 holds the weight associated with the connection between input node D1 and hidden node Z0, word 1 holds the weight associated with the connection between input node D1 and hidden node Z1, word 2 holds the weight associated with the connection between input node D1 and hidden node Z2, and so on, and word 511 holds the weight associated with the connection between input node D1 and hidden node Z511; and so on through column 511, which holds the weights associated with the input connections from input node D511 to the hidden nodes Z; that is, word 0 holds the weight associated with the connection between input node D511 and hidden node Z0, word 1 holds the weight associated with the connection between input node D511 and hidden node Z1, word 2 holds the weight associated with the connection between input node D511 and hidden node Z2, and so on, and word 511 holds the weight associated with the connection between input node D511 and hidden node Z511. This layout and its use are similar to the embodiment described above with respect to Figures 4 through 6A.

As shown, the next 512 columns of the weight random access memory 124 (columns 512 to 1023) hold, in a similar manner, the weights associated with the connections between the hidden nodes Z and the output nodes Y.
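The weight layout can be summarized by two small lookup helpers. The names are hypothetical. The D-to-Z mapping follows the layout stated above; the Z-to-Y mapping assumes that "in a similar manner" means column 512 + z holds the weights leaving hidden node Z number z, which is an interpretation rather than something spelled out here.

```python
N = 512  # nodes per layer in the Figure 41 example


def dz_weight_location(d_index, z_index):
    """(column, position within the column) of the weight D[d] -> Z[z]."""
    return (d_index, z_index)


def zy_weight_location(z_index, y_index):
    """(column, position within the column) of the weight Z[z] -> Y[y].

    Assumed to mirror the D -> Z layout, offset by N columns.
    """
    return (N + z_index, y_index)
```

For example, the weight for the connection from D511 to Z2 sits at position 2 of column 511, matching the enumeration in the text.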

The data random access memory 122 holds the Elman recurrent neural network node values for a sequence of time steps. More specifically, the data random access memory 122 holds the node values for a given time step in a group of three columns. As shown, taking a data random access memory 122 with 64 columns as an example, the data random access memory 122 can hold the node values for 20 different time steps. In the example of Figure 41, columns 0 to 2 hold the node values for time step 0, columns 3 to 5 hold the node values for time step 1, and so on, and columns 57 to 59 hold the node values for time step 19. The first column of each group holds the input node D values of the time step. The second column of each group holds the hidden node Z values of the time step. The third column of each group holds the output node Y values of the time step. As shown, each row of the data random access memory 122 holds the node values of its corresponding neuron, or neural processing unit 126. That is, row 0 holds the node values associated with nodes D0, Z0 and Y0, whose computations are performed by neural processing unit 0; row 1 holds the node values associated with nodes D1, Z1 and Y1, whose computations are performed by neural processing unit 1; and so on, and row 511 holds the node values associated with nodes D511, Z511 and Y511, whose computations are performed by neural processing unit 511, as described in more detail below with respect to Figure 42.
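The grouping of three columns per time step amounts to simple index arithmetic, sketched below with a hypothetical helper name.

```python
def columns_for_time_step(t):
    """Data random access memory columns holding the node values of time step t.

    Each time step owns three consecutive columns: D, then Z, then Y.
    """
    base = 3 * t
    return {"D": base, "Z": base + 1, "Y": base + 2}
```

Time step 0 maps to columns 0 to 2 and time step 19 to columns 57 to 59, as in the example of Figure 41.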

As indicated in Figure 41, for a given time step, the hidden node Z values in the second column of each three-column group serve as the context node C values of the next time step. That is, the node Z values that a neural processing unit 126 computes and writes during a time step become the node C values that the neural processing unit 126 uses (together with the input node D values of the next time step) to compute the node Z values in the next time step. The initial value of the context nodes C (i.e., the node C values used in time step 0 to compute the node Z values in column 1) is assumed to be zero. This is described in more detail below in the sections on the non-architectural program of Figure 42.

Preferably, the input node D values (the values of columns 0, 3, and so on through column 57 in the example of Figure 41) are written/populated into the data random access memory 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 42. Conversely, the hidden/output node Z/Y values (the values of columns 1 and 2, 4 and 5, and so on through columns 58 and 59 in the example of Figure 41) are written/populated into the data random access memory 122 by the non-architectural program running on the neural network unit 121, and are read/used by the architectural program running on the processor 100 via MFNN instructions 1500. The example of Figure 41 assumes that the architectural program performs the following steps: (1) populates the data random access memory 122 with the input node D values for 20 different time steps (columns 0, 3, and so on through column 57); (2) starts the non-architectural program of Figure 42; (3) detects whether the non-architectural program has completed; (4) reads the output node Y values from the data random access memory 122 (columns 2, 5, and so on through column 59); and (5) repeats steps (1) through (4) as many times as needed to complete a task, e.g., the computations required to recognize an utterance of a mobile phone user.
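Steps (1) through (4) can be sketched as a small driver loop. Everything in this sketch is hypothetical: `FakeNNU` is a software stand-in, its method names are invented for the example (the real accesses are made with MTNN instructions 1400 and MFNN instructions 1500), and its `start_program` merely fills the Y columns with zeros instead of running the program of Figure 42.

```python
class FakeNNU:
    """Software stand-in for the neural network unit (all names hypothetical)."""

    def __init__(self, columns=64):
        self.data_ram = [None] * columns
        self.finished = False

    def write_data_column(self, col, values):   # stands in for MTNN 1400
        self.data_ram[col] = list(values)

    def read_data_column(self, col):            # stands in for MFNN 1500
        return self.data_ram[col]

    def start_program(self, time_steps):
        # Placeholder only: writes zeros where the Y values would appear.
        for t in range(time_steps):
            width = len(self.data_ram[3 * t])
            self.data_ram[3 * t + 2] = [0.0] * width
        self.finished = True

    def program_done(self):
        return self.finished


def run_time_steps(nnu, inputs_per_step):
    steps = len(inputs_per_step)
    for t, d_values in enumerate(inputs_per_step):   # step (1): fill D columns
        nnu.write_data_column(3 * t, d_values)
    nnu.start_program(steps)                         # step (2): start program
    while not nnu.program_done():                    # step (3): wait for completion
        pass
    # step (4): collect the Y column of every time step
    return [nnu.read_data_column(3 * t + 2) for t in range(steps)]
```

Step (5) would simply call `run_time_steps` again with the next batch of sampled inputs until the task is complete.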

In another approach, the architectural program performs the following steps: (1) for a single time step, populates the data random access memory 122 with the input node D values (e.g., column 0); (2) starts the non-architectural program (a modified version of the non-architectural program of Figure 42 that does not loop and accesses only a single group of three columns of the data random access memory 122); (3) detects whether the non-architectural program has completed; (4) reads the output node Y values from the data random access memory 122 (e.g., column 2); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is preferable may depend on the manner in which the input values to the recurrent neural network are sampled. For example, if the task tolerates sampling the input over multiple time steps (e.g., on the order of 20 time steps) and then performing the computations, the first approach may be preferable because it is likely to be more efficient in computing resources and/or to provide higher performance, whereas if the task can only tolerate sampling at a single time step, the second approach may be required.

A third embodiment is similar to the second approach, except that, rather than using a single group of three columns of the data random access memory 122, the non-architectural program uses multiple groups of three columns, i.e., a different group of three columns for each time step, similar to the first approach. In the third embodiment, the architectural program preferably includes a step prior to step (2) in which it updates the non-architectural program before starting it, e.g., by updating the data random access memory 122 column in the instruction at address 1 to point to the next group of three columns.

Figure 42 is a table illustrating a program that is stored in the program memory 129 of the neural network unit 121 and executed by the neural network unit 121, and that uses the data and weights according to the layout of Figure 41 to accomplish an Elman recurrent neural network. Some of the instructions of the non-architectural program of Figure 42 (and of Figures 45, 48, 51, 54 and 57) have been described in detail above (e.g., the multiply-accumulate (MULT-ACCUM), loop (LOOP), and initialize (INITIALIZE) instructions), and the following paragraphs assume those descriptions unless otherwise noted.

The example program of Figure 42 includes 13 non-architectural instructions, at addresses 0 through 12. The instruction at address 0 (INITIALIZE NPU, LOOPCNT=20) clears the accumulator 202 and initializes the loop counter 3804 to a value of 20, so that the loop body (the instructions at addresses 4 to 11) is performed 20 times. Preferably, the initialize instruction also places the neural network unit 121 in a wide configuration, so that the neural network unit 121 is configured as 512 neural processing units 126. As described in the following sections, during execution of the instructions at addresses 1 to 3 and addresses 7 to 11, the 512 neural processing units 126 operate as the 512 corresponding hidden layer nodes Z, and during execution of the instructions at addresses 4 to 6, the 512 neural processing units 126 operate as the 512 corresponding output layer nodes Y.

The instructions at addresses 1 to 3 are outside the loop body of the program and are executed only once. They compute the initial values of the hidden layer nodes Z and write them to column 1 of the data random access memory 122 for use by the first execution of the instructions at addresses 4 to 6 in computing the output layer nodes Y of the first time step (time step 0). In addition, the hidden layer node Z values computed and written to column 1 of the data random access memory 122 by the instructions at addresses 1 to 3 become the context layer node C values used by the first execution of the instructions at addresses 7 and 8 in computing the hidden layer node Z values for the second time step (time step 1).

During execution of the instructions at addresses 1 and 2, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in column 0 of the data random access memory 122 by the weights in the row of columns 0 to 511 of the weight random access memory 124 corresponding to the neural processing unit 126, to generate 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126. During execution of the instruction at address 3, the values of the 512 accumulators 202 of the 512 neural processing units are passed through and written to column 1 of the data random access memory 122. That is, the output instruction at address 3 writes the accumulator 202 value of each of the 512 neural processing units 126 to column 1 of the data random access memory 122, these values being the initial hidden layer Z values, and then the instruction clears the accumulator 202.
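The net arithmetic effect of the instructions at addresses 1 to 3 can be sketched as follows. The sketch models only the result, under the weight layout of Figure 41 as described above; it deliberately ignores the rotator and which data value each unit sees on a given clock cycle, and the function name is hypothetical.

```python
def initial_hidden_values(d_col0, weight_cols):
    """Values written to column 1 by the instructions at addresses 1 to 3.

    d_col0      -- input node D values of column 0 of the data memory
    weight_cols -- weight_cols[r][j] is the weight D[r] -> Z[j]
    Unit j ends with the sum over r of d_col0[r] * weight_cols[r][j].
    """
    n = len(d_col0)
    acc = [0.0] * n                 # accumulators start cleared
    for r in range(n):              # one weight column per accumulate step
        for j in range(n):          # unit j uses position j of that column
            acc[j] += d_col0[r] * weight_cols[r][j]
    return acc                      # passed through unchanged (no activation)
```

In the example of Figure 41, `n` would be 512, so each unit accumulates 512 products before its accumulator value is written out.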

The operations performed by the instructions at addresses 1 to 2 of the non-architectural program of Figure 42 are similar to those performed by the instructions at addresses 1 to 2 of the non-architectural program of Figure 4. More specifically, the instruction at address 1 (MULT_ACCUM DR ROW 0) instructs each of the 512 neural processing units 126 to read into its multiplexed register 208 the corresponding word of column 0 of the data random access memory 122, to read into its multiplexed register 705 the corresponding word of column 0 of the weight random access memory 124, to multiply the data word by the weight word to generate a product, and to add the product to the accumulator 202. The instruction at address 2 (MULT-ACCUM ROTATE, WR ROW+1, COUNT=511) instructs each of the 512 neural processing units 126 to rotate into its multiplexed register 208 the word from the adjacent neural processing unit 126 (using the 512-word rotator formed by the collective operation of the 512 multiplexed registers 208 of the neural network unit 121, which are the registers into which the instruction at address 1 read the column of the data random access memory 122), to read into its multiplexed register 705 the corresponding word of the next column of the weight random access memory 124, to multiply the data word by the weight word to generate a product and add the product to the accumulator 202, and to perform this operation 511 times.

In addition, the single non-architectural output instruction at address 3 of Figure 42 (OUTPUT PASSTHRU, DR OUT ROW 1, CLR ACC) combines the operations of the activation function instruction and the write output instruction at addresses 3 and 4 of Figure 4 (although the program of Figure 42 passes through the accumulator 202 value, whereas the program of Figure 4 performs an activation function on the accumulator 202 value). That is, in the program of Figure 42, the activation function, if any, performed on the accumulator 202 value is specified in the output instruction (and likewise in the output instructions at addresses 6 and 11), rather than in a distinct non-architectural activation function instruction as in the program of Figure 4. An alternative embodiment of the non-architectural program of Figure 4 (and of Figures 20, 26A and 28), in which the operations of the activation function instruction and the write output instruction (e.g., addresses 3 and 4 of Figure 4) are combined into a single non-architectural output instruction as in Figure 42, is also within the scope of the invention. The example of Figure 42 assumes that the nodes of the hidden layer (Z) perform no activation function on the accumulator values. However, embodiments in which the hidden layer (Z) performs an activation function on the accumulator values are also within the scope of the invention, and such embodiments may do so using the instructions at addresses 3 and 11, e.g., a sigmoid, hyperbolic tangent, or rectify function.

Whereas the instructions at addresses 1 to 3 are executed only once, the instructions at addresses 4 to 11 are inside the program loop and are executed the number of times specified by the loop count (e.g., 20). The first 19 executions of the instructions at addresses 7 to 11 compute the hidden layer node Z values and write them to the data random access memory 122 for use by the second through twentieth executions of the instructions at addresses 4 to 6 in computing the output layer nodes Y of the remaining time steps (time steps 1 to 19). (The last/twentieth execution of the instructions at addresses 7 to 11 computes hidden layer node Z values and writes them to column 61 of the data random access memory 122, but these values are not used.)

In the first execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW+1, WR ROW 512 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), which corresponds to time step 0, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in column 1 of the data random access memory 122 (which were generated and written by the single execution of the instructions at addresses 1 to 3) by the weights in the row of columns 512 to 1023 of the weight random access memory 124 corresponding to the neural processing unit 126, to generate 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126. In the first execution of the instruction at address 6 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW+1, CLR ACC), an activation function (e.g., a sigmoid, hyperbolic tangent, or rectify function) is performed on the 512 accumulated values to compute the output layer node Y values, and the results are written to column 2 of the data random access memory 122.

In the second execution of the instructions at addresses 4 and 5 (corresponding to time step 1), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in column 4 of the data random access memory 122 (which were generated and written by the first execution of the instructions at addresses 7 to 11) by the weights in the row of columns 512 to 1023 of the weight random access memory 124 corresponding to the neural processing unit 126, to generate 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126; and in the second execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, and the results are written to column 5 of the data random access memory 122. In the third execution of the instructions at addresses 4 and 5 (corresponding to time step 2), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in column 7 of the data random access memory 122 (which were generated and written by the second execution of the instructions at addresses 7 to 11) by the weights in the row of columns 512 to 1023 of the weight random access memory 124 corresponding to the neural processing unit 126, to generate 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126; and in the third execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, and the results are written to column 8 of the data random access memory 122. And so on, until in the twentieth execution of the instructions at addresses 4 and 5 (corresponding to time step 19), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in column 58 of the data random access memory 122 (which were generated and written by the nineteenth execution of the instructions at addresses 7 to 11) by the weights in the row of columns 512 to 1023 of the weight random access memory 124 corresponding to the neural processing unit 126, to generate 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126; and in the twentieth execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, and the results are written to column 59 of the data random access memory 122.
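The column numbers enumerated above follow a regular pattern, which the sketch below tabulates per pass through the loop. The helper name and dictionary keys are hypothetical; the arithmetic is derived from the columns given in the text for the instructions at addresses 4 to 6 and from the column 61 mentioned for the last execution of the instructions at addresses 7 to 11.

```python
def loop_pass_columns(t):
    """Data memory columns touched in loop pass t (t = 0 for time step 0).

    read_z       -- column of Z values read by the instructions at addresses 4 and 5
    write_y      -- column of Y values written by the instruction at address 6
    write_next_z -- column of Z values written by the instruction at address 11
    """
    return {
        "read_z": 3 * t + 1,
        "write_y": 3 * t + 2,
        "write_next_z": 3 * t + 4,
    }


# Tabulate all 20 passes of the Figure 42 example.
schedule = [loop_pass_columns(t) for t in range(20)]
```

The last entry of `schedule` shows column 58 being read, column 59 being written with Y values, and column 61 receiving Z values that no later pass reads.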

In the first execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 adds to its accumulator 202 the 512 context node C values in column 1 of the data random access memory 122, which were generated by the single execution of the instructions at addresses 1 to 3. More specifically, the instruction at address 7 (ADD_D_ACC DR ROW+0) instructs each of the 512 neural processing units 126 to read into its multiplexed register 208 the corresponding word of the current column of the data random access memory 122 (column 0 during the first execution) and to add the word to the accumulator 202. The instruction at address 8 (ADD_D_ACC ROTATE, COUNT=511) instructs each of the 512 neural processing units 126 to rotate into its multiplexed register 208 the word from the adjacent neural processing unit 126 (using the 512-word rotator formed by the collective operation of the 512 multiplexed registers 208 of the neural network unit 121, which are the registers into which the instruction at address 7 read the column of the data random access memory 122), to add the word to the accumulator 202, and to perform this operation 511 times.
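The add-then-rotate sequence of the instructions at addresses 7 and 8 can be modeled in a few lines. This is a software sketch with a hypothetical function name; the direction of rotation is an assumption, though it does not affect the result because every unit ends up adding every value exactly once.

```python
def add_d_acc_with_rotate(column_values, acc):
    """Model the instructions at addresses 7 and 8 for n units.

    column_values -- the n values read from one data memory column
    acc           -- the n accumulator values before the instructions run
    """
    n = len(column_values)
    regs = list(column_values)                 # address 7: unit j loads value j
    acc = [a + r for a, r in zip(acc, regs)]   # ... and adds it
    for _ in range(n - 1):                     # address 8: COUNT = n - 1
        # Each unit takes its neighbor's value (direction assumed).
        regs = [regs[(j - 1) % n] for j in range(n)]
        acc = [a + r for a, r in zip(acc, regs)]
    return acc
```

With 512 units the loop runs 511 times, matching COUNT=511, and each accumulator grows by the sum of all 512 node C values, which is the effect of unity weights on the C-to-Z connections.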

In the second execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 adds to its accumulator 202 the values of the 512 content nodes C in column 4 of the data random access memory 122, which were produced and written by the first execution of the instructions at addresses 9 to 11; in the third execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 adds to its accumulator 202 the values of the 512 content nodes C in column 7 of the data random access memory 122, which were produced and written by the second execution of the instructions at addresses 9 to 11; and so on, until in the twentieth execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 adds to its accumulator 202 the values of the 512 content nodes C in column 58 of the data random access memory 122, which were produced and written by the nineteenth execution of the instructions at addresses 9 to 11.
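The column numbers quoted for the Figure 42 example follow a three-columns-per-time-step pattern. The short sketch below only illustrates that arithmetic; the function name and dictionary keys are hypothetical labels, not identifiers from the patent.

```python
# Column indices in data random access memory 122 for a layout that uses
# three columns per time step, as in the example the text describes.
# The helper name and keys are illustrative only.

def elman_columns(t):
    """Return the data RAM columns used by time step t (0-based)."""
    base = 3 * t
    return {"D": base, "Z_and_C": base + 1, "Y": base + 2}

# Columns read as content node C values by the successive executions of the
# instructions at addresses 7 and 8 (20 executions in the example).
context_columns = [elman_columns(k)["Z_and_C"] for k in range(20)]
print(context_columns[:3], context_columns[-1])
```

Under this reading, the twentieth execution reads column 58, and time step 19 takes its inputs from column 57 and writes its outputs to column 59, which matches the columns named in the text.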

As noted above, the example of Figure 42 assumes that the weights associated with the connections from the content nodes C to the hidden layer nodes Z all have a value of one. In another embodiment, however, these connections in the Elman recurrent neural network have non-zero weight values, which are placed in the weight random access memory 124 (for example, columns 1024 to 1535) before the program of Figure 42 executes; the program instruction at address 7 is then MULT-ACCUM DR ROW+0, WR ROW 1024, and the program instruction at address 8 is MULT-ACCUM ROTATE, WR ROW+1, COUNT=511. Preferably, the instruction at address 8 does not access the weight random access memory 124, but instead rotates the values that the instruction at address 7 read from the weight random access memory 124 into the multitask buffers 705. Not accessing the weight random access memory 124 during the 511 clock cycles in which the instruction at address 8 executes leaves more bandwidth for the architectural program to access the weight random access memory 124.
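The effect of the add-then-rotate sequence at addresses 7 and 8 can be sketched in a few lines. This is a minimal model under stated assumptions, not the hardware behavior: the rotation direction is assumed, a small width stands in for the 512 neural processing units, and the function name is hypothetical.

```python
import numpy as np

def add_d_acc_with_rotate(column_words, rotate_count):
    """Model ADD_D_ACC DR ROW+0 followed by ADD_D_ACC ROTATE, COUNT=rotate_count."""
    mux_regs = np.array(column_words, dtype=np.int64)  # one word per processing unit
    acc = np.zeros_like(mux_regs)
    acc += mux_regs                        # address 7: each unit adds its own word
    for _ in range(rotate_count):          # address 8: rotate, then add, COUNT times
        mux_regs = np.roll(mux_regs, 1)    # word arrives from the adjacent unit (direction assumed)
        acc += mux_regs
    return acc

n = 8                                      # small stand-in for 512 units
c_values = np.arange(1, n + 1)
acc = add_d_acc_with_rotate(c_values, rotate_count=n - 1)
print(acc)
```

With one initial add and n - 1 rotations, every accumulator ends up holding the sum of all n content node values, which is what unity weights on a fully connected C-to-Z layer require.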

In the first execution of the instructions at addresses 9 and 10 (MULT-ACCUM DR ROW+2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), which corresponds to time step 1, each of the 512 neural processing units 126 performs 512 multiplications, multiplying the values of the 512 input nodes D in column 3 of the data random access memory 122 by the weights in columns 0 to 511 of the weight random access memory 124 that lie in the row corresponding to this neural processing unit 126, to generate 512 products. These products, together with the accumulation performed on the 512 content node C values by the instructions at addresses 7 and 8, are accumulated in the accumulator 202 of the corresponding neural processing unit 126 to compute the hidden layer node Z value. In the first execution of the instruction at address 11 (OUTPUT PASSTHRU, DR OUT ROW+2, CLR ACC), the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to column 4 of the data random access memory 122, and the accumulators 202 are cleared. In the second execution of the instructions at addresses 9 and 10 (corresponding to time step 2), each of the 512 neural processing units 126 performs the same 512 multiplications on the values of the 512 input nodes D in column 6 of the data random access memory 122, again using the weights in columns 0 to 511 of the weight random access memory 124 in the row corresponding to this neural processing unit 126, and accumulates the 512 products, together with the accumulation performed on the 512 content node C values by the instructions at addresses 7 and 8, in its accumulator 202 to compute the hidden layer node Z value; in the second execution of the instruction at address 11, the 512 accumulator 202 values are passed through and written to column 7 of the data random access memory 122, and the accumulators 202 are cleared. And so on, until in the nineteenth execution of the instructions at addresses 9 and 10 (corresponding to time step 19), each of the 512 neural processing units 126 performs the same 512 multiplications on the values of the 512 input nodes D in column 57 of the data random access memory 122 and accumulates the 512 products, together with the accumulation performed on the 512 content node C values by the instructions at addresses 7 and 8, in its accumulator 202 to compute the hidden layer node Z value; in the nineteenth execution of the instruction at address 11, the 512 accumulator 202 values are passed through and written to column 58 of the data random access memory 122, and the accumulators 202 are cleared. As noted above, the hidden layer node Z values produced and written by the twentieth execution of the instructions at addresses 9 and 10 are not used.
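The multiply-accumulate with rotation described above amounts to a dense matrix-vector product, provided the weights are stored in an order that matches the rotation. The sketch below illustrates this under assumptions that are not taken from the patent: the rotation direction, the weight ordering produced by `layout_for_rotator`, and all function names are hypothetical.

```python
import numpy as np

def layout_for_rotator(w):
    """Reorder dense weights w[i, j] (input i to hidden node j) so that entry
    [k, j] is the weight unit j needs on its k-th multiply-accumulate.
    The ordering is an assumption tied to the rotation direction used below."""
    n = w.shape[0]
    ram = np.empty_like(w)
    for k in range(n):
        for j in range(n):
            ram[k, j] = w[(j - k) % n, j]
    return ram

def mult_accum_with_rotate(d_words, weight_ram):
    """Model MULT-ACCUM DR ROW, WR ROW followed by MULT-ACCUM ROTATE, WR ROW+1."""
    n = len(d_words)
    mux = np.array(d_words, dtype=np.float64)
    acc = np.zeros(n)
    for k in range(n):
        acc += mux * weight_ram[k]   # each unit multiplies its current word by its weight
        mux = np.roll(mux, 1)        # pass the word to the adjacent unit (direction assumed)
    return acc

rng = np.random.default_rng(0)
n = 6                                # small stand-in for 512
w = rng.normal(size=(n, n))
d = rng.normal(size=n)
acc = mult_accum_with_rotate(d, layout_for_rotator(w))
```

Each unit performs n multiplications (one on its own word, n - 1 on rotated words), mirroring the 512 multiplications per unit in the text, and `acc` equals `d @ w`.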

The instruction at address 12 (LOOP 4) decrements the loop counter 3804 and, if the new value of the loop counter 3804 is greater than zero, loops back to the instruction at address 4.

Figure 43 is a block diagram showing an example of a Jordan recurrent neural network. The Jordan recurrent neural network of Figure 43 is similar to the Elman recurrent neural network of Figure 40 in that it has input layer nodes/neurons D, hidden layer nodes/neurons Z, output layer nodes/neurons Y, and content layer nodes/neurons C. However, in the Jordan recurrent neural network of Figure 43, the input connections of the content layer nodes C are fed back from the outputs of their corresponding output layer nodes Y, rather than from the outputs of the hidden layer nodes Z as in the Elman recurrent neural network of Figure 40.

For purposes of the present disclosure, a Jordan recurrent neural network is a recurrent neural network that includes at least an input node layer, a hidden node layer, an output node layer, and a content node layer. At the beginning of a given time step, the content node layer holds the results that the output node layer generated in the previous time step and fed back to the content node layer. The results fed back to the content layer may be the results of the activation function, or the results of the accumulation performed by the output node layer before the activation function is applied.

Figure 44 is a block diagram showing an example of the data layout within the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 as it performs the calculations associated with the Jordan recurrent neural network of Figure 43. The example of Figure 44 assumes that the Jordan recurrent neural network of Figure 43 has 512 input nodes D, 512 hidden nodes Z, 512 content nodes C, and 512 output nodes Y. It further assumes that the Jordan recurrent neural network is fully connected, that is, all 512 input nodes D are connected as inputs to each hidden node Z, all 512 content nodes C are connected as inputs to each hidden node Z, and all 512 hidden nodes Z are connected as inputs to each output node Y. Although the example Jordan recurrent neural network of Figure 44 applies an activation function to the accumulator 202 values to generate the output layer node Y values, this example assumes that the accumulator 202 values prior to application of the activation function, rather than the actual output layer node Y values, are passed to the content layer nodes C. In addition, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, for example in a wide configuration. Finally, this example assumes that the weights associated with the connections from the content nodes C to the hidden nodes Z all have a value of one; consequently, there is no need to store these unity weight values.

As in the example of Figure 41, and as shown, the lower 512 columns of the weight random access memory 124 (columns 0 to 511) hold the weight values associated with the connections between the input nodes D and the hidden nodes Z, and the next 512 columns of the weight random access memory 124 (columns 512 to 1023) hold the weight values associated with the connections between the hidden nodes Z and the output nodes Y.

The data random access memory 122 holds the Jordan recurrent neural network node values for a series of time steps, similar to the example of Figure 41; however, in the example of Figure 44 a group of four columns holds the node values for a given time step. As shown, in an embodiment in which the data random access memory 122 has 64 columns, the data random access memory 122 can hold the node values needed for 15 different time steps. In the example of Figure 44, columns 0 to 3 hold the node values for time step 0, columns 4 to 7 hold the node values for time step 1, and so on, with columns 60 to 63 holding the node values for time step 15. The first column of each four-column group holds the input node D values of the time step. The second column of each four-column group holds the hidden node Z values of the time step. The third column of each four-column group holds the content node C values of the time step. The fourth column of each four-column group holds the output node Y values of the time step. As shown, each row of the data random access memory 122 holds the node values of its corresponding neuron, or neural processing unit 126. That is, row 0 holds the node values associated with nodes D0, Z0, C0 and Y0, whose calculations are performed by neural processing unit 0; row 1 holds the node values associated with nodes D1, Z1, C1 and Y1, whose calculations are performed by neural processing unit 1; and so on, with row 511 holding the node values associated with nodes D511, Z511, C511 and Y511, whose calculations are performed by neural processing unit 511. This is described in more detail below with respect to Figure 44.
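The four-columns-per-time-step layout can be expressed as a small index calculation. The sketch below is illustrative only; the function name and dictionary keys are hypothetical labels for the node types named in the text.

```python
# Column indices in data random access memory 122 for the layout that uses
# a group of four columns per time step. The helper name is illustrative.

def jordan_columns(t):
    """Return the data RAM columns that hold the node values of time step t."""
    base = 4 * t
    return {"D": base, "Z": base + 1, "C": base + 2, "Y": base + 3}

for t in (0, 1, 15):
    print(t, jordan_columns(t))
```

For time step 0 this gives columns 0 to 3, for time step 1 columns 4 to 7, and for time step 15 columns 60 to 63, matching the column ranges quoted for Figure 44.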

In Figure 44, the content node C values of a given time step are generated in that time step and used as inputs in the next time step. That is, the node C value that a neural processing unit 126 computes and writes during a time step becomes the node C value that this neural processing unit 126 uses in the next time step to compute the node Z value (together with the input node D value of that next time step). The initial value of the content nodes C (that is, the node C value used in time step 0 to compute the node Z value of column 1) is assumed to be zero. This is described in more detail in the later passages regarding the non-architectural program of Figure 45.

As described above with respect to Figure 41, preferably the input node D values (in the example of Figure 44, the values of columns 0, 4, and so on to column 60) are written/populated into the data random access memory 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 45. Conversely, the hidden node Z/content node C/output node Y values (in the example of Figure 44, the values of columns 1/2/3, 5/6/7, and so on to columns 61/62/63, respectively) are written/populated into the data random access memory 122 by the non-architectural program running on the neural network unit 121, and are read/used by the architectural program running on the processor 100 via MFNN instructions 1500. The example of Figure 44 assumes that the architectural program performs the following steps: (1) for 15 different time steps, populates the data random access memory 122 with the input node D values (columns 0, 4, and so on to column 60); (2) starts the non-architectural program of Figure 45; (3) detects that the non-architectural program has completed; (4) reads the output node Y values from the data random access memory 122 (columns 3, 7, and so on to column 63); and (5) repeats steps (1) to (4) as many times as needed to complete a task, such as the calculations required to recognize the speech of a mobile phone user.
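Steps (1) to (4) can be pictured as a simple driver loop. Everything in the sketch below is a hypothetical stand-in: `FakeNNU` is a toy class, its methods merely play the roles of MTNN writes, MFNN reads and program start, and its placeholder "program" copies inputs to outputs rather than computing a neural network.

```python
class FakeNNU:
    """Toy stand-in for the neural network unit; not the real hardware interface."""

    def __init__(self, columns=64, width=4):
        self.data_ram = [[0] * width for _ in range(columns)]
        self.done = False

    def write_column(self, col, words):      # plays the role of an MTNN write
        self.data_ram[col] = list(words)

    def read_column(self, col):              # plays the role of an MFNN read
        return list(self.data_ram[col])

    def start_program(self):
        # Placeholder program: copy each D column to the matching Y column.
        for t in range(15):
            self.data_ram[4 * t + 3] = list(self.data_ram[4 * t])
        self.done = True


def run_batch(nnu, inputs):
    """Perform steps (1) to (4) once for a batch of per-time-step inputs."""
    nnu.done = False
    for t, words in enumerate(inputs):       # step (1): fill the D columns
        nnu.write_column(4 * t, words)
    nnu.start_program()                      # step (2): start the program
    while not nnu.done:                      # step (3): wait for completion
        pass
    return [nnu.read_column(4 * t + 3) for t in range(len(inputs))]  # step (4)


inputs = [[t, t + 1, t + 2, t + 3] for t in range(15)]
outputs = run_batch(FakeNNU(), inputs)
```

Step (5) would simply call `run_batch` again with the next batch of inputs until the task is complete.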

In an alternative approach, the architectural program performs the following steps: (1) for a single time step, populates the data random access memory 122 with the input node D values (for example, column 0); (2) starts the non-architectural program (a modified version of the non-architectural program of Figure 45 that does not loop and accesses only a single group of four columns of the data random access memory 122); (3) detects that the non-architectural program has completed; (4) reads the output node Y values from the data random access memory 122 (for example, column 3); and (5) repeats steps (1) to (4) as many times as needed to complete the task. Which of the two approaches is preferable may depend on how the input values of the recurrent neural network are sampled. For example, if the task tolerates sampling the input over multiple time steps (for example, on the order of 15 time steps) and then performing the calculations, the first approach is preferable because it is likely to be more efficient in computational resources and/or higher in performance; however, if the task can tolerate sampling at only a single time step, the second approach is required.

A third embodiment is similar to the second approach; however, rather than using a single group of four columns of the data random access memory 122 as the second approach does, the non-architectural program of this approach uses multiple groups of four columns, that is, a different group of four columns for each time step, similar to the first approach. In this third embodiment, preferably the architectural program includes a step before step (2) in which it updates the non-architectural program before starting it, for example by updating the data random access memory 122 column in the instruction at address 1 to point to the next group of four columns.

Figure 45 is a table showing a program stored in the program memory 129 of the neural network unit 121 and executed by the neural network unit 121, which uses data and weights according to the layout of Figure 44 to implement a Jordan recurrent neural network. The non-architectural program of Figure 45 is similar to the non-architectural program of Figure 42; the differences between the two are described in the relevant passages herein.

The example program of Figure 45 includes 14 non-architectural instructions at addresses 0 to 13. The instruction at address 0 is an initialization instruction that clears the accumulator 202 and initializes the loop counter 3804 to a value of 15 so that the loop body (the instructions at addresses 4 to 12) is performed 15 times. Preferably, the initialization instruction also places the neural network unit 121 in a wide configuration so that it is configured as 512 neural processing units 126. As described herein, the 512 neural processing units 126 correspond to and operate as the 512 hidden layer nodes Z during execution of the instructions at addresses 1 to 3 and addresses 8 to 12, and correspond to and operate as the 512 output layer nodes Y during execution of the instructions at addresses 4, 5 and 7.

The instructions at addresses 1 to 5 and address 7 are the same as the instructions at addresses 1 to 6 of Figure 42 and perform the same functions. The instructions at addresses 1 to 3 compute the initial values of the hidden layer nodes Z and write them to column 1 of the data random access memory 122 for use by the first execution of the instructions at addresses 4, 5 and 7, which computes the output layer nodes Y of the first time step (time step 0).

During the first execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the output layer node Y values) are passed through and written to column 2 of the data random access memory 122; these are the content layer node C values produced in the first time step (time step 0) and used in the second time step (time step 1). During the second execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the output layer node Y values) are passed through and written to column 6 of the data random access memory 122; these are the content layer node C values produced in the second time step (time step 1) and used in the third time step (time step 2). And so on, until during the fifteenth execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the output layer node Y values) are passed through and written to column 58 of the data random access memory 122; these are the content layer node C values produced in the fifteenth time step (time step 14), which are read by the instruction at address 8 but are not used.

The instructions at addresses 8 to 12 are approximately the same as the instructions at addresses 7 to 11 of Figure 42 and perform the same functions, with one difference: the instruction at address 8 of Figure 45 (ADD_D_ACC DR ROW+1) increments the data random access memory 122 column by one, whereas the instruction at address 7 of Figure 42 (ADD_D_ACC DR ROW+0) increments it by zero. This difference arises from the difference in data layout within the data random access memory 122; in particular, the four-column group of Figure 44 includes a separate column for the content layer node C values (for example, columns 2, 6, 10, etc.), whereas the three-column group of Figure 41 has no such separate column and instead has the content layer node C values share a column with the hidden layer node Z values (for example, columns 1, 4, 7, etc.). The fifteen executions of the instructions at addresses 8 to 12 compute the hidden layer node Z values and write them to the data random access memory 122 (at columns 5, 9, 13, and so on to column 57) for use by the second through fifteenth executions of the instructions at addresses 4, 5 and 7 to compute the output layer nodes Y of the second through fifteenth time steps (time steps 1 to 14). (The last, fifteenth execution of the instructions at addresses 8 to 12 computes hidden layer node Z values and writes them to column 61 of the data random access memory 122, but these values are not used.)

The loop instruction at address 13 decrements the loop counter 3804 and, if the new value of the loop counter 3804 is greater than zero, loops back to the instruction at address 4.
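Taken together, the program described above computes a Jordan-style recurrence in which the pre-activation output accumulators become the content values of the next time step. The sketch below is a functional summary under stated assumptions: the activation function is not specified in this passage, so `tanh` is a placeholder default; the function name is hypothetical; and the final, unused hidden-node computation is omitted.

```python
import numpy as np

def jordan_forward(d_seq, w_dz, w_zy, activation=np.tanh):
    """Functional summary of the loop, assuming C-to-Z weights of one.

    d_seq: array of shape (steps, n) with the input node D values per time step.
    w_dz, w_zy: dense weights for the D-to-Z and Z-to-Y connections.
    """
    z = d_seq[0] @ w_dz                        # addresses 1-3: initial C assumed zero
    outputs = []
    for t in range(len(d_seq)):
        acc = z @ w_zy                         # addresses 4-5: accumulate for Y
        c = acc.copy()                         # address 6: pass accumulators through as C
        outputs.append(activation(acc))        # address 7: apply activation, write Y
        if t + 1 < len(d_seq):
            # addresses 8-12: every Z node adds the sum of all C values
            # (unity weights) to its weighted sum of the next inputs
            z = c.sum() + d_seq[t + 1] @ w_dz
    return outputs

# Tiny check with identity weights and an identity activation.
d_seq = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
eye = np.eye(3)
ys = jordan_forward(d_seq, eye, eye, activation=lambda x: x)
```

In this tiny case the first output is `[1, 2, 3]`, and because the content sum 6 feeds every hidden node, the second output is `[6, 6, 6]`.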

In another embodiment, the Jordan recurrent neural network is designed so that the content nodes C hold the activation function values of the output nodes Y, that is, the accumulated values after the activation function has been applied. In this embodiment, because the output node Y values are the same as the content node C values, the non-architectural instruction at address 6 is not included in the non-architectural program. This reduces the number of columns used in the data random access memory 122. More precisely, the columns of Figure 44 that hold content node C values (for example, columns 2, 6, 58) are not present in this embodiment. In addition, each time step of this embodiment requires only three columns of the data random access memory 122, which accommodates 20 time steps rather than 15, and the addressing in the instructions of the non-architectural program of Figure 45 is adjusted accordingly.

Long short-term memory (LSTM) cells

The long short-term memory (LSTM) cell is a concept well known in the art of recurrent neural networks. See, for example, Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, Neural Computation, November 15, 1997, Vol. 9, No. 8, Pages 1735-1780; Learning to Forget: Continual Prediction with LSTM, Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins, Neural Computation, October 2000, Vol. 12, No. 10, Pages 2451-2471; both are available from MIT Press Journals. LSTM cells can be constructed in a variety of forms. The LSTM cell 4600 of Figure 46 described below is modeled on the LSTM cell described in the tutorial entitled LSTM Networks for Sentiment Analysis at http://deeplearning.net/tutorial/lstm.html, a copy of which was downloaded on October 19, 2015 (hereinafter the "LSTM tutorial") and is provided in the Information Disclosure Statement of the US application for this case. The LSTM cell 4600 is used to illustrate, in general terms, the ability of the neural network unit 121 embodiments described herein to efficiently perform calculations associated with LSTMs. It should be noted that the neural network unit 121 embodiments, including the embodiment of Figure 49, can efficiently perform calculations associated with LSTM cells other than the one described in Figure 46.

Preferably, the neural network unit 121 can be used to perform calculations for a recurrent neural network that has a layer of LSTM cells connected to other layers. For example, in the LSTM tutorial, the network includes a mean pooling layer that receives the outputs (H) of the LSTM cells of the LSTM layer, and a logistic regression layer that receives the output of the mean pooling layer.

Figure 46 is a block diagram showing an embodiment of an LSTM cell 4600.

As shown, the LSTM cell 4600 includes a memory cell input (X), a memory cell output (H), an input gate (I), an output gate (O), a forget gate (F), a memory cell state (C), and a candidate memory cell state (C'). The input gate (I) gates the passing of the memory cell input (X) to the memory cell state (C), and the output gate (O) gates the passing of the memory cell state (C) to the memory cell output (H). The memory cell state (C) is fed back as the candidate memory cell state (C') of a time step. The forget gate (F) gates the candidate memory cell state (C'), which is fed back and becomes the memory cell state (C) of the next time step.

The embodiment of Figure 46 calculates aforementioned various different numerical value using following equalities:

(1) I=SIGMOID (Wi*X+Ui*H+Bi)

(2) F=SIGMOID (Wf*X+Uf*H+Bf)

(3) C '=TANH (Wc*X+Uc*H+Bc)

(4) C=I*C '+F*C

(5) O=SIGMOID (Wo*X+Uo*H+Bo)

(6) H=O*TANH (C)

Wi and Ui are the weight values associated with the input gate (I), and Bi is the bias value associated with the input gate (I). Wf and Uf are the weight values associated with the forget gate (F), and Bf is the bias value associated with the forget gate (F). Wo and Uo are the weight values associated with the output gate (O), and Bo is the bias value associated with the output gate (O). As shown above, equations (1), (2) and (5) compute the input gate (I), the forget gate (F) and the output gate (O), respectively. Equation (3) computes the candidate memory cell state (C'), and equation (4) computes the new memory cell state (C) from the candidate memory cell state (C'), taking as an input the current memory cell state (C), that is, the memory cell state (C) of the current time step. Equation (6) computes the memory cell output (H). The invention is not limited to these equations, however; embodiments of LSTM cells that compute the input gate, forget gate, output gate, candidate memory cell state, memory cell state and memory cell output in other ways are also contemplated.
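Equations (1) to (6) can be transcribed directly into code. The sketch below is illustrative and makes assumptions not stated in the equations themselves: each cell is given its own scalar weights, so the products are elementwise (a matrix formulation would replace `*` with a matrix product); `h_prev` and `c_prev` are taken to be the output and state of the previous time step; and the function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step for a layer of cells; p maps names such as 'Wi' to arrays."""
    i = sigmoid(p["Wi"] * x + p["Ui"] * h_prev + p["Bi"])       # equation (1)
    f = sigmoid(p["Wf"] * x + p["Uf"] * h_prev + p["Bf"])       # equation (2)
    c_cand = np.tanh(p["Wc"] * x + p["Uc"] * h_prev + p["Bc"])  # equation (3)
    c = i * c_cand + f * c_prev                                 # equation (4)
    o = sigmoid(p["Wo"] * x + p["Uo"] * h_prev + p["Bo"])       # equation (5)
    h = o * np.tanh(c)                                          # equation (6)
    return h, c

# Sanity check with all-zero weights and biases: every gate is 0.5 and the
# candidate state is 0, so the new state is half of the previous state.
cells = 4
names = ["Wi", "Ui", "Bi", "Wf", "Uf", "Bf", "Wc", "Uc", "Bc", "Wo", "Uo", "Bo"]
params = {name: np.zeros(cells) for name in names}
h, c = lstm_step(np.ones(cells), np.zeros(cells), np.full(cells, 2.0), params)
```

With zero parameters, `c` is 1.0 for every cell and `h` is `0.5 * tanh(1.0)`, which gives a quick way to check a transcription of the equations.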

For purposes of the present disclosure, an LSTM cell includes a memory cell input, a memory cell output, a memory cell state, a candidate memory cell state, an input gate, an output gate and a forget gate. For each time step, the input gate, output gate, forget gate and candidate memory cell state are functions of the memory cell input of the current time step, the memory cell output of the previous time step, and the associated weights. The memory cell state of the time step is a function of the memory cell state of the previous time step, the candidate memory cell state, the input gate and the forget gate. In this sense, the memory cell state is fed back and used to compute the memory cell state of the next time step. The memory cell output of the time step is a function of the memory cell state computed for the time step and the output gate. An LSTM neural network is a neural network that has a layer of LSTM cells.

Figure 47 is a block diagram showing an example of the data layout within the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 as it performs the calculations associated with a layer of LSTM cells 4600 of Figure 46. In the example of Figure 47, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, for example in a wide configuration; however, only the values produced by 128 of the neural processing units 126 (namely neural processing units 0 to 127) are used, because the LSTM layer of this example has only 128 LSTM cells 4600.

As shown, rows 0 to 127 of the weight random access memory 124 hold the weight values, bias values and intermediate values for the corresponding neural processing units 0 to 127 of the neural network unit 121. Each of columns 0 to 14 holds 128 of the following respective values, which correspond to equations (1) to (6) above, for provision to neural processing units 0 to 127: Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, C', TANH(C), C, Wo, Uo, Bo. Preferably, the weight and bias values Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo and Bo (at columns 0 to 8 and columns 12 to 14) are written/populated into the weight random access memory 124 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 48. Preferably, the intermediate values C', TANH(C) and C (at columns 9 to 11) are written/populated into the weight random access memory 124, and also read/used, by the non-architectural program running on the neural network unit 121, as described below.
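The column assignment in the weight random access memory 124 can be captured as a lookup table. The sketch below simply enumerates the fifteen values in the order listed in the text; the variable names are illustrative.

```python
# Order of the per-cell values held in columns 0 to 14 of weight RAM 124,
# as listed in the text. Variable names are illustrative only.
WEIGHT_RAM_COLUMNS = [
    "Wi", "Ui", "Bi", "Wf", "Uf", "Bf", "Wc", "Uc", "Bc",
    "C'", "TANH(C)", "C", "Wo", "Uo", "Bo",
]
column_of = {name: idx for idx, name in enumerate(WEIGHT_RAM_COLUMNS)}

# Intermediate values are produced by the program running on the unit itself;
# the remaining columns are filled in beforehand with weights and biases.
intermediate = ("C'", "TANH(C)", "C")
preloaded_columns = sorted(
    idx for name, idx in column_of.items() if name not in intermediate
)
print(preloaded_columns)
```

The preloaded columns come out as 0 to 8 and 12 to 14, and the intermediate values occupy columns 9 to 11, in agreement with the ranges given in the text.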

As shown, the data random access memory 122 holds the input (X), output (H), input gate (I), forget gate (F) and output gate (O) values for a series of time steps. More specifically, a group of five columns holds the X, H, I, F and O values for a given time step. In an embodiment in which the data random access memory 122 has 64 columns, the data random access memory 122 can hold the memory cell values for 12 different time steps, as shown. In the example of Figure 47, columns 0 to 4 hold the memory cell values for time step 0, columns 5 to 9 hold the memory cell values for time step 1, and so on, with columns 55 to 59 holding the memory cell values for time step 11. The first column of each five-column group holds the X values of the time step. The second column of each five-column group holds the H values of the time step. The third column of each five-column group holds the I values of the time step. The fourth column of each five-column group holds the F values of the time step. The fifth column of each five-column group holds the O values of the time step. As shown, each row of the data random access memory 122 holds the values for its corresponding neuron, or neural processing unit 126. That is, row 0 holds the values associated with LSTM cell 0, whose calculations are performed by neural processing unit 0; row 1 holds the values associated with LSTM cell 1, whose calculations are performed by neural processing unit 1; and so on, with row 127 holding the values associated with LSTM cell 127, whose calculations are performed by neural processing unit 127, as described in more detail below with respect to Figure 48.
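The five-columns-per-time-step layout reduces to a simple index calculation. The sketch below is illustrative; the function name is hypothetical, and the 64-column memory depth is the figure quoted in the text.

```python
# Column indices in data random access memory 122 for a layout that uses
# a group of five columns per time step. The helper name is illustrative.

DATA_RAM_COLUMNS = 64   # depth quoted in the text
GROUP = 5               # X, H, I, F, O

def lstm_columns(t):
    """Return the data RAM columns that hold the values of time step t."""
    base = GROUP * t
    return {"X": base, "H": base + 1, "I": base + 2, "F": base + 3, "O": base + 4}

time_steps = DATA_RAM_COLUMNS // GROUP   # whole groups that fit
print(time_steps, lstm_columns(time_steps - 1))
```

Twelve whole groups fit in 64 columns, and the last group (time step 11) occupies columns 55 to 59, matching the text.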

Preferably, the X values (at columns 0, 5, 10, and so on to column 55) are written/populated into the data random access memory 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 48. Preferably, the I values, F values and O values (at columns 2/3/4, 7/8/9, 12/13/14, and so on to columns 57/58/59) are written/populated into the data random access memory 122 by the non-architectural program running on the neural network unit 121, as described below. Preferably, the H values (at columns 1, 6, 11, and so on to column 56) are written/populated into the data random access memory 122, and also read/used, by the non-architectural program running on the neural network unit 121, and are read by the architectural program running on the processor 100 via MFNN instructions 1500.

The example of Figure 47 assumes that the architectural program performs the following steps: (1) for 12 different time steps, fills data random access memory 122 with the input X values (columns 0, 5, and so on to column 55); (2) starts the non-architectural program of Figure 48; (3) detects whether the non-architectural program has finished; (4) reads the output H values from data random access memory 122 (columns 1, 6, and so on to column 56); and (5) repeats steps (1) to (4) as many times as needed to complete the task, for example, the computations required to recognize the speech of a mobile phone user.
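Steps (1) to (4) can be sketched as a driver routine. The sketch below is a toy model only: `StubNNU` and its methods are hypothetical stand-ins for neural network unit 121 and for the MTNN/MFNN instructions, and the stub finishes immediately instead of running a real non-architectural program.

```python
class StubNNU:
    """Hypothetical stand-in for neural network unit 121 (not a real API)."""

    def __init__(self, columns=64, rows=128):
        self.data_ram = [[0.0] * rows for _ in range(columns)]
        self.done = True

    def write_column(self, column, values):   # plays the role of MTNN 1400
        self.data_ram[column] = list(values)

    def read_column(self, column):            # plays the role of MFNN 1500
        return list(self.data_ram[column])

    def start(self):
        # A real unit would run the non-architectural program here.
        self.done = True

    def is_done(self):
        return self.done


def run_batch(nnu, x_batch, x_columns, h_columns):
    """One pass of steps (1) to (4); the caller repeats it for step (5)."""
    for column, x_values in zip(x_columns, x_batch):   # step (1)
        nnu.write_column(column, x_values)
    nnu.start()                                        # step (2)
    while not nnu.is_done():                           # step (3)
        pass
    return [nnu.read_column(c) for c in h_columns]     # step (4)
```

The X and H column lists are passed in by the caller so that the sketch does not commit to a particular layout; for the Figure 47 arrangement the X columns would be `range(0, 60, 5)`.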

In an alternative approach, the architectural program performs the following steps: (1) for a single time step, fills data random access memory 122 with the input X values (e.g., column 0); (2) starts the non-architectural program (a modified version of the Figure 48 non-architectural program that does not loop and accesses only a single group of five columns of data random access memory 122); (3) detects whether the non-architectural program has finished; (4) reads the output H values from data random access memory 122 (e.g., column 1); and (5) repeats steps (1) to (4) as many times as needed to complete the task. Which of the two approaches is preferable depends on how the input X values to the long short-term memory (LSTM) layer are sampled. For example, if the task allows the input to be sampled over multiple time steps (e.g., about 12 time steps) before the computation is performed, the first approach is preferable, because it is likely to use computing resources more efficiently and/or to perform better; however, if the task allows sampling at only a single time step, the second approach is required.

A third embodiment is similar to the second approach. However, rather than using a single group of five columns of data random access memory 122 as the second approach does, the non-architectural program of this approach uses multiple groups of five columns, that is, a different group of five columns for each time step, which is similar to the first approach. In this third embodiment, preferably, the architectural program includes a step before step (2) in which it updates the non-architectural program before starting it, for example, by updating the data random access memory 122 column in the instruction at address 0 to point to the next group of five columns.

Figure 48 is a table showing a program stored in program memory 129 of neural network unit 121. The program is executed by neural network unit 121 and uses the data and weights according to the configuration of Figure 47 to accomplish the computations associated with a long short-term memory (LSTM) cell layer. The example program of Figure 48 includes 24 non-architectural instructions at addresses 0 to 23. The instruction at address 0 (INITIALIZE NPU, CLR ACC, LOOPCNT=12, DR IN ROW=-1, DR OUT ROW=2) clears accumulator 202 and initializes loop counter 3804 to the value 12, so that the loop body (the instructions at addresses 1 to 22) is executed 12 times. The initialize instruction also initializes the data random access memory 122 column to be read to the value -1, which is incremented to zero by the first execution of the instruction at address 1. The initialize instruction also initializes the data random access memory 122 column to be written (e.g., register 2606 of Figure 26 and Figure 39) to column 2. Preferably, the initialize instruction also places neural network unit 121 in a wide configuration, so that neural network unit 121 is configured with 512 neural processing units 126. As described in the following paragraphs, during execution of the instructions at addresses 0 to 23, 128 of the 512 neural processing units 126 correspond to, and operate as, the 128 LSTM cells 4600.

During the first execution of the instructions at addresses 1 to 4, each of the 128 neural processing units 126 (i.e., neural processing units 0 to 127) computes the input gate (I) value of its corresponding long short-term memory (LSTM) cell 4600 for the first time step (time step 0) and writes the I value to the corresponding word of column 2 of data random access memory 122; during the second execution of the instructions at addresses 1 to 4, each of the 128 neural processing units 126 computes the I value of its corresponding LSTM cell 4600 for the second time step (time step 1) and writes the I value to the corresponding word of column 7 of data random access memory 122; and so on, until during the twelfth execution of the instructions at addresses 1 to 4, each of the 128 neural processing units 126 computes the I value of its corresponding LSTM cell 4600 for the twelfth time step (time step 11) and writes the I value to the corresponding word of column 57 of data random access memory 122, as shown in Figure 47.

More specifically, the multiply-accumulate instruction at address 1 reads the column of data random access memory 122 that follows the current column (column 0 on the first execution, column 5 on the second execution, and so on, to column 55 on the twelfth execution), which contains the memory cell input (X) values associated with the current time step; it also reads column 0 of weight random access memory 124, which contains the Wi values, and multiplies the values read to generate a first product that is accumulated into accumulator 202, which has just been cleared by the initialize instruction at address 0 or by the instruction at address 22. Next, the multiply-accumulate instruction at address 2 reads the next column of data random access memory 122 (column 1 on the first execution, column 6 on the second execution, and so on, to column 56 on the twelfth execution), which contains the memory cell output (H) values associated with the current time step; it also reads column 1 of weight random access memory 124, which contains the Ui values, and multiplies the values read to generate a second product that is accumulated into accumulator 202. The H values associated with the current time step, which the instruction at address 2 (and the instructions at addresses 6, 10 and 18) reads from data random access memory 122, were generated during the previous time step and written to data random access memory 122 by the output instruction at address 22; on the first execution, however, the instruction at address 2 reads column 1 of the data random access memory, which has been written with initial values, as the H values. Preferably, the architectural program writes the initial H values to column 1 of data random access memory 122 (e.g., using MTNN instructions 1400) before starting the non-architectural program of Figure 48; however, the invention is not limited in this respect, and other embodiments in which the non-architectural program includes initialize instructions that write the initial H values to column 1 of data random access memory 122 are also within the scope of the invention. In one embodiment, the initial H value is zero. Next, the add-weight-word-to-accumulator instruction at address 3 (ADD_W_ACC WR ROW 2) reads column 2 of weight random access memory 124, which contains the Bi values, and adds them to accumulator 202. Finally, the output instruction at address 4 (OUTPUT SIGMOID, DR OUT ROW+0, CLR ACC) performs a sigmoid activation function on the accumulator 202 value, writes the result to the current output column of data random access memory 122 (column 2 on the first execution, column 7 on the second execution, and so on, to column 57 on the twelfth execution), and clears accumulator 202.
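The four-instruction sequence at addresses 1 to 4 amounts, for each neural processing unit, to two multiply-accumulates, a bias add and an activation. A minimal per-word sketch follows; the function names are hypothetical, floating-point arithmetic is assumed for readability, and the hardware's actual number format is not modeled.

```python
import math

def sigmoid(v: float) -> float:
    """Logistic sigmoid, standing in for the OUTPUT SIGMOID activation."""
    return 1.0 / (1.0 + math.exp(-v))

def gate_for_word(x, h, w, u, b, activation=sigmoid):
    """Mirror addresses 1 to 4 for a single word (one neural processing unit).

    x, h: the X and H words read from the data RAM
    w, u, b: the W, U and B words read from the weight RAM
    """
    acc = 0.0                 # accumulator 202, just cleared
    acc += x * w              # address 1: multiply-accumulate with W
    acc += h * u              # address 2: multiply-accumulate with U
    acc += b                  # address 3: add the bias word
    return activation(acc)    # address 4: activation, then write and clear
```

Passing `math.tanh` as `activation` gives the variant the text describes for the candidate memory cell state at addresses 9 to 12.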

During the first execution of the instructions at addresses 5 to 8, each of the 128 neural processing units 126 computes the forget gate (F) value of its corresponding long short-term memory (LSTM) cell 4600 for the first time step (time step 0) and writes the F value to the corresponding word of column 3 of data random access memory 122; during the second execution of the instructions at addresses 5 to 8, each of the 128 neural processing units 126 computes the F value of its corresponding LSTM cell 4600 for the second time step (time step 1) and writes the F value to the corresponding word of column 8 of data random access memory 122; and so on, until during the twelfth execution of the instructions at addresses 5 to 8, each of the 128 neural processing units 126 computes the F value of its corresponding LSTM cell 4600 for the twelfth time step (time step 11) and writes the F value to the corresponding word of column 58 of data random access memory 122, as shown in Figure 47. The instructions at addresses 5 to 8 compute the F value in a manner similar to the instructions at addresses 1 to 4 described above, except that the instructions at addresses 5 to 7 read the Wf, Uf and Bf values from columns 3, 4 and 5, respectively, of weight random access memory 124 to perform the multiply and/or add operations.

During each of the twelve executions of the instructions at addresses 9 to 12, each of the 128 neural processing units 126 computes the candidate memory cell state (C') value of its corresponding long short-term memory (LSTM) cell 4600 for the corresponding time step and writes the C' value to the corresponding word of column 9 of weight random access memory 124. The instructions at addresses 9 to 12 compute the C' value in a manner similar to the instructions at addresses 1 to 4 described above, except that the instructions at addresses 9 to 11 read the Wc, Uc and Bc values from columns 6, 7 and 8, respectively, of weight random access memory 124 to perform the multiply and/or add operations. In addition, the output instruction at address 12 performs a tanh activation function rather than the sigmoid activation function performed by the output instruction at address 4.

More specifically, the multiply-accumulate instruction at address 9 reads the current column of data random access memory 122 (column 0 on the first execution, column 5 on the second execution, and so on, to column 55 on the twelfth execution), which contains the memory cell input (X) values associated with the current time step; it also reads column 6 of weight random access memory 124, which contains the Wc values, and multiplies the values read to generate a first product that is accumulated into accumulator 202, which has just been cleared by the instruction at address 8. Next, the multiply-accumulate instruction at address 10 reads the next column of data random access memory 122 (column 1 on the first execution, column 6 on the second execution, and so on, to column 56 on the twelfth execution), which contains the memory cell output (H) values associated with the current time step; it also reads column 7 of weight random access memory 124, which contains the Uc values, and multiplies the values read to generate a second product that is accumulated into accumulator 202. Next, the add-weight-word-to-accumulator instruction at address 11 reads column 8 of weight random access memory 124, which contains the Bc values, and adds them to accumulator 202. Finally, the output instruction at address 12 (OUTPUT TANH, WR OUT ROW 9, CLR ACC) performs a tanh activation function on the accumulator 202 value, writes the result to column 9 of weight random access memory 124, and clears accumulator 202.

During each of the twelve executions of the instructions at addresses 13 to 16, each of the 128 neural processing units 126 computes the new memory cell state (C) value of its corresponding long short-term memory (LSTM) cell 4600 for the corresponding time step and writes the new C value to the corresponding word of column 11 of weight random access memory 124; each neural processing unit 126 also computes tanh(C) and writes it to the corresponding word of column 10 of weight random access memory 124. More specifically, the multiply-accumulate instruction at address 13 reads the column of data random access memory 122 that follows the current column (column 2 on the first execution, column 7 on the second execution, and so on, to column 57 on the twelfth execution), which contains the input gate (I) values associated with the current time step; it also reads column 9 of weight random access memory 124, which contains the candidate memory cell state (C') values (just written by the instruction at address 12), and multiplies the values read to generate a first product that is accumulated into accumulator 202, which has just been cleared by the instruction at address 12. Next, the multiply-accumulate instruction at address 14 reads the next column of data random access memory 122 (column 3 on the first execution, column 8 on the second execution, and so on, to column 58 on the twelfth execution), which contains the forget gate (F) values associated with the current time step; it also reads column 11 of weight random access memory 124, which contains the current memory cell state (C) values computed during the previous time step (written by the most recent execution of the instruction at address 15), and multiplies the values read to generate a second product that is added into accumulator 202. Next, the output instruction at address 15 (OUTPUT PASSTHRU, WR OUT ROW 11) passes the accumulator 202 value through and writes it to column 11 of weight random access memory 124. It should be understood that the C value that the instruction at address 14 reads from column 11 of weight random access memory 124 is the C value generated and written by the most recent execution of the instructions at addresses 13 to 15. The output instruction at address 15 does not clear accumulator 202, so that its value can be used by the instruction at address 16. Finally, the output instruction at address 16 (OUTPUT TANH, WR OUT ROW 10, CLR ACC) performs a tanh activation function on the accumulator 202 value and writes the result to column 10 of weight random access memory 124, for use by the instruction at address 21 in computing the memory cell output (H) value. The instruction at address 16 clears accumulator 202.
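Written per word (that is, per neural processing unit), the sequence at addresses 13 to 16 corresponds to the update below. The subscript t for the time step and the symbol T_t for the stored tanh value are notation introduced here for illustration; the text itself names only C, C', I and F.

```latex
\begin{aligned}
C_t &= I_t \, C'_t + F_t \, C_{t-1}
  && \text{(addresses 13 to 15; written to column 11 of the weight RAM)} \\
T_t &= \tanh(C_t)
  && \text{(address 16; written to column 10 of the weight RAM)}
\end{aligned}
```

Because the instruction at address 15 leaves the accumulator intact, the tanh at address 16 operates on the same sum that was just stored as C_t.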

During the first execution of the instructions at addresses 17 to 20, each of the 128 neural processing units 126 computes the output gate (O) value of its corresponding long short-term memory (LSTM) cell 4600 for the first time step (time step 0) and writes the O value to the corresponding word of column 4 of data random access memory 122; during the second execution of the instructions at addresses 17 to 20, each of the 128 neural processing units 126 computes the O value of its corresponding LSTM cell 4600 for the second time step (time step 1) and writes the O value to the corresponding word of column 9 of data random access memory 122; and so on, until during the twelfth execution of the instructions at addresses 17 to 20, each of the 128 neural processing units 126 computes the O value of its corresponding LSTM cell 4600 for the twelfth time step (time step 11) and writes the O value to the corresponding word of column 59 of data random access memory 122, as shown in Figure 47. The instructions at addresses 17 to 20 compute the O value in a manner similar to the instructions at addresses 1 to 4 described above, except that the instructions at addresses 17 to 19 read the Wo, Uo and Bo values from columns 12, 13 and 14, respectively, of weight random access memory 124 to perform the multiply and/or add operations.

During the first execution of the instructions at addresses 21 to 22, each of the 128 neural processing units 126 computes the memory cell output (H) value of its corresponding long short-term memory (LSTM) cell 4600 for the first time step (time step 0) and writes the H value to the corresponding word of column 6 of data random access memory 122; during the second execution of the instructions at addresses 21 to 22, each of the 128 neural processing units 126 computes the H value of its corresponding LSTM cell 4600 for the second time step (time step 1) and writes the H value to the corresponding word of column 11 of data random access memory 122; and so on, until during the twelfth execution of the instructions at addresses 21 to 22, each of the 128 neural processing units 126 computes the H value of its corresponding LSTM cell 4600 for the twelfth time step (time step 11) and writes the H value to the corresponding word of column 61 of data random access memory 122, as shown in Figure 47.

More specifically, the multiply-accumulate instruction at address 21 reads the third column after the current column of data random access memory 122 (column 4 on the first execution, column 9 on the second execution, and so on, to column 59 on the twelfth execution), which contains the output gate (O) values associated with the current time step; it also reads column 10 of weight random access memory 124, which contains the tanh(C) values (written by the instruction at address 16), and multiplies the values read to generate a product that is accumulated into accumulator 202, which has just been cleared by the instruction at address 20. Then, the output instruction at address 22 passes the accumulator 202 value through, writes it to the second column after the current output column of data random access memory 122 (column 6 on the first execution, column 11 on the second execution, and so on, to column 61 on the twelfth execution), and clears accumulator 202. It should be understood that the H values written to a column of data random access memory 122 by the instruction at address 22 (i.e., column 6 on the first execution, column 11 on the second execution, and so on, to column 61 on the twelfth execution) are the H values consumed/read by the subsequent executions of the instructions at addresses 2, 6, 10 and 18. However, the H values written to column 61 on the twelfth execution are not consumed/read by an execution of the instructions at addresses 2, 6, 10 and 18; in a preferred embodiment, these values are consumed/read by the architectural program.

The instruction at address 23 (LOOP 1) decrements loop counter 3804 and loops back to the instruction at address 1 if the new value of loop counter 3804 is greater than zero.
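To make the data flow between columns easier to follow, the loop of addresses 1 to 23 can be simulated for a single row (one memory cell). The sketch below is a simplified model, not the patent's implementation: it assumes floating-point scalars, a sigmoid for the O gate (the text says only that O is computed like I), and hypothetical names throughout. The "column" indices of the text are used as dictionary keys.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def run_lstm_row(xs, weights, h0=0.0, c0=0.0):
    """Simulate one row of the data RAM over len(xs) time steps.

    weights: scalars keyed "Wi", "Ui", "Bi", "Wf", ..., "Bo" (assumed names).
    Returns {data-RAM column: value} for this row.
    """
    dram = {1: h0}                       # initial H in column 1
    for t, x in enumerate(xs):
        dram[5 * t] = x                  # X columns, filled before the start
    c = c0                               # models weight-RAM column 11
    for t in range(len(xs)):             # LOOPCNT iterations
        base = 5 * t
        x, h = dram[base], dram[base + 1]

        def affine(g):                   # two multiply-accumulates plus bias
            return x * weights["W" + g] + h * weights["U" + g] + weights["B" + g]

        dram[base + 2] = sigmoid(affine("i"))            # addresses 1 to 4
        dram[base + 3] = sigmoid(affine("f"))            # addresses 5 to 8
        c_candidate = math.tanh(affine("c"))             # addresses 9 to 12
        c = dram[base + 2] * c_candidate + dram[base + 3] * c   # 13 to 15
        tanh_c = math.tanh(c)                            # address 16
        dram[base + 4] = sigmoid(affine("o"))            # addresses 17 to 20
        dram[base + 6] = dram[base + 4] * tanh_c         # 21 to 22: next H slot
    return dram
```

Running it with 12 inputs shows the H value of each time step landing in the H slot of the following five-column group, so the last one falls in column 61, outside the twelve groups that hold X.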

Figure 49 is a block diagram showing an embodiment of neural network unit 121 that has output buffer masking and feedback capability within neural processing unit groups. Figure 49 shows a single neural processing unit group 4901 made up of four neural processing units 126. Although Figure 49 shows only a single neural processing unit group 4901, it should be understood that each neural processing unit 126 of neural network unit 121 is included in a neural processing unit group 4901, so that there are N/J neural processing unit groups 4901 in total, where N is the number of neural processing units 126 (e.g., 512 in a wide configuration, or 1024 in a narrow configuration) and J is the number of neural processing units 126 in a single group 4901 (e.g., four in the embodiment of Figure 49). In Figure 49, the four neural processing units 126 of neural processing unit group 4901 are referred to as neural processing unit 0, neural processing unit 1, neural processing unit 2 and neural processing unit 3.
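The N/J relationship can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the requirement that N be an exact multiple of J is an assumption of this sketch rather than a statement from the text.

```python
def group_count(num_npus: int, group_size: int = 4) -> int:
    """Number of neural processing unit groups, N / J."""
    if num_npus % group_size:
        raise ValueError("N is assumed to be a multiple of J in this sketch")
    return num_npus // group_size

def group_of(npu_index: int, group_size: int = 4):
    """Return (group index, position within the group) for one unit."""
    return divmod(npu_index, group_size)
```

With J equal to four, the wide configuration of 512 units yields 128 groups, which matches the 128 memory cells used in the later example.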

Each neural processing unit in the embodiment of Figure 49 is similar to neural processing unit 126 of Figure 7 described above, and like-numbered elements in the figures are similar. However, multiplexed register 208 is modified to include four additional inputs 4905, and multiplexed register 705 is modified to include four additional inputs 4907; selection input 213 is modified to select from among the original inputs 211 and 207 and the additional inputs 4905 for provision on output 209, and selection input 713 is modified to select from among the original inputs 711 and 206 and the additional inputs 4907 for provision on output 203.

As shown in the figure, column buffer 1104 of Figure 11 is referred to as output buffer 1104 in Figure 49. More specifically, words 0, 1, 2 and 3 of output buffer 1104 shown in the figure receive the corresponding outputs of the four activation function units 212 associated with neural processing units 0, 1, 2 and 3. The portion of output buffer 1104 that comprises the N words corresponding to a neural processing unit group 4901 is referred to as an output buffer word group. In the embodiment of Figure 49, N is four (here N denotes the number of words in a group). These four words of output buffer 1104 are fed back to multiplexed registers 208 and 705, where they are received as the four additional inputs 4905 by multiplexed register 208 and as the four additional inputs 4907 by multiplexed register 705. The feedback of an output buffer word group to its corresponding neural processing unit group 4901 enables an arithmetic instruction of the non-architectural program to select as its inputs one or two words from among the words of output buffer 1104 associated with the neural processing unit group 4901 (i.e., the output buffer word group); for examples, see the non-architectural program of Figure 51 described below, such as the instructions at addresses 4, 8, 11, 12 and 15. That is, the output buffer 1104 word specified in the non-architectural instruction determines the value generated on selection input 213/713. This capability effectively allows output buffer 1104 to serve as a kind of scratchpad memory, which enables the non-architectural program to reduce the number of writes to, and subsequent reads from, data random access memory 122 and/or weight random access memory 124, for example, for values that are generated and used as intermediates. Preferably, output buffer 1104, or column buffer 1104, comprises a one-dimensional array of registers that stores 1024 narrow words or 512 wide words. Preferably, a read of output buffer 1104 can be performed within a single clock cycle, and a write to output buffer 1104 can also be performed within a single clock cycle. Unlike data random access memory 122 and weight random access memory 124, which are accessible by both the architectural program and the non-architectural program, output buffer 1104 is not accessible by the architectural program and is accessible only by the non-architectural program.
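The feedback path can be modeled as a selector that chooses among the original inputs of a multiplexed register and the four words fed back from the output buffer word group. The sketch below is only a behavioral model: the function name is hypothetical, and the numbering of the select value is an assumption, since the text does not give the encoding of selection inputs 213/713.

```python
def mux_select(original_inputs, feedback_words, select):
    """Return the operand chosen by a multiplexed register.

    original_inputs: the inputs present before the modification
    feedback_words:  the words of the output buffer word group
    select:          index into the combined list (assumed encoding)
    """
    candidates = list(original_inputs) + list(feedback_words)
    if not 0 <= select < len(candidates):
        raise IndexError("select value out of range")
    return candidates[select]
```

With two original inputs and a four-word group there are six possible operands, which is how an arithmetic instruction can take a previously computed intermediate directly from the output buffer instead of re-reading one of the memories.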

Output buffer 1104 is modified to receive a mask input 4903. Preferably, mask input 4903 includes four bits corresponding to the four words of output buffer 1104 that are associated with the four neural processing units 126 of neural processing unit group 4901. Preferably, if the mask input 4903 bit corresponding to a word of output buffer 1104 is true, the word of output buffer 1104 retains its current value; otherwise, the word of output buffer 1104 is updated with the output of activation function unit 212. That is, if the mask input 4903 bit corresponding to a word of output buffer 1104 is false, the output of activation function unit 212 is written to the word of output buffer 1104. In this way, an output instruction of the non-architectural program can selectively write the outputs of activation function units 212 to some words of output buffer 1104 while leaving the current values of the other words of output buffer 1104 unchanged; for examples, see the instructions of the non-architectural program of Figure 51 described below, such as the instructions at addresses 6, 10, 13 and 14. That is, the output buffer 1104 words specified in the non-architectural instruction determine the value generated on mask input 4903.
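The masking rule (a true mask bit keeps the current word, a false mask bit takes the activation function unit output) can be sketched for one output buffer word group as follows. The function name and the list-based representation are illustrative assumptions.

```python
def write_output_buffer_group(words, afu_outputs, mask):
    """Return the new contents of one output buffer word group.

    mask[i] true  -> word i keeps its current value
    mask[i] false -> word i is overwritten with afu_outputs[i]
    """
    if not (len(words) == len(afu_outputs) == len(mask)):
        raise ValueError("words, afu_outputs and mask must have equal length")
    return [old if keep else new
            for old, new, keep in zip(words, afu_outputs, mask)]
```

For example, a mask of `[True, False, True, False]` updates only words 1 and 3 of the group and leaves words 0 and 2 as they were.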

For simplicity, Figure 49 does not show input 1811 of multiplexed registers 208/705 (shown in Figure 18, Figure 19 and Figure 23). However, embodiments that support both dynamically configurable neural processing units 126 and the feedback/masking of output buffer 1104 are also within the scope of the invention. Preferably, in such embodiments, the output buffer word groups are correspondingly dynamically configurable.

It should be understood that although the number of neural processing units 126 in a neural processing unit group 4901 of this embodiment is four, the invention is not limited in this respect, and embodiments with more or fewer neural processing units 126 per group are within the scope of the invention. In addition, in an embodiment that has shared activation function units 1112, as shown in Figure 52, the number of neural processing units 126 in a neural processing unit group 4901 and the number of neural processing units 126 in an activation function unit 212 group may have a synergistic relationship. The masking and feedback capability of output buffer 1104 within neural processing unit groups is particularly beneficial for improving the efficiency of the computations associated with long short-term memory (LSTM) cells 4600, as described in detail below with respect to Figure 50 and Figure 51.

Figure 50 is a block diagram showing an example of the layout of data within data random access memory 122, weight random access memory 124 and output buffer 1104 of the neural network unit 121 of Figure 49 as it performs the computations associated with the layer of 128 long short-term memory (LSTM) cells 4600 of Figure 46. In the example of Figure 50, neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. As in the examples of Figure 47 and Figure 48, the LSTM layer in the example of Figure 50 and Figure 51 has only 128 LSTM cells 4600. However, in the example of Figure 50, the values generated by all 512 neural processing units 126 (e.g., neural processing units 0 to 511) are used. When executing the non-architectural program of Figure 51, each neural processing unit group 4901 collectively operates as one LSTM cell 4600.

As shown in the figure, data random access memory 122 holds the memory cell input (X) and output (H) values for a series of time steps. More specifically, for a given time step, a pair of two memory columns holds the X values and the H values, respectively. Taking a data random access memory 122 with 64 columns as an example, as shown in the figure, the data random access memory 122 holds the memory cell values for 31 different time steps. In the example of Figure 50, columns 2 and 3 hold the values for time step 0, columns 4 and 5 hold the values for time step 1, and so on, with columns 62 and 63 holding the values for time step 30. The first column of each pair holds the X values of the time step, and the second column holds the H values of the time step. As shown in the figure, each group of four rows of data random access memory 122 that corresponds to a neural processing unit group 4901 holds the values used by its corresponding long short-term memory (LSTM) cell 4600. That is, rows 0 to 3 hold the values associated with LSTM cell 0, whose computations are performed by neural processing units 0 to 3, i.e., neural processing unit group 0; rows 4 to 7 hold the values associated with LSTM cell 1, whose computations are performed by neural processing units 4 to 7, i.e., neural processing unit group 1; and so on, with rows 508 to 511 holding the values associated with LSTM cell 127, whose computations are performed by neural processing units 508 to 511, i.e., neural processing unit group 127, as described in detail below with respect to Figure 51. As shown in the figure, column 1 is unused, and column 0 holds the initial memory cell output (H) values, which in a preferred embodiment are filled with zeros by the architectural program; however, the invention is not limited in this respect, and embodiments in which instructions of the non-architectural program fill column 0 with the initial memory cell output (H) values are also within the scope of the invention.
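The two-column pairing of Figure 50 can likewise be expressed as index arithmetic. As before, this is only a sketch of the layout described in the text, with hypothetical names.

```python
NUM_COLUMNS = 64        # the 64-column data RAM example
FIRST_PAIR_COLUMN = 2   # column 0 holds the initial H, column 1 is unused
GROUP_SIZE = 4          # rows (neural processing units) per memory cell

def xh_columns(time_step: int):
    """Return (X column, H column) for `time_step` in the Figure 50 layout."""
    x_col = FIRST_PAIR_COLUMN + 2 * time_step
    if not 0 <= time_step or x_col + 1 >= NUM_COLUMNS:
        raise ValueError("time step does not fit in the data RAM")
    return x_col, x_col + 1

def rows_for_cell(cell: int):
    """Return the four rows that belong to memory cell `cell`."""
    return list(range(GROUP_SIZE * cell, GROUP_SIZE * cell + GROUP_SIZE))

# Pairs available after reserving columns 0 and 1.
MAX_TIME_STEPS = (NUM_COLUMNS - FIRST_PAIR_COLUMN) // 2
```

Reserving two columns and using two per time step leaves room for 31 time steps, compared with 12 under the five-column grouping of Figure 47.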

Preferably, the X values (in columns 2, 4, 6, and so on to column 62) are written/filled into data random access memory 122 by the architectural program executing on processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program executing on neural network unit 121, such as the non-architectural program of Figure 51. Preferably, the H values (in columns 3, 5, 7, and so on to column 63) are written/filled into data random access memory 122 and read/used by the non-architectural program executing on neural network unit 121, as described in detail below. Preferably, the H values are also read by the architectural program executing on processor 100 via MFNN instructions 1500. It should be noted that the non-architectural program of Figure 51 assumes that, within each group of four rows corresponding to a neural processing unit group 4901 (e.g., rows 0 to 3, rows 4 to 7, rows 8 to 11, and so on to rows 508 to 511), the four X values of a given column are filled with the same value (e.g., by the architectural program). Similarly, within each group of four rows corresponding to a neural processing unit group 4901, the non-architectural program of Figure 51 computes and writes the same value to the four H values of a given column.

As shown in the figure, weight random access memory 124 holds the weight, bias and memory cell state (C) values needed by the neural processing units of neural network unit 121. Within each group of four rows corresponding to a neural processing unit group 4901 (e.g., rows 0 to 3, rows 4 to 7, rows 8 to 11, and so on to rows 508 to 511): (1) the row whose row number modulo 4 equals 3 holds the Wc, Uc, Bc and C values in its columns 0, 1, 2 and 6, respectively; (2) the row whose row number modulo 4 equals 2 holds the Wo, Uo and Bo values in its columns 3, 4 and 5, respectively; (3) the row whose row number modulo 4 equals 1 holds the Wf, Uf and Bf values in its columns 3, 4 and 5, respectively; and (4) the row whose row number modulo 4 equals 0 holds the Wi, Ui and Bi values in its columns 3, 4 and 5, respectively. Preferably, these weight and bias values (Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo and Bo, in columns 0 to 5) are written/filled into weight random access memory 124 by the architectural program executing on processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program executing on neural network unit 121, such as the non-architectural program of Figure 51. Preferably, the intermediate C values are written/filled into weight random access memory 124 and read/used by the non-architectural program executing on neural network unit 121, as described in detail below.
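The four placement rules can be captured in a lookup table keyed by the row number modulo 4. The table and function below are an illustrative reading of the text, with hypothetical names, and use the document's own "row" and "column" terminology.

```python
# Remainder of (row number mod 4) -> {value name: weight-RAM column}.
WEIGHT_LAYOUT = {
    0: {"Wi": 3, "Ui": 4, "Bi": 5},
    1: {"Wf": 3, "Uf": 4, "Bf": 5},
    2: {"Wo": 3, "Uo": 4, "Bo": 5},
    3: {"Wc": 0, "Uc": 1, "Bc": 2, "C": 6},
}

def weight_location(cell: int, name: str):
    """Return (row, column) of value `name` for memory cell `cell`."""
    if not 0 <= cell < 128:
        raise ValueError("the example has 128 memory cells")
    for remainder, columns in WEIGHT_LAYOUT.items():
        if name in columns:
            return 4 * cell + remainder, columns[name]
    raise KeyError(name)
```

The lookup makes visible that each of the four units in a group owns the parameters of one gate or of the candidate state, so that the four quantities for a cell can be computed side by side.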

The example of Figure 50 assumes that the architectural program performs the following steps: (1) for 31 different time steps, fills data random access memory 122 with the input X values (columns 2, 4, and so on through column 62); (2) starts the non-architectural program of Figure 51; (3) detects whether the non-architectural program has finished; (4) reads the output H values from data random access memory 122 (columns 3, 5, and so on through column 63); and (5) repeats steps (1) through (4) as many times as needed to complete a task, such as the computation required to recognize an utterance by a mobile phone user.

In another approach, the architectural program performs the following steps: (1) for a single time step, fills data random access memory 122 with the input X values (e.g., column 2); (2) starts the non-architectural program (a modified version of the non-architectural program of Figure 51 that does not need the loop and accesses only a single pair of columns of data random access memory 122); (3) detects whether the non-architectural program has finished; (4) reads the output H values from data random access memory 122 (e.g., column 3); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is preferable depends on how the input X values of the LSTM layer are sampled. For example, if the task allows the input to be sampled over multiple time steps (e.g., about 31 time steps) before the computation is performed, the first approach is preferable, since it is likely to use computing resources more efficiently and/or perform better; however, if the task allows sampling at only a single time step, the second approach must be used.

A third embodiment is similar to the second approach described above; however, rather than using a single pair of columns of data random access memory 122 as in the second approach, the non-architectural program of this approach uses multiple pairs of memory columns, that is, a different pair of columns at each time step, which is similar to the first approach. Preferably, the architectural program of this third embodiment includes a step before step (2) in which it updates the non-architectural program before starting it, for example by updating the data random access memory 122 column in the instruction at address 1 to point to the next pair of columns.

As shown in the figure, for neural processing units 0 through 511 of neural network unit 121, output buffer 1104 holds intermediate values of the cell output (H), candidate cell state (C'), input gate (I), forget gate (F), output gate (O), cell state (C), and tanh(C) after the instructions at different addresses of the non-architectural program of Figure 51 have executed. Within each output buffer word group (i.e., each group of four words of output buffer 1104 corresponding to a neural processing unit group 4901, such as words 0-3, 4-7, 8-11, and so on through 508-511), the word whose number modulo 4 equals 3 is denoted OUTBUF[3], the word whose number modulo 4 equals 2 is denoted OUTBUF[2], the word whose number modulo 4 equals 1 is denoted OUTBUF[1], and the word whose number modulo 4 equals 0 is denoted OUTBUF[0].

As shown in the figure, after the instruction at address 2 of the non-architectural program of Figure 51 executes, for each neural processing unit group 4901, all four words of output buffer 1104 are written with the initial cell output (H) value of the corresponding LSTM cell 4600. After the instruction at address 6 executes, for each neural processing unit group 4901, word OUTBUF[3] of output buffer 1104 is written with the candidate cell state (C') value of the corresponding LSTM cell 4600, and the other three words of output buffer 1104 keep their previous values. After the instruction at address 10 executes, for each neural processing unit group 4901, word OUTBUF[0] of output buffer 1104 is written with the input gate (I) value of the corresponding LSTM cell 4600, word OUTBUF[1] is written with its forget gate (F) value, word OUTBUF[2] is written with its output gate (O) value, and word OUTBUF[3] keeps its previous value. After the instruction at address 13 executes, for each neural processing unit group 4901, word OUTBUF[3] of output buffer 1104 is written with the new cell state (C) value of the corresponding LSTM cell 4600 (output buffer 1104, including the C value in slot 3, is written to column 6 of weight random access memory 124, as described below with respect to Figure 51), and the other three words of output buffer 1104 keep their previous values. After the instruction at address 14 executes, for each neural processing unit group 4901, word OUTBUF[3] of output buffer 1104 is written with the tanh(C) value of the corresponding LSTM cell 4600, and the other three words of output buffer 1104 keep their previous values. After the instruction at address 16 executes, for each neural processing unit group 4901, all four words of output buffer 1104 are written with the new cell output (H) value of the corresponding LSTM cell 4600. The execution of addresses 6 through 16 described above (that is, excluding the execution of address 2, which is not part of the program loop) repeats thirty more times as the loop instruction at address 17 returns to address 3.

Figure 51 is a table showing a program that is stored in program memory 129 of neural network unit 121 and executed by the neural network unit 121 of Figure 49, and that uses data and weights according to the layout of Figure 50 to perform the computations associated with an LSTM cell layer. The example program of Figure 51 includes 18 non-architectural instructions at addresses 0 through 17. The instruction at address 0 is an initialization instruction that clears accumulator 202 and initializes loop counter 3804 to a value of 31, so that the loop body (the instructions at addresses 3 through 17) executes 31 times. The initialization instruction also initializes the data random access memory 122 column to be written (e.g., buffer 2606 of Figure 26/Figure 39) to a value of 1, which increases to 3 after the first execution of the instruction at address 16. Preferably, the initialization instruction also places neural network unit 121 in a wide configuration, so that neural network unit 121 is configured with 512 neural processing units 126. As described in the following paragraphs, during execution of the instructions at addresses 0 through 17, the 128 neural processing unit groups 4901 formed by these 512 neural processing units 126 operate as 128 corresponding LSTM cells 4600.

The instructions at addresses 1 and 2 are outside the loop body and execute only once. They generate the initial cell output (H) value (e.g., 0) and write it to all words of output buffer 1104. The instruction at address 1 reads the initial H values from column 0 of data random access memory 122 and places them in accumulator 202, which was cleared by the instruction at address 0. The instruction at address 2 (OUTPUT PASSTHRU, NOP, CLR ACC) passes the accumulator 202 value through to output buffer 1104, as shown in Figure 50. The "NOP" designation in the output instruction at address 2 (and in the other output instructions of Figure 51) indicates that the output value is written only to output buffer 1104 and not to memory, that is, neither to data random access memory 122 nor to weight random access memory 124. The instruction at address 2 also clears accumulator 202.

The instructions at addresses 3 through 17 form the loop body and execute the number of times given by the loop count (e.g., 31).

Each execution of the instructions at addresses 3 through 6 computes the tanh(C') value of the current time step and writes it to word OUTBUF[3], which will be used by the instruction at address 11. More precisely, the multiply-accumulate instruction at address 3 reads the cell input (X) value associated with this time step from the current read column of data random access memory 122 (e.g., column 2, 4, 6, and so on through column 62), reads the Wc value from column 0 of weight random access memory 124, and multiplies them to produce a product that is added to accumulator 202, which was cleared by the instruction at address 2.

The multiply-accumulate instruction at address 4 (MULT-ACCUM OUTBUF[0], WR ROW 1) reads the H value from word OUTBUF[0] (all four neural processing units 126 of the neural processing unit group 4901 do so), reads the Uc value from column 1 of weight random access memory 124 (the column that the instruction mnemonic designates WR ROW 1), and multiplies them to produce a second product that is added to accumulator 202.

The add-weight-word-to-accumulator instruction at address 5 (ADD_W_ACC WR ROW 2) reads the Bc value from column 2 of weight random access memory 124 and adds it to accumulator 202.

The output instruction at address 6 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) applies the tanh activation function to the accumulator 202 value and writes the result only to word OUTBUF[3] (that is, within a neural processing unit group 4901, only the neural processing unit 126 whose number modulo 4 is 3 writes its result), and accumulator 202 is cleared. That is, the output instruction at address 6 masks words OUTBUF[0], OUTBUF[1], and OUTBUF[2] (as denoted by the MASK[0:2] term of the instruction) so that they keep their current values, as shown in Figure 50. In addition, the output instruction at address 6 does not write to memory (as denoted by the NOP term of the instruction).
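The masking behavior of the output instructions can be illustrated with a small model of one four-word group of output buffer 1104. This is a sketch only: `masked_write` is a hypothetical name, and the model assumes MASK[0:2] covers word indices 0 through 2 inclusive, as the description of address 6 indicates.

```python
def masked_write(outbuf_group, results, masked):
    """Return the new contents of one four-word output buffer group.

    outbuf_group: current words OUTBUF[0..3] of the group
    results:      values produced by the four neural processing units
    masked:       set of word indices that must keep their current value
    """
    return [
        old if index in masked else new
        for index, (old, new) in enumerate(zip(outbuf_group, results))
    ]


# MASK[0:2] (addresses 6, 13, 14): only OUTBUF[3] is updated.
after_mask_0_2 = masked_write(["H", "H", "H", "H"], ["r0", "r1", "r2", "r3"], {0, 1, 2})
# MASK[3] (address 10): OUTBUF[0..2] are updated, OUTBUF[3] is kept.
after_mask_3 = masked_write(after_mask_0_2, ["I", "F", "O", "unused"], {3})
```

In this model the masked words simply carry their old value forward, which is what lets the buffer act as feedback storage between instructions.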

Each execution of the instructions at addresses 7 through 10 computes the input gate (I), forget gate (F), and output gate (O) values of the current time step and writes them to words OUTBUF[0], OUTBUF[1], and OUTBUF[2], respectively; these values will be used by the instructions at addresses 11, 12, and 15. More precisely, the multiply-accumulate instruction at address 7 reads the cell input (X) value associated with this time step from the current read column of data random access memory 122 (e.g., column 2, 4, 6, and so on through column 62), reads the Wi, Wf, and Wo values from column 3 of weight random access memory 124, and multiplies them to produce a product that is added to accumulator 202, which was cleared by the instruction at address 6. More precisely, within a neural processing unit group 4901, the neural processing unit 126 whose number modulo 4 is 0 computes the product of X and Wi, the neural processing unit 126 whose number modulo 4 is 1 computes the product of X and Wf, and the neural processing unit 126 whose number modulo 4 is 2 computes the product of X and Wo.

The multiply-accumulate instruction at address 8 reads the H value from word OUTBUF[0] (all four neural processing units 126 of the neural processing unit group 4901 do so), reads the Ui, Uf, and Uo values from column 4 of weight random access memory 124, and multiplies them to produce a second product that is added to accumulator 202. More precisely, within a neural processing unit group 4901, the neural processing unit 126 whose number modulo 4 is 0 computes the product of H and Ui, the neural processing unit 126 whose number modulo 4 is 1 computes the product of H and Uf, and the neural processing unit 126 whose number modulo 4 is 2 computes the product of H and Uo.

The add-weight-word-to-accumulator instruction at address 9 (ADD_W_ACC WR ROW 5) reads the Bi, Bf, and Bo values from column 5 of weight random access memory 124 and adds them to accumulator 202. More precisely, within a neural processing unit group 4901, the neural processing unit 126 whose number modulo 4 is 0 adds the Bi value, the neural processing unit 126 whose number modulo 4 is 1 adds the Bf value, and the neural processing unit 126 whose number modulo 4 is 2 adds the Bo value.

The output instruction at address 10 (OUTPUT SIGMOID, NOP, MASK[3], CLR ACC) applies the sigmoid activation function to the accumulator 202 value and writes the computed I, F, and O values to words OUTBUF[0], OUTBUF[1], and OUTBUF[2], respectively; it also clears accumulator 202 and does not write to memory. That is, the output instruction at address 10 masks word OUTBUF[3] (as denoted by the MASK[3] term of the instruction) so that it keeps its current value (namely C'), as shown in Figure 50.

Each execution of the instructions at addresses 11 through 13 computes the new cell state (C) value produced by the current time step and writes it to column 6 of weight random access memory 124 for use in the next time step (that is, by the instruction at address 12 in the next loop iteration); more precisely, the value is written to the word of column 6 whose index modulo 4 is 3 within the four rows corresponding to a neural processing unit group 4901. In addition, each execution of the instruction at address 14 writes the tanh(C) value to OUTBUF[3] for use by the instruction at address 15.

More precisely, the multiply-accumulate instruction at address 11 (MULT-ACCUM OUTBUF[0], OUTBUF[3]) reads the input gate (I) value from word OUTBUF[0], reads the candidate cell state (C') value from word OUTBUF[3], and multiplies them to produce a first product that is added to accumulator 202, which was cleared by the instruction at address 10. More precisely, each of the four neural processing units 126 of a neural processing unit group 4901 computes the first product of the I value and the C' value.

The multiply-accumulate instruction at address 12 (MULT-ACCUM OUTBUF[1], WR ROW 6) instructs the neural processing units 126 to read the forget gate (F) value from word OUTBUF[1], read their respective words from column 6 of weight random access memory 124, and multiply them to produce a second product that is added to the first product placed in accumulator 202 by the instruction at address 11. More precisely, for the neural processing unit 126 whose index modulo 4 is 3 within a neural processing unit group 4901, the word read from column 6 is the current cell state (C) value computed in the previous time step, and the sum of the first product and the second product is the new cell state (C). For the other three neural processing units 126 of the neural processing unit group 4901, however, the word read from column 6 is a don't-care value, because the accumulated values they produce are not used: they are not placed in output buffer 1104 by the instructions at addresses 13 and 14, and they are cleared by the instruction at address 14. That is, only the new cell state (C) value produced by the neural processing unit 126 whose index modulo 4 is 3 within the neural processing unit group 4901 is used, namely by the instructions at addresses 13 and 14. For the second through thirty-first executions of the instruction at address 12, the C value read from column 6 of weight random access memory 124 is the value written by the instruction at address 13 in the previous loop iteration. For the first execution of the instruction at address 12, however, the C value in column 6 is an initial value written either by the architectural program before it starts the non-architectural program of Figure 51 or by a modified version of the non-architectural program.

The output instruction at address 13 (OUTPUT PASSTHRU, WR ROW 6, MASK[0:2]) passes the accumulator 202 value, i.e., the computed C value, through only to word OUTBUF[3] (that is, within a neural processing unit group 4901, only the neural processing unit 126 whose index modulo 4 is 3 writes its computed C value to output buffer 1104), and column 6 of weight random access memory 124 is then written with the updated output buffer 1104, as shown in Figure 50. That is, the output instruction at address 13 masks words OUTBUF[0], OUTBUF[1], and OUTBUF[2] so that they keep their current values (i.e., the I, F, and O values). As described above, within the four rows of column 6 corresponding to each neural processing unit group 4901, only the C value in the word whose index modulo 4 is 3 is used, namely by the instruction at address 12; therefore, the non-architectural program ignores the values in rows 0-2, rows 4-6, and so on through rows 508-510 of column 6 of weight random access memory 124, as shown in Figure 50 (i.e., the I, F, and O values).
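The consequence of writing the whole output buffer to column 6 can be sketched as a strided selection: only every fourth word, starting at index 3, is a cell state that the program later reads. The helper name below is hypothetical.

```python
def used_c_values(column6_words, group_size=4):
    """Pick the words of column 6 that hold C values (index modulo 4 equal to 3)."""
    return column6_words[group_size - 1 :: group_size]


# Two groups' worth of column 6 after address 13: I, F, O are ignored, C is kept.
column6 = ["I0", "F0", "O0", "C0", "I1", "F1", "O1", "C1"]
c_values = used_c_values(column6)
```

The remaining words still occupy space in column 6, which is the cost of not needing a separate masked write to memory.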

The output instruction at address 14 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) applies the tanh activation function to the accumulator 202 value and writes the computed tanh(C) value to word OUTBUF[3]; it also clears accumulator 202 and does not write to memory. Like the output instruction at address 13, the output instruction at address 14 masks words OUTBUF[0], OUTBUF[1], and OUTBUF[2] so that they keep their original values, as shown in Figure 50.

Each execution of the instructions at addresses 15 and 16 computes the cell output (H) value produced by the current time step and writes it to the column two past the current output column of data random access memory 122; this value will be read by the architectural program and used in the next time step (that is, by the instructions at addresses 3 and 7 in the next loop iteration). More precisely, the multiply-accumulate instruction at address 15 reads the output gate (O) value from word OUTBUF[2], reads the tanh(C) value from word OUTBUF[3], and multiplies them to produce a product that is added to accumulator 202, which was cleared by the instruction at address 14. More precisely, each of the four neural processing units 126 of a neural processing unit group 4901 computes the product of the O value and tanh(C).

The output instruction at address 16 passes the accumulator 202 value through and writes the computed H values to column 3 on its first execution, to column 5 on its second execution, and so on, to column 63 on its thirty-first execution, as shown in Figure 50; these values are subsequently used by the instructions at addresses 4 and 8. In addition, as shown in Figure 50, the computed H values are placed in output buffer 1104 for subsequent use by the instructions at addresses 4 and 8. The output instruction at address 16 also clears accumulator 202. In one embodiment, LSTM cell 4600 is designed so that the output instruction at address 16 (and/or the output instruction at address 22 of Figure 48) has an activation function, such as a sigmoid or tanh function, rather than passing through the accumulator 202 value.

The loop instruction at address 17 decrements loop counter 3804 and loops back to the instruction at address 3 if the new loop counter 3804 value is greater than zero.
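To make the data flow of one loop iteration concrete, the following sketch models a single neural processing unit group as it steps through addresses 3 to 16 and compares the result against the usual LSTM update equations. It is an illustration under stated assumptions, not the hardware behavior: each cell is assumed to use scalar X, H, and C values with scalar weights (as the per-cell layout of Figure 50 suggests), floating-point arithmetic stands in for the unit's number format, and the function and parameter names are hypothetical.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step_reference(x, h, c, p):
    """Conventional per-cell LSTM update with scalar weights, for comparison."""
    i = sigmoid(p["Wi"] * x + p["Ui"] * h + p["Bi"])
    f = sigmoid(p["Wf"] * x + p["Uf"] * h + p["Bf"])
    o = sigmoid(p["Wo"] * x + p["Uo"] * h + p["Bo"])
    c_cand = math.tanh(p["Wc"] * x + p["Uc"] * h + p["Bc"])
    c_new = i * c_cand + f * c
    return o * math.tanh(c_new), c_new


def lstm_step_group(x, outbuf, c_prev, p):
    """One loop iteration for one group; outbuf holds H in all four words on entry."""
    h = outbuf[0]  # addresses 4 and 8 read H from OUTBUF[0]
    # Addresses 3-6: candidate state into OUTBUF[3] (MASK[0:2]).
    outbuf[3] = math.tanh(p["Wc"] * x + p["Uc"] * h + p["Bc"])
    # Addresses 7-10: I, F, O into OUTBUF[0..2] (MASK[3]).
    for k, gate in enumerate("ifo"):
        outbuf[k] = sigmoid(p["W" + gate] * x + p["U" + gate] * h + p["B" + gate])
    # Addresses 11-13: C = I * C' + F * C_prev, passed to OUTBUF[3] and column 6.
    c_new = outbuf[0] * outbuf[3] + outbuf[1] * c_prev
    outbuf[3] = c_new
    # Address 14: tanh(C) into OUTBUF[3].
    outbuf[3] = math.tanh(c_new)
    # Addresses 15-16: H = O * tanh(C), written to all four words.
    h_new = outbuf[2] * outbuf[3]
    outbuf[:] = [h_new] * 4
    return h_new, c_new


params = {"Wi": 0.5, "Ui": -0.3, "Bi": 0.1, "Wf": 0.8, "Uf": 0.2, "Bf": 0.0,
          "Wo": -0.4, "Uo": 0.6, "Bo": 0.05, "Wc": 0.9, "Uc": -0.7, "Bc": -0.1}
outbuf, c = [0.0] * 4, 0.0          # initial H (address 2) and initial C
h_ref, c_ref = 0.0, 0.0
for x in (0.2, -0.7, 1.1):          # three illustrative time steps
    h, c = lstm_step_group(x, outbuf, c, params)
    h_ref, c_ref = lstm_step_reference(x, h_ref, c_ref, params)
```

In this model the output buffer carries H, C', I, F, O, and tanh(C) between instructions, and only the cell state makes a round trip through memory, which mirrors the role the description assigns to column 6.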

It may be observed that, because of the feedback and masking capability of output buffer 1104 in the neural network unit 121 embodiment of Figure 49, the number of instructions in the loop body of the non-architectural program of Figure 51 is approximately 34% smaller than that of the non-architectural program of Figure 48. In addition, because of the feedback and masking capability of output buffer 1104 in the neural network unit 121 embodiment of Figure 49, the memory layout in data random access memory 122 for the non-architectural program of Figure 51 accommodates approximately three times as many time steps as that of Figure 48. These improvements may benefit certain architectural program applications that use neural network unit 121 to perform LSTM cell layer computations, particularly applications in which the number of LSTM cells 4600 in the LSTM cell layer is less than or equal to 128.

The embodiments of Figures 47 through 51 assume that the weight and bias values remain the same across time steps. However, the invention is not limited to this; other embodiments in which the weight and bias values vary across time steps are also within the scope of the invention. In such embodiments, rather than filling weight random access memory 124 with a single set of weight and bias values as shown in Figures 47 through 50, a different set of weight and bias values is filled in for each time step, and the weight random access memory 124 addresses of the non-architectural programs of Figures 48 through 51 are adjusted accordingly.

Generally speaking, in the embodiments of Figures 47 through 51 described above, the weight, bias, and intermediate values (e.g., the C and C' values) are stored in weight random access memory 124, and the input and output values (e.g., the X and H values) are stored in data random access memory 122. This is advantageous for embodiments in which data random access memory 122 is dual-ported and weight random access memory 124 is single-ported, since there is more traffic from the non-architectural program and the architectural program to data random access memory 122. However, because weight random access memory 124 is larger, another embodiment of the invention swaps the memories into which the non-architectural program and the architectural program write values (i.e., swaps data random access memory 122 and weight random access memory 124). That is, the W, U, B, C', tanh(C), and C values are stored in data random access memory 122 and the X, H, I, F, and O values are stored in weight random access memory 124 (a modified embodiment of Figure 47); and the W, U, B, and C values are stored in data random access memory 122 and the X and H values are stored in weight random access memory 124 (a modified embodiment of Figure 50). Because weight random access memory 124 is larger, these embodiments can process more time steps in a batch. For applications of architectural programs that use neural network unit 121 to perform computations, this may benefit applications that profit from a larger number of time steps and for which a single-ported memory (e.g., weight random access memory 124) provides sufficient bandwidth.

Figure 52 is a block diagram showing an embodiment of neural network unit 121 in which the neural processing unit groups have output buffer masking and feedback capability and share an activation function unit 1112. The neural network unit 121 of Figure 52 is similar to the neural network unit 121 of Figure 49, and like-numbered components in the figures are similar. However, the four activation function units 212 of Figure 49 are replaced in this embodiment by a single shared activation function unit 1112, which receives the four outputs 217 of the four accumulators 202 and generates four outputs to words OUTBUF[0], OUTBUF[1], OUTBUF[2], and OUTBUF[3]. The neural network unit 121 of Figure 52 operates in a manner similar to the embodiments described above with respect to Figures 49 through 51, and it operates the shared activation function unit 1112 in a manner similar to the embodiments described above with respect to Figures 11 through 13.

Figure 53 is a block diagram showing another embodiment of the data layout within data random access memory 122, weight random access memory 124, and output buffer 1104 of the neural network unit 121 of Figure 49 when it performs the computations associated with a layer of 128 LSTM cells 4600 of Figure 46. The example of Figure 53 is similar to the example of Figure 50. In Figure 53, however, the Wi, Wf, and Wo values are in column 0 (rather than column 3 as in Figure 50); the Ui, Uf, and Uo values are in column 1 (rather than column 4 as in Figure 50); the Bi, Bf, and Bo values are in column 2 (rather than column 5 as in Figure 50); and the C values are in column 3 (rather than column 6 as in Figure 50). In addition, the contents of output buffer 1104 in Figure 53 are similar to those of Figure 50; however, because of the differences between the non-architectural programs of Figure 54 and Figure 51, the contents of the third column (i.e., the I, F, O, and C' values) appear in output buffer 1104 after the instruction at address 7 executes (rather than address 10 as in Figure 50); the contents of the fourth column (i.e., the I, F, O, and C values) appear in output buffer 1104 after the instruction at address 10 executes (rather than address 13 as in Figure 50); the contents of the fifth column (i.e., the I, F, O, and tanh(C) values) appear in output buffer 1104 after the instruction at address 11 executes (rather than address 14 as in Figure 50); and the contents of the sixth column (i.e., the H values) appear in output buffer 1104 after the instruction at address 13 executes (rather than address 16 as in Figure 50), as described below.

Figure 54 is a table showing a program that is stored in program memory 129 of neural network unit 121 and executed by the neural network unit 121 of Figure 49, and that uses data and weights according to the layout of Figure 53 to perform the computations associated with an LSTM cell layer. The example program of Figure 54 is similar to the program of Figure 51. More precisely, the instructions at addresses 0 through 5 are the same in Figure 54 and Figure 51; the instructions at addresses 7 and 8 of Figure 54 are the same as the instructions at addresses 10 and 11 of Figure 51; and the instructions at addresses 10 through 14 of Figure 54 are the same as the instructions at addresses 13 through 17 of Figure 51.

However, the instruction at address 6 of Figure 54 does not clear accumulator 202 (whereas the instruction at address 6 of Figure 51 does). In addition, the instructions at addresses 7 through 9 of Figure 51 are not present in the non-architectural program of Figure 54. Finally, the instruction at address 9 of Figure 54 is the same as the instruction at address 12 of Figure 51, except that the instruction at address 9 of Figure 54 reads column 3 of weight random access memory 124, whereas the instruction at address 12 of Figure 51 reads column 6 of weight random access memory 124.

Because of the differences between the non-architectural program of Figure 54 and that of Figure 51, the layout of Figure 53 uses three fewer columns of weight random access memory 124, and the program loop contains three fewer instructions. The loop body of the non-architectural program of Figure 54 is approximately half the size of the loop body of the non-architectural program of Figure 48, and approximately 80% of the size of the loop body of the non-architectural program of Figure 51.
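The 80% figure can be checked from the addresses given in the text, taking the loop body of Figure 51 to be addresses 3 through 17 and that of Figure 54 to be three instructions shorter. The variable names are illustrative only.

```python
fig51_loop_size = len(range(3, 17 + 1))   # addresses 3 through 17
fig54_loop_size = fig51_loop_size - 3     # three fewer instructions in the loop
ratio = fig54_loop_size / fig51_loop_size

print(f"{fig54_loop_size}/{fig51_loop_size} = {ratio:.0%}")
```

The comparison with Figure 48 is not reproduced here because the loop size of that program is not stated in this passage.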

Figure 55 is a block diagram showing portions of a neural processing unit 126 according to another embodiment of the invention. More precisely, for a single neural processing unit 126 of the multiple neural processing units 126 of Figure 49, the figure shows multitask buffer 208 and its associated inputs 207, 211, and 4905, and multitask buffer 705 and its associated inputs 206, 711, and 4907. In addition to the inputs of Figure 49, multitask buffer 208 and multitask buffer 705 of the neural processing unit 126 each receive an index-within-group (index_within_group) input 5599. The index_within_group input 5599 indicates the index of the particular neural processing unit 126 within its neural processing unit group 4901. Thus, for example, in an embodiment in which each neural processing unit group 4901 has four neural processing units 126, within each neural processing unit group 4901 one neural processing unit 126 receives a value of zero on its index_within_group input 5599, one receives a value of one, one receives a value of two, and one receives a value of three. In other words, the index_within_group input 5599 value received by a neural processing unit 126 is the remainder of its number within neural network unit 121 divided by J, where J is the number of neural processing units 126 in a neural processing unit group 4901. Thus, for example, neural processing unit 73 receives a value of one on its index_within_group input 5599, neural processing unit 353 receives a value of three on its index_within_group input 5599, and neural processing unit 6 receives a value of two on its index_within_group input 5599.
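The remainder rule stated above is a one-line computation. The sketch below uses a hypothetical Python function that reuses the signal name index_within_group and assumes J = 4 as in the example.

```python
def index_within_group(npu_number: int, group_size: int = 4) -> int:
    """Remainder of the unit's number divided by J (units per group)."""
    return npu_number % group_size


indices = {n: index_within_group(n) for n in (0, 6, 73, 511)}
```

Because the value depends only on the unit's fixed position, it can be a hard-wired constant per unit rather than a programmed setting, though the text does not say how input 5599 is driven.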

In addition, when control input 213 specifies a predetermined value, denoted "SELF" herein, multitask buffer 208 selects the output buffer 1104 output 4905 that corresponds to the value of index_within_group input 5599. Thus, when a non-architectural instruction specifies receiving data from output buffer 1104 with the value SELF (denoted OUTBUF[SELF] in the instructions at addresses 2 and 7 of Figure 57), the multitask buffer 208 of each neural processing unit 126 receives its corresponding word from output buffer 1104. Thus, for example, when neural network unit 121 executes the non-architectural instructions at addresses 2 and 7 of Figure 57, the multitask buffer 208 of neural processing unit 73 selects the second (index 1) of its four inputs 4905 to receive word 73 from output buffer 1104, the multitask buffer 208 of neural processing unit 353 selects the fourth (index 3) of its four inputs 4905 to receive word 353 from output buffer 1104, and the multitask buffer 208 of neural processing unit 6 selects the third (index 2) of its four inputs 4905 to receive word 6 from output buffer 1104. Although not used in the non-architectural program of Figure 57, a non-architectural instruction may also specify receiving data from output buffer 1104 with the value SELF (OUTBUF[SELF]) so that control input 713 specifies the predetermined value and the multitask buffer 705 of each neural processing unit 126 receives its corresponding word from output buffer 1104.
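The effect of the SELF selection can be sketched as follows, assuming that the four inputs 4905 of a unit are the four output buffer words of its own group, in order. The function name is hypothetical, and the model is only meant to show that each unit ends up with the word bearing its own number.

```python
def select_self(output_buffer, npu_number, group_size=4):
    """Model of OUTBUF[SELF]: pick, among the group's words, the unit's own word."""
    group_start = (npu_number // group_size) * group_size
    inputs_4905 = output_buffer[group_start : group_start + group_size]
    return inputs_4905[npu_number % group_size]


output_buffer = [f"word{n}" for n in range(512)]
received = [select_self(output_buffer, n) for n in range(512)]
```

Under this assumption SELF behaves as an identity feedback path, in contrast to a fixed selector such as OUTBUF[0], which gives every unit in the group the same word.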

Figure 56 is a block diagram showing an example of the layout of data within the data RAM 122 and the weight RAM 124 of the neural network unit 121 as the neural network unit 121 performs the calculations associated with the Jordan recurrent neural network of Figure 43 using the embodiment of Figure 55. The layout of the weights within the weight RAM 124 is the same as in the example of Figure 44. The layout of the values within the data RAM 122 is similar to that of Figure 44, except that in this example each time step has an associated pair of two rows of memory that hold the input layer node D values and the output layer node Y values, rather than a group of four rows as in the example of Figure 44. That is, in this example the hidden layer Z values and the context layer C values are not written to the data RAM 122. Instead, the output buffer 1104 serves as a scratchpad for the hidden layer Z values and the context layer C values, as described in detail with respect to the non-architectural program of Figure 57. The OUTBUF[SELF] feedback feature of the output buffer 1104 can make the non-architectural program run faster (because two writes and two reads of the data RAM 122 are replaced by two writes and two reads of the output buffer 1104) and reduces the data RAM 122 space used by each time step, so that the data held in the data RAM 122 in this embodiment can serve approximately twice as many time steps as the embodiment of Figures 44 and 45, namely 32 time steps, as shown.
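The two-rows-per-time-step layout can be written down as a small row-address calculation. The sketch below is illustrative only; the helper names are hypothetical, and the total of 64 data RAM rows is an assumption inferred from the rows 0 to 63 used in this example.

```python
ROWS_PER_TIME_STEP = 2  # one row of D values followed by one row of Y values
DATA_RAM_ROWS = 64      # assumed from rows 0..63 being used in the example


def d_row(time_step: int) -> int:
    """Data RAM row holding the input layer node D values of a time step."""
    return ROWS_PER_TIME_STEP * time_step


def y_row(time_step: int) -> int:
    """Data RAM row holding the output layer node Y values of a time step."""
    return ROWS_PER_TIME_STEP * time_step + 1


def max_time_steps(rows: int = DATA_RAM_ROWS,
                   rows_per_step: int = ROWS_PER_TIME_STEP) -> int:
    """Number of time steps that fit in the data RAM."""
    return rows // rows_per_step


print(d_row(31), y_row(31), max_time_steps())
```

The row numbers produced agree with those given for the instructions at addresses 4 and 9 of Figure 57 (rows 0 and 1 for time step 0 through rows 62 and 63 for time step 31).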

Figure 57 is a table showing a program stored in the program memory 129 of the neural network unit 121 and executed by the neural network unit 121, which uses data and weights according to the layout of Figure 56 to implement a Jordan recurrent neural network. The non-architectural program of Figure 57 is similar to the non-architectural program of Figure 45; the differences are described below.

The example program of Figure 57 has 12 non-architectural instructions, at addresses 0 through 11. The initialize instruction at address 0 clears the accumulator 202 and initializes the loop counter 3804 to a value of 32, so that the loop body (the instructions at addresses 2 through 11) is executed 32 times. The output instruction at address 1 puts the zero values of the accumulator 202 (cleared by the instruction at address 0) into the output buffer 1104. As may be observed, the 512 NPUs 126 correspond to and operate as the 512 hidden layer nodes Z during execution of the instructions at addresses 2 through 6, and correspond to and operate as the 512 output layer nodes Y during execution of the instructions at addresses 7 through 10. That is, the 32 executions of the instructions at addresses 2 through 6 compute the hidden layer node Z values for the 32 corresponding time steps and put them into the output buffer 1104, where they are used by the corresponding 32 executions of the instructions at addresses 7 through 9 to compute the output layer node Y values of the 32 corresponding time steps and write them to the data RAM 122, and by the corresponding 32 executions of the instruction at address 10 to put the context layer node C values of the 32 corresponding time steps into the output buffer 1104. (The context layer node C values of the 32nd time step that are put into the output buffer 1104 are not used.)

During the first execution of the instructions at addresses 2 and 3 (ADD_D_ACC OUTBUF[SELF] and ADD_D_ACC ROTATE, COUNT=511), each of the 512 NPUs 126 adds the 512 context node C values of the output buffer 1104 to its accumulator 202; these context node C values were generated and written by the execution of the instructions at addresses 0 and 1. During the second and subsequent executions of the instructions at addresses 2 and 3, each of the 512 NPUs 126 adds the 512 context node C values of the output buffer 1104 to its accumulator 202; these context node C values were generated and written by the execution of the instructions at addresses 7, 8 and 10. More precisely, the instruction at address 2 instructs the multiplexed register 208 of each NPU 126 to select its corresponding output buffer 1104 word, as described above, and add it to the accumulator 202; the instruction at address 3 instructs the NPUs 126 to rotate the context node C values in the 512-word rotator formed by the collective operation of the connected multiplexed registers 208 of the 512 NPUs 126, so that each NPU 126 adds all 512 context node C values to its accumulator 202. The instruction at address 3 does not clear the accumulator 202, so that the instructions at addresses 4 and 5 can add the input layer node D values (multiplied by their corresponding weights) to the context layer node C values accumulated by the instructions at addresses 2 and 3.
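The effect of the instructions at addresses 2 and 3 can be sketched in Python with a small rotator. This is a behavioural illustration only: the function name is hypothetical, the rotation direction (each NPU taking the word of its lower-numbered neighbour) is an assumption, and a four-word rotator stands in for the 512-word rotator.

```python
def add_context_with_rotator(outbuf, acc=None):
    """Model ADD_D_ACC OUTBUF[SELF] followed by ADD_D_ACC ROTATE, COUNT=n-1.

    outbuf: one context value per NPU; acc: starting accumulator values.
    Returns the accumulator of every NPU after the two instructions.
    """
    n = len(outbuf)
    acc = [0] * n if acc is None else list(acc)
    # Address 2: each NPU latches its own output buffer word and adds it.
    mux = list(outbuf)
    acc = [a + m for a, m in zip(acc, mux)]
    # Address 3: rotate n-1 times; after each rotation every NPU adds the
    # word it has just received (rotation direction is an assumption).
    for _ in range(n - 1):
        mux = [mux[(i - 1) % n] for i in range(n)]
        acc = [a + m for a, m in zip(acc, mux)]
    return acc


print(add_context_with_rotator([1, 2, 3, 4]))
```

Whatever the rotation direction, after n-1 rotations every NPU has seen every word exactly once, so each accumulator ends up holding the sum of all context values.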

During each execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW+2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), each of the 512 NPUs 126 performs 512 multiply operations, multiplying the 512 input node D values in the row of the data RAM 122 associated with the current time step (for example, row 0 for time step 0, row 2 for time step 1, and so on through row 62 for time step 31) by the weights in the column corresponding to that NPU 126 in rows 0 through 511 of the weight RAM 124, to generate 512 products that, together with the result accumulated over the 512 context node C values by the instructions at addresses 2 and 3, are accumulated into the accumulator 202 of the corresponding NPU 126 to compute the hidden layer node Z values.
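The multiply-accumulate with rotation can likewise be sketched in Python. The sketch rests on two assumptions that are not stated in this passage: the rotator passes each data word to the next higher-numbered NPU, and the weight RAM rows are pre-arranged (here by the hypothetical helper `skew_weights`) so that the weight an NPU reads in a given step matches the data word it currently holds. Under those assumptions each accumulator ends up with an ordinary dot product.

```python
def skew_weights(w):
    """Arrange a weight matrix w[i][j] (input node i to NPU j) into the
    row order assumed by mac_with_rotator: row k, column j holds the
    weight for the data word that NPU j holds after k rotations."""
    n = len(w)
    return [[w[(j - k) % n][j] for j in range(n)] for k in range(n)]


def mac_with_rotator(d_row, weight_rows, acc):
    """Model MULT-ACCUM DR ROW, WR ROW 0 followed by
    MULT-ACCUM ROTATE, WR ROW+1, COUNT=n-1 for n NPUs."""
    n = len(d_row)
    acc = list(acc)
    mux = list(d_row)  # first instruction: NPU j reads data word j
    for k in range(n):
        if k > 0:
            # Rotate: NPU j takes the word held by NPU j-1 (assumed direction).
            mux = [mux[(j - 1) % n] for j in range(n)]
        # Every NPU multiplies its current word by its weight in row k.
        acc = [acc[j] + mux[j] * weight_rows[k][j] for j in range(n)]
    return acc


w = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
d = [1, 10, 100]
print(mac_with_rotator(d, skew_weights(w), [0, 0, 0]))
```

Because the accumulators are passed in rather than cleared, the same routine models the hidden layer computation, where the products are added on top of the context sum already held in the accumulator 202.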

During each execution of the instruction at address 6 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 accumulator 202 values of the 512 NPUs 126 are passed through and written to their corresponding words of the output buffer 1104, and the accumulator 202 is cleared.

During each execution of the instructions at addresses 7 and 8 (MULT-ACCUM OUTBUF[SELF], WR ROW 512 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), each of the 512 NPUs 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in the output buffer 1104 (generated and written by the corresponding execution of the instructions at addresses 2 through 6) by the weights in the column corresponding to that NPU 126 in rows 512 through 1023 of the weight RAM 124, to generate 512 products that are accumulated into the accumulator 202 of the corresponding NPU 126.

During each execution of the instruction at address 9 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW+2), an activation function (for example, hyperbolic tangent, sigmoid or rectify) is applied to the 512 accumulated values to compute the output node Y values, which are written to the row of the data RAM 122 associated with the current time step (for example, row 1 for time step 0, row 3 for time step 1, and so on through row 63 for time step 31). The instruction at address 9 does not clear the accumulator 202.

During each execution of the instruction at address 10 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 values accumulated by the instructions at addresses 7 and 8 are put into the output buffer 1104 for use by the next execution of the instructions at addresses 2 and 3, and the accumulator 202 is cleared.

The loop instruction at address 11 decrements the loop counter 3804 and, if the new loop counter 3804 value is still greater than zero, loops back to the instruction at address 2.
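Taken together, the twelve instructions can be summarized by the following behavioural sketch in Python. It is a simplified model under stated assumptions, not the program of Figure 57 itself: the function and parameter names are hypothetical, the weights are given as plain matrices `w_in[i][j]` and `w_out[i][j]` (node i to NPU j) rather than as weight RAM rows, the rotator is replaced by direct sums, and the data RAM is modelled as a dictionary keyed by row number.

```python
import math


def run_jordan_program(d_inputs, w_in, w_out, act=math.tanh):
    """Behavioural model of the loop described for Figure 57.

    d_inputs: one input vector (D values) per time step.
    Returns {data_ram_row: Y values}, using row 2*t+1 for time step t.
    """
    n = len(w_in)
    acc = [0.0] * n             # address 0: clear accumulators, set loop count
    outbuf = list(acc)          # address 1: zeros go to the output buffer
    data_ram = {}
    for t, d in enumerate(d_inputs):              # address 11 closes this loop
        c_sum = sum(outbuf)                       # addresses 2-3: every NPU adds
        acc = [a + c_sum for a in acc]            #   all context values C
        acc = [acc[j] + sum(d[i] * w_in[i][j] for i in range(n))
               for j in range(n)]                 # addresses 4-5: add D * weights
        outbuf, acc = acc, [0.0] * n              # address 6: Z to outbuf, clear
        acc = [sum(outbuf[i] * w_out[i][j] for i in range(n))
               for j in range(n)]                 # addresses 7-8: Z * weights
        data_ram[2 * t + 1] = [act(a) for a in acc]   # address 9: write Y row
        outbuf, acc = acc, [0.0] * n              # address 10: C to outbuf, clear
    return data_ram


identity = [[1, 0], [0, 1]]
rows = run_jordan_program([[1, 0], [0, 0]], identity, identity, act=lambda x: x)
print(rows)
```

In the usage above, a two-node network with identity weights and a pass-through activation makes the feedback visible: the value produced at time step 0 reaches both hidden nodes at time step 1 through the context values held in the output buffer.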

As described in the section corresponding to Figure 44, in the example in which the non-architectural program of Figure 57 implements a Jordan recurrent neural network, although an activation function is applied to the accumulator 202 values to generate the output layer node Y values, this example assumes that the accumulator 202 values, before the activation function is applied, are passed to the context layer nodes C, rather than the actual output layer node Y values. However, for a Jordan recurrent neural network in which the activation function is applied to the accumulator 202 values to generate the context layer nodes C, the instruction at address 10 would be removed from the non-architectural program of Figure 57. In the embodiments described herein, the Elman or Jordan recurrent neural network has a single hidden node layer (for example, Figures 40 and 42); however, it should be understood that embodiments of the processor 100 and the neural network unit 121 may, in a manner similar to that described herein, efficiently perform the calculations associated with recurrent neural networks having multiple hidden layers.

As described above in the section corresponding to Figure 2, each NPU 126 operates as a neuron of an artificial neural network, and all the NPUs 126 of the neural network unit 121 efficiently compute the neuron output values of one layer of the network in a massively parallel fashion. The parallelism of the neural network unit, in particular the use of the rotator formed collectively by the NPU multiplexed registers, is not something that the conventional way of computing neuron layer outputs would intuitively suggest. More specifically, the conventional approach typically performs the calculations associated with a single neuron or a relatively small subset of neurons (for example, performing the multiplies and adds with parallel arithmetic units), then moves on to the calculations associated with the next neuron of the same layer, and so on in serial fashion until the calculations have been performed for all the neurons of the layer. In contrast, in the present invention, during each clock cycle all the NPUs 126 (neurons) of the neural network unit 121 perform in parallel a small subset of the calculations needed to generate all the neuron outputs (for example, a single multiply and accumulate). After approximately M clock cycles, where M is the number of nodes connected into the current layer, the neural network unit 121 will have computed the outputs of all the neurons. For many artificial neural network configurations, because of the large number of NPUs 126, the neural network unit 121 is able to compute the neuron output values for all the neurons of an entire layer by the end of the M clock cycles. As described herein, this computation is efficient for all types of artificial neural networks, including but not limited to feedforward and recurrent neural networks, such as Elman, Jordan and long short-term memory (LSTM) networks. Finally, although in the embodiments herein the neural network unit 121 is configured as 512 NPUs 126 (for example, in a wide word configuration) to perform the calculations of a recurrent neural network, the present invention is not limited thereto; embodiments in which the neural network unit 121 is configured as 1024 NPUs 126 (for example, in a narrow word configuration) to perform the calculations of a recurrent neural network, as well as neural network units 121 having numbers of NPUs 126 other than 512 and 1024 as described above, are also within the scope of the present invention.

The foregoing describes only preferred embodiments of the present invention and should not be taken to limit the scope in which the invention may be practiced; that is, all simple equivalent changes and modifications made in accordance with the claims and the description of the invention remain within the scope covered by this patent. For example, software can enable the function, fabrication, modeling, simulation, description and/or testing of the apparatus and methods described herein. This can be accomplished through the use of general programming languages (such as C and C++), hardware description languages (HDL) including Verilog HDL, VHDL and so on, or other available programs. Such software can be disposed in any known computer-usable medium, such as magnetic tape, semiconductor, magnetic disk or optical disc (for example, CD-ROM or DVD-ROM), or a network, wireless or other communications medium. Embodiments of the apparatus and methods described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (for example, embodied in a hardware description language), and transformed into hardware through the production of integrated circuits. In addition, the apparatus and methods described herein may be embodied as a combination of hardware and software. Therefore, none of the embodiments described herein is intended to limit the scope of the present invention. Furthermore, the present invention may be applied to a microprocessor device of a general-purpose computer. Finally, those of ordinary skill in the art can use the concepts and embodiments disclosed herein as a basis for designing and modifying other structures that achieve the same purposes without departing from the scope of the present invention.

Claims

1. A neural network unit that performs, in a mixed manner, a convolution operation or a multiply-accumulate and activation function operation, characterized by comprising:

a register programmed with a reciprocal field and an activation function field;

a first memory;

a second memory; and

a plurality of neural processing units (NPUs), each of the neural processing units comprising:

an arithmetic logic unit that performs a series of arithmetic logic operations on a series of operands to generate a series of results;

an accumulator, into which the arithmetic logic unit stores an accumulated value accumulated from the series of results;

a plurality of activation function units, each of which performs an activation function on the accumulated value to generate an activation function result; and

a reciprocal multiplier that receives a reciprocal value of a divisor and the accumulated value to generate a reciprocal multiplication result, the reciprocal multiplication result being the quotient of the accumulated value and the divisor,

wherein the activation function field specifies the activation function for the accumulated value, and the reciprocal field specifies the reciprocal value of the divisor,

wherein, when the neural network unit performs the convolution operation, the first memory holds the elements of a data matrix and the second memory holds the elements of a convolution kernel, and the accumulated value is the result of the convolution of the convolution kernel with the elements of a corresponding submatrix of the data matrix,

wherein, when the neural network unit performs the multiply-accumulate and activation function operation, the first memory holds a plurality of data words and the second memory holds a plurality of weight words, and the accumulated value is the accumulated result of a plurality of products generated from the data words and the weight words.

2. The neural network unit according to claim 1, characterized in that the reciprocal field programmed into the register includes a first value and a second value, the first value being the reciprocal value of the divisor with leading zeros suppressed, and the second value specifying the number of leading zeros suppressed in the first value.

3. The neural network unit according to claim 2, characterized in that the reciprocal multiplier comprises:

a multiplier that multiplies the first value by the accumulated value to generate a product; and

a shifter that shifts the product right by the second value to generate the quotient.

4. The neural network unit according to claim 2, characterized in that the operations performed by the arithmetic logic unit are integer operations and the accumulated value is a fixed-point value, the neural network unit further comprising a second register programmed with a position of a binary point of the accumulated value accumulated into the accumulator.

5. The neural network unit according to claim 2, characterized in that the operations performed by the arithmetic logic unit are integer operations and the operands are fixed-point values, the neural network unit further comprising a second register programmed with a position of a binary point of the operands.

6. The neural network unit according to claim 2, characterized in that the operations performed by the arithmetic logic unit are integer operations and the accumulated value is a fixed-point value, the neural network unit further comprising a second register programmed with a position of a binary point of the quotient.

7. The neural network unit according to claim 1, characterized in that the divisor is the number of elements of the convolution kernel.

8. The neural network unit according to claim 1, characterized in that the divisor is the sum of the values of the elements of the convolution kernel.

9. The neural network unit according to claim 1, characterized in that the accumulated value is a sum of the element values of a corresponding submatrix of the data matrix.

10. The neural network unit according to claim 9, characterized in that the divisor is the number of elements of the submatrix.

11. A method for operating a neural network unit that performs, in a mixed manner, a convolution operation or a multiply-accumulate and activation function operation, characterized in that the neural network unit has a register, a first memory, a second memory and a plurality of neural processing units, each of the neural processing units having an arithmetic logic unit, an accumulator, a plurality of activation function units and a reciprocal multiplier, the method comprising:

programming the register with a reciprocal field and an activation function field;

performing, using each arithmetic logic unit, a series of arithmetic logic operations on a series of operands to generate a series of results;

storing, using each arithmetic logic unit, an accumulated value accumulated from the series of results into the accumulator;

performing, using each activation function unit, an activation function on the accumulated value to generate an activation function result; and

receiving, using the reciprocal multiplier, a reciprocal value of a divisor and the accumulated value to generate a reciprocal multiplication result, the reciprocal multiplication result being the quotient of the accumulated value and the divisor,

12. The method according to claim 11, characterized in that the reciprocal field programmed into the register includes a first value and a second value, the first value being the reciprocal value of the divisor with leading zeros suppressed, and the second value specifying the number of leading zeros suppressed in the first value.

13. The method according to claim 12, characterized by further comprising:

multiplying, using each reciprocal multiplier, the first value by the accumulated value to generate a product; and

shifting, using each reciprocal multiplier, the product right by the second value to generate the quotient.

14. The method according to claim 12, characterized in that the operations performed by the arithmetic logic unit are integer operations and the accumulated value is a fixed-point value, the method further comprising programming a second register with a position of a binary point of the accumulated value accumulated into the accumulator.

15. The method according to claim 12, characterized in that the operations performed by the arithmetic logic unit are integer operations and the operands are fixed-point values, the method further comprising programming a second register with a position of a binary point of the operands.

16. The method according to claim 12, characterized in that the operations performed by the arithmetic logic unit are integer operations and the accumulated value is a fixed-point value, the method further comprising programming a second register with a position of a binary point of the quotient.
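By way of illustration only, the reciprocal multiplication recited in claims 2, 3, 12 and 13 can be sketched in Python. The encoding below is an assumption made for this sketch and is not taken from the claims: the first value is modelled as an 8-bit mantissa that begins at the first 1 bit of the binary fraction 1/divisor, the second value counts the zeros suppressed in front of it, and the right shift includes the fixed mantissa width in addition to the second value. The function names are hypothetical.

```python
MANTISSA_BITS = 8  # assumed width of the first value; not given in the claims


def encode_reciprocal(divisor: int, bits: int = MANTISSA_BITS):
    """Return (first_value, second_value) for 1/divisor.

    first_value: mantissa of the reciprocal with leading zeros suppressed.
    second_value: number of leading zeros that were suppressed.
    """
    if divisor < 2:
        raise ValueError("this sketch only handles divisors of 2 or more")
    zeros = 0
    # Grow the shift until the mantissa's top bit is set.
    while (1 << (zeros + bits)) // divisor < (1 << (bits - 1)):
        zeros += 1
    mantissa = (1 << (zeros + bits)) // divisor
    return mantissa, zeros


def reciprocal_multiply(acc: int, mantissa: int, zeros: int,
                        bits: int = MANTISSA_BITS) -> int:
    """Approximate acc / divisor with one multiply and one right shift.

    Intended for non-negative acc; the result can be one less than the
    exact quotient because the mantissa is truncated.
    """
    product = acc * mantissa           # multiplier of claim 3
    return product >> (zeros + bits)   # shifter of claim 3 (plus fixed width)


first, second = encode_reciprocal(4)
print(first, second, reciprocal_multiply(100, first, second))
```

The design point the sketch is meant to convey is that a division by a value known in advance, such as the number of elements of a convolution kernel or submatrix, reduces to a multiplication and a shift once the reciprocal has been programmed.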