CN108804139A - Programmable device and its operating method and computer usable medium - Google Patents


Info

Publication number
CN108804139A
CN108804139A (application CN201810620150.0A)
Authority
CN
China
Prior art keywords
npu
data
register
instruction
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810620150.0A
Other languages
Chinese (zh)
Other versions
CN108804139B (en)
Inventor
G. Glenn Henry
Terry Parks
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhaoxin Semiconductor Co Ltd
Original Assignee
Shanghai Zhaoxin Integrated Circuit Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhaoxin Integrated Circuit Co Ltd
Publication of CN108804139A
Application granted
Publication of CN108804139B
Legal status: Active

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 Register arrangements
    • G06F 9/30101 Special purpose registers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/32 Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F 9/322 Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • G06F 9/325 Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address for loops, e.g. loop detection or loop counter
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/32 Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F 9/322 Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • G06F 9/327 Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address for interrupts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The present invention relates to a programmable device, its operating method, and a computer usable medium. The programmable device includes: a program memory for holding instructions of a program fetched and executed by the device; a data memory for holding data processed by the instructions; and a status register for holding state having the following fields: a program memory address from which the most recent instruction was fetched from the program memory; a data memory access address at which the device most recently accessed data in the data memory; and a repeat count indicating a number of times an operation specified in the current program instruction remains to be performed. A condition register has condition fields corresponding to the status register fields. Control logic, in response to detecting that the state held in the status register satisfies a condition specified in the condition register, generates an interrupt request to a processing core.

Description

Programmable device and its operating method and computer usable medium
Technical field
The present invention relates to a neural network unit that interrupts a processing core based on a condition.
Background technology
Recently, artificial neural network (artificial neural network, ANN) has attracted the interest of people again, And this research is commonly known as deep learning, computer learning and similar term.General processor computing capability carries Height is so that recur to the interest to be subsided decades ago.The more recent application of ANN includes speech recognition and image recognition etc..For The demand of the performance and efficiency of improving calculating associated with ANN is increasing.
Summary
A programmable device, including: an output for generating an interrupt request to a processing core coupled to the device; a program memory for holding instructions of a program fetched and executed by the device; a data memory for holding data processed by the instructions; a status register for holding state updated by the device during its operation, the state having fields including: a program memory address from which the most recent instruction was fetched from the program memory; a data memory access address at which the device most recently accessed data in the data memory; and a repeat count indicating a number of times an operation specified in the current program instruction remains to be performed; a condition register having condition fields corresponding to the state fields of the status register, wherein the condition register is writable, via instructions of the program, with a condition that includes one or more of the condition fields; and control logic for generating, on the output, the interrupt request to the processing core in response to detecting that the state held in the status register satisfies the condition specified in the condition register.
A method of operating a device, the device including: a program memory for holding instructions of a program fetched and executed by the device; a data memory for holding data processed by the instructions; and a status register for holding state updated by the device during its operation, wherein the state has fields including: a program memory address from which the most recent instruction was fetched from the program memory; a data memory access address at which the device most recently accessed data in the data memory; and a repeat count indicating a number of times an operation specified in the current program instruction remains to be performed; the device further including a condition register having condition fields corresponding to the state fields held in the status register. The method includes: writing to the condition register, via an instruction of the program, a condition that includes one or more of the condition fields; and in response to detecting that the state held in the status register satisfies the condition specified in the condition register, generating an interrupt request to a processing core.
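A behavioral sketch of this status/condition mechanism may help. The Python below models the match-and-interrupt rule; the field names and the per-field equality match are illustrative assumptions, not the patent's actual register encoding.

```python
# Behavioral sketch of the condition-triggered interrupt described above.
# Field names and the match rule (equality on each field the program
# enables) are illustrative assumptions only.

STATUS_FIELDS = ("pc", "data_addr", "repeat_count")

def interrupt_requested(status: dict, condition: dict) -> bool:
    """True when every field named in the condition register matches
    the corresponding status-register field."""
    return all(status[f] == v for f, v in condition.items())

# The program writes a condition naming one or more fields ...
condition = {"pc": 12, "repeat_count": 0}

# ... and control logic compares it against the live status.
assert not interrupt_requested(
    {"pc": 11, "data_addr": 3, "repeat_count": 0}, condition)
assert interrupt_requested(
    {"pc": 12, "data_addr": 7, "repeat_count": 0}, condition)
```

Fields omitted from the condition (here, `data_addr`) are don't-cares, so the program can trigger on as coarse or as fine a state match as it needs.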
A non-transitory computer usable medium comprising a computer usable program that causes a computer to function as each of the components of a processor described herein.
Brief description of the drawings
Fig. 1 is a block diagram illustrating a processor that includes a neural network unit (NNU).
Fig. 2 is a block diagram illustrating a NPU of Fig. 1.
Fig. 3 is a block diagram illustrating an embodiment of the arrangement of the N multiplexed registers (mux-regs) of the N NPUs of the NNU of Fig. 1, to illustrate their operation as an N-word rotater, or circular shifter, for a row of data words received from the data RAM of Fig. 1.
Fig. 4 is a table illustrating a program for storage in the program memory of, and execution by, the NNU of Fig. 1.
Fig. 5 is a timing diagram illustrating the execution of the program of Fig. 4 by the NNU.
Fig. 6A is a block diagram illustrating the NNU of Fig. 1 executing the program of Fig. 4.
Fig. 6B is a flowchart illustrating operation of the processor of Fig. 1 to perform an architectural program that uses the NNU to perform multiply-accumulate-activation-function computations (such as those performed by the program of Fig. 4) classically associated with neurons of a hidden layer of an artificial neural network.
Fig. 7 is a block diagram illustrating a NPU of Fig. 1 according to an alternate embodiment.
Fig. 8 is a block diagram illustrating a NPU of Fig. 1 according to another alternate embodiment.
Fig. 9 is a table illustrating a program for storage in the program memory of, and execution by, the NNU of Fig. 1.
Figure 10 is a timing diagram illustrating the execution of the program of Fig. 9 by the NNU.
Figure 11 is a block diagram illustrating an embodiment of the NNU of Fig. 1. In the embodiment of Figure 11, a neuron is split into two portions, namely an activation function unit portion and an ALU portion (which also includes the shift register portion), and each activation function unit portion is shared by multiple ALU portions.
Figure 12 is a timing diagram illustrating the execution of the program of Fig. 4 by the NNU of Figure 11.
Figure 13 is a second timing diagram illustrating the execution of the program of Fig. 4 by the NNU of Figure 11.
Figure 14 is a block diagram illustrating a move to neural network (MTNN) architectural instruction and its operation with respect to portions of the NNU of Fig. 1.
Figure 15 is a block diagram illustrating a move from neural network (MFNN) architectural instruction and its operation with respect to portions of the NNU of Fig. 1.
Figure 16 is a block diagram illustrating an embodiment of the data RAM of Fig. 1.
Figure 18 is the block diagram for the dynamically configurable NPU for showing Fig. 1.
Figure 19 is the arrangement of the 2N multiplexing register of N number of NPU of the NNU for the Fig. 1 for showing the embodiment according to Figure 18 The block diagram of embodiment, to illustrate 2N multiplexing register as the wheel from the data RAM of Fig. 1 data line words received Turn the operation of device.
Figure 20 is shown in the program storage of the NNU for being stored in Fig. 1 and by the table of the NNU programs executed, The wherein described NNU has the NPU of the embodiment according to Figure 18.
Figure 21 is to show that NNU executes the sequence diagram of the program of Figure 20, and the wherein NNU includes being operated in narrow configuration for Figure 18 NPU.
Figure 22 is the block diagram for the NNU for showing Fig. 1, and the wherein NNU includes the NPU of Figure 18 to execute the program of Figure 20.
Figure 23 is the block diagram for the dynamically configurable NPU for showing Fig. 1 according to alternative embodiment.
Figure 24 is to show that the NNU of Fig. 1 executes the exemplary block diagram of data structure used in convolution algorithm.
Figure 25 is to show that the processor of Fig. 1 executes the flow chart of the operation of framework program, and the framework program will be to convolution Core executes data arrays of the NNU for Figure 24 of convolution.
Figure 26 A are the program listings of NNU programs, and wherein the NNU programs execute data matrix using the convolution kernel of Figure 24 Convolution is simultaneously write back weight RAM.
Figure 26 B are the block diagrams of the specific fields of the control register for the NNU for showing Fig. 1 according to one embodiment.
Figure 27 is the exemplary block diagram of the weight RAM for the Fig. 1 for showing to be filled with input data, and wherein the NNU of Fig. 1 is to described defeated Enter data and carries out pond (pooling) operation.
Figure 28 is the program listing of NNU programs, and wherein the NNU programs carry out pond operation to the input data matrix of Figure 27 And write back weight RAM.
Figure 29 A are the block diagrams of the embodiment for the control register for showing Fig. 1.
Figure 29 B are the block diagrams of the embodiment for the control register for showing Fig. 1 according to alternative embodiment.
Figure 29 C are the block diagrams for the embodiment reciprocal for showing Figure 29 A for being stored as two parts according to one embodiment.
Figure 30 is the block diagram of the embodiment for the AFU that Fig. 2 is shown in further detail.
Figure 31 is the example of the operation of the AFU of Figure 30.
Figure 32 is the second example of the operation of the AFU of Figure 30.
Figure 33 is the third example of the operation of the AFU of Figure 30.
Figure 34 is the block diagram of the more detailed part of the NNU of the processor and Fig. 1 that show Fig. 1.
Figure 35 is a block diagram illustrating a processor that includes a variable rate NNU.
Figure 36A is a timing diagram illustrating an example of operation of the processor with the NNU operating in normal mode, i.e., at the primary clock rate.
Figure 36B is a timing diagram illustrating an example of operation of the processor with the NNU operating in relaxed mode, i.e., at a rate less than the primary clock rate.
Figure 37 is a flowchart illustrating operation of the processor of Figure 35.
Figure 38 is a block diagram illustrating the sequencer of the NNU in more detail.
Figure 39 is a block diagram illustrating certain fields of the control and status register of the NNU.
Figure 40 is a block diagram illustrating an embodiment of a portion of the NNU.
Figure 41 is a block diagram illustrating a processor.
Figure 42 is a block diagram illustrating in more detail the ring stop of Figure 41.
Figure 43 is a block diagram illustrating in more detail the slave interface of Figure 42.
Figure 44 is a block diagram illustrating in more detail master interface 0 of Figure 42.
Figure 45 is a block diagram illustrating portions of the ring stop of Figure 42 and of a ring bus-coupled embodiment of the NNU.
Figure 46 is a block diagram illustrating a ring bus-coupled embodiment of the NNU.
Figure 47 is a block diagram illustrating an embodiment of the NNU.
Figure 48 is a block diagram illustrating in more detail the interrupt condition register of Figure 47.
Figure 49 is a block diagram illustrating in more detail the status register of Figure 47.
Figure 50 is a flowchart illustrating operation of the NNU of Figure 47 to generate an interrupt request to the core based on a condition.
Figure 51 is a table illustrating a program for storage in the program memory of, and execution by, the NNU of Figure 47.
Figure 52 illustrates a set interrupt condition instruction for storage in the program memory of, and execution by, the NNU of Figure 47 according to an alternate embodiment.
Figure 53 is a table illustrating a program for storage in the program memory of, and execution by, the NNU of Figure 47.
Detailed description
Processor with architectural neural network unit
Referring now to Fig. 1, a block diagram illustrating a processor 100 that includes a neural network unit (NNU) 121 is shown. The processor 100 includes an instruction fetch unit 101, an instruction cache 102, an instruction translator 104, a rename unit 106, reservation stations 108, media registers 118, general purpose registers (GPR) 116, execution units 112 other than the NNU 121, and a memory subsystem 114.
The processor 100 is an electronic device that functions as a central processing unit (CPU) on an integrated circuit. The processor 100 receives digital data as input, processes the data according to instructions fetched from a memory, and generates results of operations prescribed by the instructions as output. The processor 100 may be employed in a desktop, mobile, or tablet computer, and for uses such as computation, text editing, multimedia display, and Internet browsing. The processor 100 may also be disposed in embedded systems to control a wide variety of devices including appliances, mobile telephones, smart phones, automobiles, and industrial control devices. A CPU is the electronic circuitry (i.e., "hardware") that executes the instructions of a computer program (also known as a "computer application" or "application") by performing operations on data that include arithmetic operations, logical operations, and input/output operations. An integrated circuit (IC) is a set of electronic circuits fabricated on a small piece of semiconductor material, typically silicon. An IC is also referred to as a chip, a microchip, or a die.
The instruction fetch unit 101 controls the fetching of architectural instructions 103 from system memory (not shown) into the instruction cache 102. The instruction fetch unit 101 provides a fetch address to the instruction cache 102 that specifies the memory address at which the processor 100 fetches a cache line of architectural instruction bytes into the instruction cache 102. The fetch address is based on the current value of the instruction pointer (not shown), or program counter, of the processor 100. Normally, the program counter is incremented sequentially by the size of an instruction unless a control instruction is encountered in the instruction stream, such as a branch, call, or return instruction, or an exception condition occurs, such as an interrupt, trap, exception, or fault, in which case the program counter is updated with a non-sequential address, such as a branch target address, return address, or exception vector. Generally speaking, the program counter is updated in response to the execution of instructions by the execution units 112/121. The program counter may also be updated in response to the detection of an exception condition, such as the instruction translator 104 encountering an instruction 103 that is not defined by the instruction set architecture of the processor 100.
The instruction cache 102 caches the architectural instructions 103 fetched from a system memory coupled to the processor 100. The architectural instructions 103 include a move to neural network (MTNN) instruction and a move from neural network (MFNN) instruction, which are described in more detail below. In one embodiment, the architectural instructions 103 are instructions of the x86 instruction set architecture (ISA), with the addition of the MTNN and MFNN instructions. In the present context, an x86 ISA processor is a processor that generates the same results at the instruction set architecture level as a reference x86 processor when executing the same machine language instructions. However, other embodiments contemplate other instruction set architectures, such as Advanced RISC Machines (ARM), SUN SPARC, or others. The instruction cache 102 provides the architectural instructions 103 to the instruction translator 104, which translates the architectural instructions 103 into microinstructions 105.
The microinstructions 105 are provided to the rename unit 106 and eventually executed by the execution units 112/121. The microinstructions 105 implement the architectural instructions. Preferably, the instruction translator 104 includes a first portion that translates frequently executed and/or relatively less complex architectural instructions 103 into microinstructions 105. The instruction translator 104 also includes a second portion that includes a microcode unit (not shown). The microcode unit includes a microcode memory that holds microcode instructions that implement complex and/or infrequently used instructions of the architectural instruction set. The microcode unit also includes a microsequencer that provides a non-architectural micro-program counter (micro-PC) to the microcode memory. Preferably, the microcode instructions are translated into the microinstructions 105 by a microtranslator (not shown). A selector selects the microinstructions 105 from either the first portion or the second portion for provision to the rename unit 106, depending upon whether or not the microcode unit currently has control.
The rename unit 106 renames the architectural registers specified in the architectural instructions 103 to physical registers of the processor 100. Preferably, the processor 100 includes a reorder buffer (not shown). The rename unit 106 allocates, in program order, an entry in the reorder buffer for each microinstruction 105. This enables the processor 100 to retire the microinstructions 105, and their corresponding architectural instructions 103, in program order. In one embodiment, the media registers 118 have a 256-bit width and the GPR 116 have a 64-bit width. In one embodiment, the media registers 118 are x86 media registers, such as Advanced Vector Extensions (AVX) registers.
In one embodiment, each entry in the reorder buffer includes storage for the result of the microinstruction 105; additionally, the processor 100 includes an architectural register file that includes a physical register for each of the architectural registers (e.g., the media registers 118, the GPR 116, and other architectural registers). (Preferably, there are separate register files for the media registers 118 and the GPR 116, for example, since they are of different sizes.) For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field in the microinstruction 105 with the reorder buffer index of the newest older microinstruction 105 that writes to the architectural register. When the execution unit 112/121 completes execution of the microinstruction 105, it writes the result to the microinstruction's 105 reorder buffer entry. When the microinstruction 105 retires, a retire unit (not shown) writes the result from the microinstruction's reorder buffer entry to the register of the physical register file associated with the architectural destination register specified by the retiring microinstruction 105.
In another embodiment, the processor 100 includes a physical register file that includes more physical registers than the number of architectural registers, but the processor 100 does not include an architectural register file, and the reorder buffer entries do not include result storage. (Preferably, there are separate physical register files for the media registers 118 and the GPR 116, for example, since they are of different sizes.) The processor 100 also includes a pointer table with an associated pointer for each architectural register. For the destination operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the destination operand field in the microinstruction 105 with a pointer to a free register in the physical register file. If no register is free in the physical register file, the rename unit 106 stalls the pipeline. For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field in the microinstruction 105 with a pointer to the register in the physical register file assigned to the newest older microinstruction 105 that writes to the architectural register. When the execution unit 112/121 completes execution of the microinstruction 105, it writes the result to the register of the physical register file pointed to by the microinstruction's 105 destination operand field. When the microinstruction 105 retires, the retire unit copies the microinstruction's 105 destination operand field value to the pointer in the pointer table associated with the architectural destination register specified by the retiring microinstruction 105.
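The physical-register-file scheme above can be illustrated with a toy model. For brevity it collapses the speculative rename map and the retire-time pointer table into a single map, and the register names, table sizes, and instruction shape are invented for illustration:

```python
# Toy model of the physical-register-file renaming scheme described above.
# One map stands in for both the rename map and the pointer table; sizes
# and names are illustrative, not the patent's structures.
free_list = list(range(4, 8))          # free physical registers
pointer_table = {"rax": 0, "rbx": 1}   # architectural reg -> physical reg

def rename(dest, srcs):
    """Rename one microinstruction: each source reads the pointer of the
    newest older writer; the destination gets a free physical register."""
    src_ptrs = [pointer_table[s] for s in srcs]
    if not free_list:
        raise RuntimeError("stall the pipeline")   # no free register
    dest_ptr = free_list.pop(0)
    pointer_table[dest] = dest_ptr                 # later readers see this
    return dest_ptr, src_ptrs

dest_ptr, src_ptrs = rename("rax", ["rax", "rbx"])
assert src_ptrs == [0, 1]       # sources read the pre-rename pointers
assert dest_ptr == 4            # first free physical register
assert pointer_table["rax"] == 4
```

Note that a source that names the same architectural register as the destination (here `rax`) reads the old pointer, because sources are resolved before the destination is assigned.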
The reservation stations 108 hold microinstructions 105 until they are ready to be issued to an execution unit 112/121 for execution. A microinstruction 105 is ready to be issued when all of its source operands are available and an execution unit 112/121 is available to execute it. The execution units 112/121 receive register source operands from the reorder buffer or the architectural register file in the first embodiment, or from the physical register file in the second embodiment described above. Additionally, the execution units 112/121 may receive register source operands directly from the execution units 112/121 via result forwarding buses (not shown). Additionally, the execution units 112/121 may receive from the reservation stations 108 immediate operands specified by the microinstructions 105. As described in more detail below, the MTNN and MFNN architectural instructions 103 include an immediate operand that specifies a function to be performed by the NNU 121, which is provided in one of the one or more microinstructions 105 into which the MTNN and MFNN architectural instructions 103 are translated.
The execution units 112 include one or more load/store units (not shown) that load data from the memory subsystem 114 and store data to the memory subsystem 114. Preferably, the memory subsystem 114 includes a memory management unit (not shown), which may include, e.g., translation lookaside buffers and a tablewalk unit, a level-1 data cache (in addition to the instruction cache 102), a level-2 unified cache, and a bus interface unit that interfaces the processor 100 to system memory. In one embodiment, the processor 100 of Fig. 1 is representative of a processing core that is one of multiple processing cores in a multi-core processor that share a last-level cache memory. The execution units 112 may also include integer units, media units, floating-point units, and a branch unit.
The NNU 121 includes a weight random access memory (RAM) 124, a data RAM 122, N neural processing units (NPUs) 126, a program memory 129, a sequencer 128, and control and status registers (CSRs) 127. The NPUs 126 function conceptually as neurons in a neural network. The weight RAM 124, the data RAM 122, and the program memory 129 are each writable and readable via the MTNN and MFNN architectural instructions 103, respectively. The weight RAM 124 is arranged as W rows of N weight words, and the data RAM 122 is arranged as D rows of N data words. Each data word and each weight word is a plurality of bits, preferably 8 bits, 9 bits, 12 bits, or 16 bits. Each data word functions as the output value (also sometimes referred to as an activation) of a neuron of the previous layer in the network, and each weight word functions as a weight associated with a connection coming into a neuron of the current layer of the network. Although in many uses of the NNU 121 the words, or operands, held in the weight RAM 124 are in fact weights associated with a connection into a neuron, it should be understood that in other uses of the NNU 121 the words held in the weight RAM 124 are not weights, but are nevertheless referred to as "weight words" because they are stored in the weight RAM 124. For example, in some uses of the NNU 121, such as the convolution example of Figures 24 through 26A or the pooling example of Figures 27 through 28, the weight RAM 124 may hold non-weights, such as elements of a data matrix (e.g., image pixel data). Similarly, although in many uses of the NNU 121 the words, or operands, held in the data RAM 122 are in fact the output values, or activations, of neurons, it should be understood that in other uses of the NNU 121 the words held in the data RAM 122 are not, but are nevertheless referred to as "data words" because they are stored in the data RAM 122. For example, in some uses of the NNU 121, such as the convolution example of Figures 24 through 26A, the data RAM 122 may hold non-neuron outputs, such as elements of a convolution kernel.
In one embodiment, the NPUs 126 and the sequencer 128 comprise combinatorial logic, sequential logic, state machines, or a combination thereof. An architectural instruction (e.g., a MFNN instruction 1500) loads the contents of the status register 127 into one of the GPR 116 to determine the status of the NNU 121, e.g., that the NNU 121 has completed a command or has completed running a program from the program memory 129, or that the NNU 121 is free to receive a new command or to start a new NNU program.
Advantageously, the number of NPUs 126 may be increased as needed, and the size of the weight RAM 124 and the data RAM 122 may be extended in both width and depth accordingly. Preferably, the weight RAM 124 is larger since, in a typical neural network layer, there are many connections, and therefore many weights, associated with each neuron. Various embodiments are described herein regarding the size of the data and weight words, the sizes of the weight RAM 124 and the data RAM 122, and the number of NPUs 126. In one embodiment, a NNU 121 with a 64 KB (8192 bits × 64 rows) data RAM 122, a 2 MB (8192 bits × 2048 rows) weight RAM 124, and 512 NPUs 126 is implemented in a Taiwan Semiconductor Manufacturing Company, Limited (TSMC) 16 nm process and occupies approximately a 3.3 mm² area.
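The example sizing arithmetic can be checked directly, assuming 8192-bit rows as in the embodiment above:

```python
# Verifying the example RAM sizes above: bits per row x rows / 8 = bytes.
ROW_BITS = 8192

data_ram_bytes = ROW_BITS * 64 // 8      # 64 rows
weight_ram_bytes = ROW_BITS * 2048 // 8  # 2048 rows

assert data_ram_bytes == 64 * 1024            # 64 KB data RAM
assert weight_ram_bytes == 2 * 1024 * 1024    # 2 MB weight RAM
```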
The sequencer 128 fetches instructions from the program memory 129 and executes them, which includes generating address and control signals to provide to the data RAM 122, the weight RAM 124, and the NPUs 126. The sequencer 128 generates a memory address 123 and a read command to provide to the data RAM 122 to select one of the D rows of N data words for provision to the N NPUs 126. The sequencer 128 also generates a memory address 125 and a read command to provide to the weight RAM 124 to select one of the W rows of N weight words for provision to the N NPUs 126. The sequence of the addresses 123 and 125 generated by the sequencer 128 for provision to the NPUs 126 determines the "connections" between the neurons. The sequencer 128 also generates a memory address 123 and a write command to provide to the data RAM 122 to select one of the D rows of N data words for writing by the N NPUs 126. The sequencer 128 also generates a memory address 125 and a write command to provide to the weight RAM 124 to select one of the W rows of N weight words for writing by the N NPUs 126. The sequencer 128 also generates a memory address 131 to the program memory 129 to select a NNU instruction that is provided to the sequencer 128, as described below. The memory address 131 corresponds to a program counter (not shown) that the sequencer 128 generally increments through sequential locations of the program memory 129, unless the sequencer 128 encounters a control instruction, such as a loop instruction (see Figure 26A, for example), in which case the sequencer 128 updates the program counter to the target address of the control instruction. The sequencer 128 also generates control signals to the NPUs 126 to instruct them to perform various operations or functions, such as initialization, arithmetic/logical operations, rotate and shift operations, activation functions, and write back operations, examples of which are described in more detail below (see, for example, the micro-operations 3418 of Figure 34).
The N NPUs 126 generate N result words 133 that may be written back to a row of the weight RAM 124 or of the data RAM 122. Preferably, the weight RAM 124 and the data RAM 122 are directly coupled to the N NPUs 126. More specifically, the weight RAM 124 and the data RAM 122 are dedicated to the NPUs 126 and are not shared by the other execution units 112 of the processor 100, and the NPUs 126 are capable of consuming a row from one or both of the weight RAM 124 and the data RAM 122 on each clock cycle in a sustained manner, preferably in a pipelined fashion. In one embodiment, the data RAM 122 and the weight RAM 124 are each capable of providing 8192 bits to the NPUs 126 on each clock cycle. The 8192 bits may be consumed as 512 16-bit words or as 1024 8-bit words, as described in more detail below.
Advantageously, the size of the data set that may be processed by the NNU 121 is not limited by the sizes of the weight RAM 124 and the data RAM 122, but only by the size of system memory, since data and weights may be moved between system memory and the weight RAM 124 and the data RAM 122 using the MTNN and MFNN instructions (e.g., through the media registers 118). In one embodiment, the data RAM 122 is dual-ported to enable data words to be written to the data RAM 122 concurrently with data words being read from, or written to, the data RAM 122. Furthermore, the large memory hierarchy of the memory subsystem 114, including the cache memories, provides very high data bandwidth for transfers between system memory and the NNU 121. Still further, preferably the memory subsystem 114 includes hardware data prefetchers that track memory access patterns, such as loads of neural data and weights from system memory, and perform data prefetching into the cache hierarchy to facilitate high-bandwidth, low-latency transfers to the weight RAM 124 and the data RAM 122.
Although embodiments are described in which one of the operands provided to each NPU 126 is supplied from a weight memory and is referred to as a weight, which term is commonly used in neural networks, it should be understood that the operand may be other types of data associated with a calculation whose speed may be improved by the apparatus described.
Referring now to Figure 2, a block diagram is shown illustrating an NPU 126 of Figure 1. The NPU 126 operates to perform many functions, or operations. In particular, advantageously, the NPU 126 is configured to operate as a neuron, or node, in an artificial neural network to perform a classic multiply-accumulate function, or operation. That is, generally speaking, the NPU 126 (neuron) is configured to: (1) receive an input value from each neuron having a connection to it, typically, but not necessarily, from the immediately preceding layer of the artificial neural network; (2) multiply each input value by its respective weight value associated with the connection to generate a product; (3) add all the products to generate a sum; and (4) perform an activation function on the sum to generate the output of the neuron. However, rather than performing all the multiplies associated with all the connection inputs and then adding all the products together, as is conventional, each neuron is advantageously configured to perform, in a given clock cycle, the multiply operation associated with one of the connection inputs and then to add (accumulate) the product to the accumulated value of the products associated with the connection inputs processed in clock cycles up to that point. Assuming there are M connections to the neuron, after all M products have been accumulated, which takes approximately M clock cycles, the neuron performs the activation function on the accumulated value to generate the output, or result. This has the advantage that the neuron requires fewer multipliers, and a smaller, simpler and faster adder circuit (e.g., a 2-input adder), than an adder that would be required to add all, or even a subset of, the products associated with all the connection inputs. This, in turn, has the advantage of facilitating a very large number (N) of neurons (NPUs 126) in the NNU 121, such that after approximately M clock cycles, the NNU 121 has generated the outputs of all of this large number (N) of neurons. Finally, an NNU 121 constructed of such neurons has the advantage of performing efficiently as an artificial neural network layer for a wide variety of connection input counts. That is, as M increases or decreases for different layers, the number of clock cycles required to generate the neuron outputs correspondingly increases or decreases, and the resources (e.g., the multipliers and accumulators) remain fully utilized; whereas in a more conventional design, some of the multipliers and a portion of the adder would go unutilized for smaller values of M. Thus, the embodiments described herein have the benefit of flexibility and efficiency with respect to the number of connection inputs to the neurons of the NNU 121, and provide extremely high performance.
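The one-product-per-clock accumulation scheme described above can be sketched as a behavioral model. The Python below is purely illustrative (the patent describes hardware, not software); the function and variable names are assumptions introduced for this sketch:

```python
def neuron_output(inputs, weights, activation):
    """Behavioral model of one NPU: one multiply-accumulate per clock.

    Over M clock cycles, a single multiplier and a 2-input adder
    accumulate one product per cycle into the accumulator; the
    activation function is then applied to the accumulated sum.
    """
    assert len(inputs) == len(weights)
    acc = 0                                # accumulator, cleared at start
    for x, w in zip(inputs, weights):      # one iteration per clock cycle
        acc += x * w                       # single multiply, 2-input add
    return activation(acc)                 # activation function unit

# With M = 4 connection inputs, the output is ready after ~4 "clocks".
out = neuron_output([1, 2, 3, 4], [10, 20, 30, 40], activation=lambda s: s)
# 1*10 + 2*20 + 3*30 + 4*40 = 300
```

The loop makes explicit why a 2-input adder suffices: each cycle adds only the new product to the running total, rather than summing all M products at once.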
The NPU 126 includes a register 205, a 2-input multiplexed register (mux-reg) 208, an arithmetic logic unit (ALU) 204, an accumulator 202, and an activation function unit (AFU) 212. The register 205 receives a weight word 206 from the weight RAM 124 and provides its output 203 on a subsequent clock cycle. The mux-reg 208 selects one of its inputs 207 or 211 to store in its register and then provides it on its output 209 on a subsequent clock cycle. One input 207 receives a data word from the data RAM 122. The other input 211 receives the output 209 of the adjacent NPU 126. The NPU 126 shown in Figure 2 is denoted NPU J from among the N NPUs 126 of Figure 1. That is, NPU J is a representative instance of the N NPUs 126. Preferably, the input 211 of the mux-reg 208 of NPU J receives the mux-reg 208 output 209 of NPU instance J-1, and the mux-reg 208 output 209 of NPU J is provided to the mux-reg 208 input 211 of NPU instance J+1. In this way, the mux-regs 208 of the N NPUs 126 collectively operate as an N-word rotater, or circular shifter, as described in more detail below with respect to Figure 3. A control input 213 controls which of the two inputs the mux-reg 208 selects to store in its register and subsequently provide on the output 209.
The ALU 204 has three inputs. One input receives the weight word 203 from the register 205. Another input receives the output 209 of the mux-reg 208. The third input receives the output 217 of the accumulator 202. The ALU 204 performs arithmetic and/or logical operations on its inputs to generate a result provided on its output. Preferably, the arithmetic and/or logical operations performed by the ALU 204 are specified by an instruction stored in the program memory 129. For example, the multiply-accumulate instruction of Figure 4 specifies a multiply-accumulate operation, i.e., the result 215 is the sum of the accumulator 202 value 217 and the product of the weight word 203 and the data word on the mux-reg 208 output 209. Other operations that may be specified include, but are not limited to: the result 215 is the passed-through value of the mux-reg output 209; the result 215 is the passed-through value of the weight word 203; the result 215 is zero; the result 215 is the sum of the accumulator 202 value 217 and the weight word 203; the result 215 is the sum of the accumulator 202 value 217 and the mux-reg output 209; the result 215 is the maximum of the accumulator 202 value 217 and the weight word 203; and the result 215 is the maximum of the accumulator 202 value 217 and the mux-reg output 209.
The ALU 204 provides its output 215 to the accumulator 202 for storage therein. The ALU 204 includes a multiplier 242 that multiplies the weight word 203 and the data word on the mux-reg 208 output 209 to generate a product 246. In one embodiment, the multiplier 242 multiplies two 16-bit operands to generate a 32-bit result. The ALU 204 also includes an adder 244 that adds the product 246 to the accumulator 202 output 217 to generate a sum, which is the result 215 accumulated into the accumulator 202 for storage there. In one embodiment, the adder 244 adds the 32-bit result of the multiplier 242 to the 41-bit value 217 of the accumulator 202 to generate a 41-bit result. In this way, using the rotater aspect of the mux-regs 208 over the course of multiple clock cycles, the NPU 126 accomplishes the summation of products for a neuron as required by a neural network. The ALU 204 may also include other circuit elements to perform the other arithmetic/logical operations described above. In one embodiment, a second adder subtracts the weight word 203 from the data word on the mux-reg 208 output 209 to generate a difference, which the adder 244 then adds to the accumulator 202 output 217 to generate the sum 215, which is the result accumulated in the accumulator 202. In this way, over the course of multiple clock cycles, the NPU 126 may accomplish a summation of differences. Preferably, although the weight word 203 and the data word 209 are the same size (in bits), they may have different binary point locations, as described in more detail below. Preferably, the multiplier 242 and the adder 244 are integer multipliers and adders, as described in more detail below, which advantageously yields an ALU 204 that is less complex, smaller, faster and lower power than a floating-point multiplier and adder. However, it should be understood that in other embodiments the ALU 204 performs floating-point operations.
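The stated bit widths can be modeled in a few lines. This is a hedged software sketch of the datapath widths given above (16 x 16 -> 32-bit product, 41-bit accumulator); the unsigned masking is an assumption made for illustration only, and the names are not from the patent:

```python
MASK41 = (1 << 41) - 1   # 41-bit accumulator: 32 product bits + 9 bits
                         # of headroom, since log2(512) = 9 extra bits
                         # are needed to sum up to 512 32-bit products.

def mac_step(acc41, data16, weight16):
    """One multiply-accumulate step with the widths given in the text."""
    product32 = (data16 * weight16) & 0xFFFFFFFF   # 16x16 -> 32-bit multiply
    return (acc41 + product32) & MASK41            # add into 41-bit accumulator

acc = 0
for d, w in [(3, 5), (7, 11)]:
    acc = mac_step(acc, d, w)
# acc == 3*5 + 7*11 == 92
```

The headroom arithmetic is the point of the sketch: 512 products of 32 bits each can total at most 512 x (2^32 - 1), which fits in 32 + 9 = 41 bits, matching the accumulator width described.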
Although Figure 2 shows only the multiplier 242 and the adder 244 within the ALU 204, preferably the ALU 204 includes other elements to perform the other operations described above. For example, preferably the ALU 204 includes a comparator (not shown) for comparing the accumulator 202 with a data/weight word, and a multiplexer (not shown) that selects the larger (maximum) of the two values indicated by the comparator for storage in the accumulator 202. As another example, preferably the ALU 204 includes selection logic (not shown) that causes the data/weight word to bypass the multiplier 242, enabling the adder 244 to add the data/weight word to the accumulator 202 value 217 to generate a sum for storage in the accumulator 202. These additional operations are described in more detail below, for example with respect to Figures 18 through 29A, and may be useful for performing, among other things, convolution and pooling operations.
The AFU 212 receives the output 217 of the accumulator 202. The AFU 212 performs an activation function on the accumulator 202 output 217 to generate the result 133 of Figure 1. Generally speaking, the activation function in a neuron of an intermediate layer of an artificial neural network may serve to normalize the accumulated sum of products, preferably in a non-linear fashion. To "normalize" the accumulated sum, the activation function of the instant neuron generates a result value within a range that the neurons connected to the instant neuron expect to receive as input. (The normalized result is sometimes referred to as the "activation", which, as described herein, is the output of the instant node, which a receiving node multiplies by the weight associated with the connection between the outputting node and the receiving node to generate a product that is accumulated with the other products associated with the other input connections to the receiving node.) For example, the receiving/connected neurons may expect to receive as input a value between 0 and 1, in which case the outputting neuron may need to non-linearly squash and/or adjust (e.g., shift upward to transform negative values to positive values) an accumulated sum that falls outside the 0-to-1 range into a value within the expected range. Thus, the AFU 212 performs an operation on the accumulator 202 value 217 to bring the result 133 within a known range. The results 133 of all N NPUs 126 may be written back concurrently to either the data RAM 122 or the weight RAM 124. Preferably, the AFU 212 is configured to perform multiple activation functions, and an input, for example from the control register 127, selects one of the activation functions to perform on the accumulator 202 output 217. The activation functions may include, but are not limited to, a step function, a rectify function, a sigmoid function, a hyperbolic tangent (tanh) function, and a softplus function (also referred to as smooth rectify). The softplus function is the analytic function f(x) = ln(1 + e^x), i.e., the natural logarithm of the sum of one and e^x, where "e" is Euler's number and x is the input 217 to the function. Preferably, as described in more detail below, the activation functions may also include a pass-through function that passes through the accumulator 202 value 217, or a portion thereof. In one embodiment, circuitry of the AFU 212 performs the activation function in a single clock cycle. In one embodiment, the AFU 212 comprises tables that receive the accumulated value and, for some of the activation functions (e.g., sigmoid, hyperbolic tangent, softplus), output a value that closely approximates the value the true activation function would provide.
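For reference, the activation functions named above have simple closed forms. The following Python definitions are illustrative software models only (the AFU 212 may approximate these with lookup tables in hardware):

```python
import math

def rectify(x):
    """Rectify (ReLU): clamps negative sums to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Sigmoid: squashes any accumulated sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    """Softplus (smooth rectify): ln(1 + e^x), per the definition above."""
    return math.log1p(math.exp(x))

# tanh is available directly as math.tanh and squashes into (-1, 1).
```

Each function maps an unbounded accumulator value into the bounded range that downstream neurons expect as input, which is the "normalization" role described above.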
Preferably, the width (in bits) of the accumulator 202 is greater than the width of the AFU 212 output 133. For example, in one embodiment, the accumulator is 41 bits wide, to avoid loss of precision in the accumulation of up to 512 32-bit products (as described in more detail below, e.g., with respect to Figure 30), and the result 133 is 16 bits wide. In one embodiment, an example of which is described in more detail below with respect to Figure 8, different portions of the "raw" accumulator 202 output 217 value are written back by the AFU 212 to the data RAM 122 or the weight RAM 124 on successive clock cycles. This enables the raw accumulator 202 values to be loaded back into the media registers 118 via MFNN instructions, so that instructions executing on the other execution units 112 of the processor 100 may perform complex activation functions that the AFU 212 is not capable of performing, such as the well-known softmax activation function, also referred to as the normalized exponential function. In one embodiment, the instruction set architecture of the processor 100 includes an instruction, commonly referred to as e^x or exp(x), that performs the exponential function, which may be used to speed up the performance of the softmax activation function by the other execution units 112 of the processor 100.
In one embodiment, the NPU 126 is pipelined. For example, the NPU 126 may include registers of the ALU 204, such as a register between the multiplier and the adder and/or other circuits of the ALU 204, and a register that holds the output of the AFU 212, among others. Other embodiments of the NPU 126 are described below.
Referring now to Figure 3, a block diagram is shown illustrating an embodiment of the arrangement of the N mux-regs 208 of the N NPUs 126 of the NNU 121 of Figure 1, to illustrate their operation as an N-word rotater, or circular shifter, for a row of data words 207 received from the data RAM 122 of Figure 1. In the embodiment of Figure 3, N is 512, such that the NNU 121 has 512 mux-regs 208, denoted 0 through 511, corresponding to the 512 NPUs 126, as shown. Each mux-reg 208 receives its respective data word 207 of one row of the D rows of the data RAM 122. That is, mux-reg 0 receives data word 0 of the data RAM 122 row, mux-reg 1 receives data word 1 of the data RAM 122 row, mux-reg 2 receives data word 2 of the data RAM 122 row, and so forth to mux-reg 511, which receives data word 511 of the data RAM 122 row. Additionally, mux-reg 1 receives on its other input 211 the output 209 of mux-reg 0, mux-reg 2 receives on its other input 211 the output 209 of mux-reg 1, mux-reg 3 receives on its other input 211 the output 209 of mux-reg 2, and so forth to mux-reg 511, which receives on its other input 211 the output 209 of mux-reg 510; and mux-reg 0 receives on its other input 211 the output 209 of mux-reg 511. Each of the mux-regs 208 receives the control input 213 that controls whether it selects the data word 207 or the rotated input 211. As described in more detail below, in one mode of operation, on a first clock cycle, the control input 213 controls each of the mux-regs 208 to select the data word 207 for storage in the register and subsequent provision to the ALU 204; and on subsequent clock cycles (e.g., the M-1 clock cycles described above), the control input 213 controls each of the mux-regs 208 to select the rotated input 211 for storage in the register and subsequent provision to the ALU 204.
Although the embodiments described with respect to Figure 3 (and Figures 7 and 19 below) configure the NPUs 126 to rotate the mux-reg 208/705 values to the right, i.e., from NPU J to NPU J+1, embodiments are contemplated (such as the embodiments of Figures 24 through 26) in which the NPUs 126 are configured to rotate the mux-reg 208/705 values to the left, i.e., from NPU J to NPU J-1. Furthermore, embodiments are contemplated in which the NPUs 126 are configured to rotate the mux-reg 208/705 values selectively to the left or to the right, for example as specified by the NNU instructions.
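One clock of the collective rotater formed by the mux-regs can be sketched in Python. This is an illustrative model of the rightward-rotation wiring described above (NPU J passes its word to NPU J+1, with NPU 511 wrapping around to NPU 0); the function name is an assumption:

```python
def rotate_right(words):
    """One clock of the N-word rotater formed by the mux-regs 208:
    after the rotate, position J holds the word previously at J-1,
    and position 0 holds the word previously at position N-1."""
    return [words[-1]] + words[:-1]

row = [0, 1, 2, 3]        # data words loaded from one data RAM row (N = 4)
row = rotate_right(row)   # -> [3, 0, 1, 2]
```

Applying the function N times returns the row to its original order, which is why, after the initial load plus N-1 rotations, every NPU has seen every data word of the row exactly once.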
Referring now to Figure 4, a table is shown illustrating a program for storage in the program memory 129 of Figure 1 and execution by the NNU 121. The exemplary program performs the calculations associated with a layer of an artificial neural network, as described above. The table of Figure 4 shows five rows and three columns. Each row corresponds to an address in the program memory 129 denoted in the first column. The second column specifies the instruction, and the third column indicates the number of clock cycles associated with the instruction. Preferably, the clock cycle count indicates the effective clocks-per-instruction value in a pipelined embodiment, rather than the latency of the instruction. As shown, due to the pipelined nature of the NNU 121, each instruction has an associated one clock cycle, with the exception of the instruction at address 2, which effectively repeats itself 511 times and therefore requires 511 clocks, as described in more detail below.
For each instruction of the program, all of the NPUs 126 process the instruction in parallel. That is, all N NPUs 126 execute the instruction in the first row in the same clock cycle(s), all N NPUs 126 execute the instruction in the second row in the same clock cycle(s), and so forth. However, other embodiments are described below in which some of the instructions are executed in a partially parallel and partially sequential fashion; for example, the activation function and output instructions at addresses 3 and 4 execute in this fashion in an embodiment in which NPUs 126 share an activation function unit, such as the embodiment of Figure 11. The example of Figure 4 assumes a layer of 512 neurons (NPUs 126), each with 512 connection inputs from a previous layer of 512 neurons, for a total of 256K connections. Each neuron receives a 16-bit data value from each connection input and multiplies the 16-bit data value by an appropriate 16-bit weight value.
The first row, at address 0 (although other addresses may be specified), specifies an initialize NPU instruction. The initialize instruction clears the accumulator 202 value to zero. In one embodiment, the initialize instruction can also specify that the accumulator 202 be loaded with the corresponding word of a row of the data RAM 122 or the weight RAM 124 whose address is specified by the instruction. The initialize instruction also loads configuration values into the control register 127, as described in more detail below with respect to Figures 29A and 29B. For example, the widths of the data word 207 and the weight word 209 may be loaded, which may be used by the ALU 204 to determine the sizes of the operations performed by its circuitry and may affect the result 215 stored in the accumulator 202. In one embodiment, the NPU 126 includes circuitry that saturates the ALU 204 output 215 before it is stored in the accumulator 202, and the initialize instruction loads a configuration value into the circuitry that affects the saturation. In one embodiment, the accumulator 202 may also be cleared to a zero value by so specifying in an ALU function instruction (e.g., the multiply-accumulate instruction at address 1) or in an output instruction, such as the write-AFU-output instruction at address 4.
The second row, at address 1, specifies a multiply-accumulate instruction that instructs the 512 NPUs 126 to load their respective data word from a row of the data RAM 122 and their respective weight word from a row of the weight RAM 124, and to perform a first multiply-accumulate operation on the data word input 207 and the weight word input 206, which is accumulated with the initialized accumulator 202 value of zero. More specifically, the instruction instructs the sequencer 128 to generate a value on the control input 213 to select the data word input 207. In the example of Figure 4, the specified data RAM 122 row is row 17, and the specified weight RAM 124 row is row 0, which instructs the sequencer 128 to output a data RAM address 123 value of 17 and a weight RAM address 125 value of 0. Consequently, the 512 data words of row 17 of the data RAM 122 are provided to the respective data inputs 207 of the 512 NPUs 126, and the 512 weight words of row 0 of the weight RAM 124 are provided to the respective weight inputs 206 of the 512 NPUs 126.
The third row, at address 2, specifies a multiply-accumulate rotate instruction with a count of 511, which instructs each of the 512 NPUs 126 to perform 511 multiply-accumulate operations. The instruction instructs the 512 NPUs 126 that, for each of the 511 multiply-accumulate operations, the data word 209 input to the ALU 204 is the rotated value 211 from the adjacent NPU 126. That is, the instruction instructs the sequencer 128 to generate a value on the control input 213 to select the rotated value 211. Additionally, the instruction instructs the 512 NPUs 126 to load the respective weight value for each of the 511 multiply-accumulate operations from the "next" row of the weight RAM 124. That is, the instruction instructs the sequencer 128 to increment the weight RAM address 125 by one relative to its value in the previous clock cycle, which in this example is row 1 on the first clock cycle of the instruction, row 2 on the next clock cycle, row 3 on the next, and so forth to row 511 on the 511th clock cycle. For each of the 511 multiply-accumulate operations, the product of the rotated input 211 and the weight word input 206 is accumulated with the previous value of the accumulator 202. The 512 NPUs 126 perform the 511 multiply-accumulate operations in 511 clock cycles, in which each NPU 126 performs a multiply-accumulate operation on a different data word of row 17 of the data RAM 122, namely the data word upon which its adjacent NPU 126 operated in the previous cycle, together with the different weight word associated with that data word, which is conceptually a different connection input to the neuron. In this example, it is assumed that the number of connection inputs to each NPU 126 (neuron) is 512, thus involving 512 data words and 512 weight words. Once the last iteration of the multiply-accumulate rotate instruction at row 2 has been performed, the accumulator 202 contains the sum of the products of all 512 connection inputs. In one embodiment, rather than having a separate instruction for each type of ALU operation (e.g., multiply-accumulate, maximum of accumulator and weight word as described above), the instruction set of the NPUs 126 includes an "execute" instruction that instructs the ALU 204 to perform the ALU operation specified by the initialize NPU instruction, such as specified in the ALU function 2926 of Figure 29A.
The fourth row, at address 3, specifies an activation function instruction. The activation function instruction instructs the AFU 212 to perform the specified activation function on the accumulator 202 value 217 to generate the result 133. The activation functions according to one embodiment are described in more detail below.
The fifth row, at address 4, specifies a write-AFU-output instruction that instructs the 512 NPUs 126 to write back their AFU 212 output 133 values as results to a row of the data RAM 122, which is row 16 in this example. That is, the instruction instructs the sequencer 128 to output a data RAM address 123 value of 16 and a write command (in contrast to the read command in the case of the multiply-accumulate instruction at address 1). Preferably, given the pipelined nature of the NNU 121, the execution of the write-AFU-output instruction may be overlapped with the execution of other instructions, such that the write-AFU-output instruction effectively executes in a single clock cycle.
Preferably, each NPU 126 is configured as a pipeline that includes the various functional elements, e.g., the mux-reg 208 (and the mux-reg 705 of Figure 7), the ALU 204, the accumulator 202, the AFU 212, the mux 802 (of Figure 8), the row buffer 1104 and the AFUs 1112 (of Figure 11), and so forth, some of which may themselves be pipelined. In addition to the data words 207 and the weight words 206, the pipeline receives the instructions from the program memory 129. The instructions flow down the pipeline and control the various functional units. In an alternate embodiment, the activation function instruction is not included in the program. Rather, the initialize NPU instruction specifies the activation function to be performed on the accumulator 202 value 217, and a value indicating the specified activation function is saved in a configuration register for later use by the AFU 212 portion of the pipeline once the final accumulator 202 value 217 has been generated, i.e., once the last iteration of the multiply-accumulate rotate instruction at address 2 has completed. Preferably, for power saving purposes, the AFU 212 portion of the pipeline is inactive until the write-AFU-output instruction reaches it, at which time the AFU 212 is powered up and performs the activation function specified by the initialize instruction on the accumulator 202 output 217.
Referring now to Figure 5, a timing diagram is shown illustrating the execution of the program of Figure 4 by the NNU 121. Each row of the timing diagram corresponds to a successive clock cycle indicated in the first column. Each of the other columns corresponds to a different one of the 512 NPUs 126 and indicates its operation. For simplicity and clarity of illustration, only the operations of NPUs 0, 1 and 511 are shown.
At clock 0, each of the 512 NPUs 126 performs the initialization instruction of Figure 4, which is illustrated in Figure 5 by the assignment of a zero value to the accumulator 202.
At clock 1, each of the 512 NPUs 126 performs the multiply-accumulate instruction at address 1 of Figure 4. As shown, NPU 0 accumulates the accumulator 202 value (i.e., zero) with the product of word 0 of row 17 of the data RAM 122 and word 0 of row 0 of the weight RAM 124; NPU 1 accumulates the accumulator 202 value (i.e., zero) with the product of word 1 of row 17 of the data RAM 122 and word 1 of row 0 of the weight RAM 124; and so forth to NPU 511, which accumulates the accumulator 202 value (i.e., zero) with the product of word 511 of row 17 of the data RAM 122 and word 511 of row 0 of the weight RAM 124.
At clock 2, each of the 512 NPUs 126 performs the first iteration of the multiply-accumulate rotate instruction at address 2 of Figure 4. As shown, NPU 0 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 511 (which was data word 511 received from the data RAM 122) and word 0 of row 1 of the weight RAM 124; NPU 1 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 0 (which was data word 0 received from the data RAM 122) and word 1 of row 1 of the weight RAM 124; and so forth to NPU 511, which accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 510 (which was data word 510 received from the data RAM 122) and word 511 of row 1 of the weight RAM 124.
At clock 3, each of the 512 NPUs 126 performs the second iteration of the multiply-accumulate rotate instruction at address 2 of Figure 4. As shown, NPU 0 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 511 (which was data word 510 received from the data RAM 122) and word 0 of row 2 of the weight RAM 124; NPU 1 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 0 (which was data word 511 received from the data RAM 122) and word 1 of row 2 of the weight RAM 124; and so forth to NPU 511, which accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 510 (which was data word 509 received from the data RAM 122) and word 511 of row 2 of the weight RAM 124. As indicated by the ellipsis of Figure 5, this continues for each of the following 509 clock cycles, through clock 512.
At clock 512, each of the 512 NPUs 126 performs the 511th iteration of the multiply-accumulate rotate instruction at address 2 of Figure 4. As shown, NPU 0 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 511 (which was data word 1 received from the data RAM 122) and word 0 of row 511 of the weight RAM 124; NPU 1 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 0 (which was data word 2 received from the data RAM 122) and word 1 of row 511 of the weight RAM 124; and so forth to NPU 511, which accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 510 (which was data word 0 received from the data RAM 122) and word 511 of row 511 of the weight RAM 124. In one embodiment, multiple clock cycles are required to read the data words and weight words from the data RAM 122 and the weight RAM 124 to perform the multiply-accumulate instruction at address 1 of Figure 4; however, the data RAM 122, the weight RAM 124 and the NPUs 126 are pipelined such that once the first multiply-accumulate operation has begun (e.g., as shown during clock 1 of Figure 5), subsequent multiply-accumulate operations begin on successive clock cycles (e.g., as shown during clocks 2 through 512). Preferably, the NPUs 126 may briefly stall in response to an access of the data RAM 122 and/or the weight RAM 124 by an architectural instruction, such as a MTNN or MFNN instruction (described below with respect to Figures 14 and 15), or by a microinstruction into which such an architectural instruction is translated.
At clock 513, the AFU 212 of each of the 512 NPUs 126 executes the activation function instruction at address 3 of Fig. 4. Finally, at clock 514, each of the 512 NPUs 126 executes the write AFU output instruction at address 4 of Fig. 4 by writing its result 133 back to its corresponding word in row 16 of the data RAM 122: that is, the result 133 of NPU 0 is written to word 0 of the data RAM 122, the result 133 of NPU 1 is written to word 1 of the data RAM 122, and so forth, until the result 133 of NPU 511 is written to word 511 of the data RAM 122. The operation described above with respect to Fig. 5 is also shown in block diagram form in Fig. 6A.
Referring now to Fig. 6A, a block diagram is shown illustrating the NNU 121 of Fig. 1 executing the program of Fig. 4. The NNU 121 includes the 512 NPUs 126, the data RAM 122 that receives its address input 123, and the weight RAM 124 that receives its address input 125. Although not shown, at clock 0 the 512 NPUs 126 execute the initialize instruction. As shown, at clock 1 the 512 16-bit data words of row 17 are read out of the data RAM 122 and provided to the 512 NPUs 126. During clocks 1 through 512, the 512 16-bit weight words of rows 0 through 511, respectively, are read out of the weight RAM 124 and provided to the 512 NPUs 126. Although not shown, at clock 1 the 512 NPUs 126 perform their respective multiply-accumulate operations on the loaded data words and weight words. During clocks 2 through 512, the multiplexing registers 208 of the 512 NPUs 126 operate as a 512 16-bit word rotator to rotate the previously loaded data words of row 17 of the data RAM 122 to the adjacent NPU 126, and the NPUs 126 perform a multiply-accumulate operation on each rotated data word and the corresponding weight word loaded from the weight RAM 124. Although not shown, at clock 513 the 512 AFUs 212 execute the activation instruction. At clock 514, the 512 NPUs 126 write their corresponding 512 16-bit results 133 back to row 16 of the data RAM 122.
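The rotating multiply-accumulate sequence of clocks 1 through 512 can be sketched as a small software model. The sketch below is illustrative only: it uses N = 8 instead of 512 purely for readability, and the array names are not part of the patent. At each clock, NPU j multiplies the data word currently in its multiplexing register (rotated in from NPU j-1) by word j of the current weight RAM row and accumulates the product.

```python
# Small-scale model of the rotating multiply-accumulate of Figs. 4-6A.
# N = 8 stands in for the 512 NPUs; indices follow the text: after k rotations,
# NPU j holds data word (j - k) mod N, and weight row k is consumed at clock 1 + k.

N = 8
data_row = list(range(100, 100 + N))                            # stand-in for data RAM row 17
weight_ram = [[r * N + c for c in range(N)] for r in range(N)]  # weight RAM rows 0..N-1

acc = [0] * N          # accumulators 202, cleared by the initialize instruction
mux_reg = data_row[:]  # multiplexing registers 208 after the initial load

# Clock 1: each NPU multiplies its own data word by its word of weight row 0.
for j in range(N):
    acc[j] += mux_reg[j] * weight_ram[0][j]

# Clocks 2..N: rotate the data words by one NPU, then accumulate weight rows 1..N-1.
for clk in range(1, N):
    mux_reg = [mux_reg[(j - 1) % N] for j in range(N)]  # NPU j takes NPU j-1's word
    for j in range(N):
        acc[j] += mux_reg[j] * weight_ram[clk][j]

# acc[j] now holds the full dot product seen by neuron j: every data word has
# visited NPU j exactly once, each paired with a distinct weight row.
```

Note that no large crossbar is needed: each data word reaches each NPU solely by the one-step rotation, which is the advantage discussed with respect to Fig. 7 below.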
It may be observed that the number of clocks required to generate the result words (neuron outputs) and write them back to the data RAM 122 or weight RAM 124 is approximately the square root of the number of data inputs (connections) received by the current layer of the neural network. For example, if the current layer has 512 neurons each with 512 connections from the previous layer, the total number of connections is 256K, and the number of clocks required to generate the results for the current layer is slightly more than 512. Thus, the NNU 121 provides extremely high performance for neural network computations.
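The arithmetic behind this square-root claim can be checked directly; the snippet below simply restates the example figures from the paragraph above.

```python
# A layer of 512 neurons, each with 512 inputs, has 512 * 512 = 256K connections,
# yet the NNU finishes it in roughly 512 clocks (one weight row per clock) --
# about the square root of the connection count.
import math

neurons = 512
connections_per_neuron = 512
total_connections = neurons * connections_per_neuron  # 262144, i.e. 256K
clocks = connections_per_neuron                       # one weight RAM row per clock
sqrt_connections = math.isqrt(total_connections)      # 512
```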
Referring now to Fig. 6B, a flowchart is shown illustrating the operation of the processor 100 of Fig. 1 executing an architectural program that uses the NNU 121 to perform the multiply-accumulate-activation function computations classically associated with the neurons of a hidden layer of an artificial neural network, such as those performed by the program of Fig. 4. The example of Fig. 6B assumes computations for four hidden layers (signified by the initialization of the NUM_LAYERS variable at block 602), each having 512 neurons, each of which is connected to all 512 neurons of the previous layer (via the program of Fig. 4). However, it should be understood that these numbers of layers and neurons were chosen for illustrative purposes, and the NNU 121 can be employed to perform similar computations for different numbers of hidden layers, different numbers of neurons per layer, and neurons that are not fully connected. In one embodiment, the weight values may be set to zero for neurons that do not exist in a layer, or for non-existent connections to a neuron. Preferably, the architectural program writes a first set of weights to the weight RAM 124 and starts the NNU 121, and while the NNU 121 is performing the computations associated with the first layer, the architectural program writes a second set of weights to the weight RAM 124, so that as soon as the NNU 121 completes the computations for the first hidden layer, the NNU 121 can start the computations for the second layer. In this manner, the architectural program ping-pongs between the two regions of the weight RAM 124 in order to keep the NNU 121 fully utilized. The flow begins at block 602.
At block 602, the processor 100 (i.e., the architectural program running on the processor 100) writes the input values for the current hidden layer of neurons to the data RAM 122, e.g., to row 17 of the data RAM 122, as shown in and described with respect to Fig. 6A. Alternatively, the values may already reside in row 17 of the data RAM 122 as the results 133 of an operation the NNU 121 performed for a previous layer (e.g., a convolution, pooling or input layer). Additionally, the architectural program initializes a variable N to the value 1. The variable N denotes the current layer of the hidden layers being processed by the NNU 121. The architectural program also initializes a variable NUM_LAYERS to the value 4, since there are four hidden layers in this example. The flow proceeds to block 604.
At block 604, the processor 100 writes the weight words for layer 1 to the weight RAM 124, e.g., to rows 0 through 511, as shown in Fig. 6A. The flow proceeds to block 606.
At block 606, the processor 100 writes a multiply-accumulate-activation function program (e.g., of Fig. 4) to the program memory 129 of the NNU 121, using MTNN instructions 1400 that specify a function 1432 to write the program memory 129. The processor 100 then starts the NNU program using an MTNN instruction 1400 that specifies a function 1432 to start execution of the program. The flow proceeds to decision block 608.
At decision block 608, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, the flow proceeds to block 612; otherwise it proceeds to block 614.
At block 612, the processor 100 writes the weight words for layer N+1 to the weight RAM 124, e.g., to rows 512 through 1023. Thus, advantageously, the architectural program writes the weight words for the next layer to the weight RAM 124 while the NNU 121 is performing the hidden-layer computations for the current layer, so that the NNU 121 can begin the hidden-layer computations for the next layer immediately once the computations for the current layer are complete, i.e., once the results have been written to the data RAM 122. The flow proceeds to block 614.
At block 614, the processor 100 determines that the currently running NNU program (started at block 606 in the case of layer 1, and at block 618 in the case of layers 2 through 4) has completed. Preferably, the processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of the NNU 121. In an alternative embodiment, the NNU 121 generates an interrupt to indicate that it has completed the multiply-accumulate-activation function layer program. The flow proceeds to decision block 616.
At decision block 616, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, the flow proceeds to block 618; otherwise, the flow proceeds to block 622.
At block 618, the processor 100 updates the multiply-accumulate-activation function program so that it can perform the hidden-layer computations for layer N+1. More specifically, the processor 100 updates the data RAM 122 row value of the multiply-accumulate instruction at address 1 of Fig. 4 to the row of the data RAM 122 to which the previous layer wrote its results (e.g., to row 16), and also updates the output row (e.g., to row 15). The processor 100 then starts the updated NNU program. Alternatively, the program of Fig. 4 specifies the same row in the output instruction at address 4 as the row specified in the multiply-accumulate instruction at address 1 (i.e., the row read from the data RAM 122). In this embodiment, the current row of input data words is overwritten (which is acceptable as long as the row of data words is not needed for some other purpose, since it has already been read into the multiplexing registers 208 and is being rotated among the NPUs 126 via the N-word rotator). In this case, no update of the NNU program is needed at block 618; it need only be restarted. The flow proceeds to block 622.
At block 622, the processor 100 reads the results of the NNU program for layer N from the data RAM 122. However, if the results are simply to be used by the next layer, the architectural program need not read them from the data RAM 122; instead, they may remain in the data RAM 122 for the next hidden-layer computation. The flow proceeds to decision block 624.
At decision block 624, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, the flow proceeds to block 626; otherwise, the flow ends.
At block 626, the architectural program increments N by one. The flow returns to decision block 608.
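The control flow of blocks 602 through 626 can be sketched in software. The `FakeNNU` class and its method names below are purely illustrative stand-ins for the MTNN/MFNN instruction sequences the text describes, not the patent's API; the sketch exists to make visible how block 612 preloads the next layer's weights into the alternate weight RAM region while block 614 waits on the current layer.

```python
# Hedged sketch of the architectural program's loop in Fig. 6B (blocks 602-626),
# assuming NUM_LAYERS = 4 and two 512-row weight RAM regions used ping-pong style.

NUM_LAYERS = 4

class FakeNNU:
    """Records operations in order so the double-buffering schedule is visible."""
    def __init__(self):
        self.log = []
    def op(self, name, *args):
        self.log.append((name,) + args)

def run_hidden_layers(nnu):
    nnu.op("write_data_ram", 17)               # block 602: layer inputs -> row 17
    nnu.op("write_weights", 0, 511)            # block 604: layer-1 weights
    nnu.op("write_program")                    # block 606: program -> memory 129
    nnu.op("start")
    n = 1
    while True:
        if n < NUM_LAYERS:                     # block 612: preload next layer's
            lo = 512 if n % 2 == 1 else 0      # weights into the other region
            nnu.op("write_weights", lo, lo + 511)
        nnu.op("wait_done")                    # block 614: poll status register 127
        if n < NUM_LAYERS:                     # blocks 616-618: retarget rows, restart
            nnu.op("update_rows")
            nnu.op("start")
        nnu.op("read_results", n)              # block 622 (optional per the text)
        if n >= NUM_LAYERS:                    # block 624: done after last layer
            return
        n += 1                                 # block 626

nnu = FakeNNU()
run_hidden_layers(nnu)
```

The recorded log shows four weight-RAM fills alternating between regions 0-511 and 512-1023, with each fill issued before the corresponding wait, which is the utilization point the text makes.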
As may be determined from the example of Fig. 6B, approximately every 512 clock cycles the NPUs 126 read the data RAM 122 once and write it once (by virtue of the operation of the NNU program of Fig. 4). Additionally, the NPUs 126 read the weight RAM 124 approximately every clock cycle to read a row of weight words. Thus, the entire bandwidth of the weight RAM 124 is consumed by the hybrid manner in which the NNU 121 performs the hidden-layer operation. Furthermore, assuming an embodiment that includes a write and read buffer, such as the buffer 1704 of Figure 17, concurrently with the reads by the NPUs 126 the processor 100 writes the weight RAM 124, such that the buffer 1704 performs one write to the weight RAM 124 approximately every 16 clock cycles to write the weight words. Thus, in a single-ported embodiment of the weight RAM 124 (such as described with respect to Figure 17), approximately every 16 clock cycles the NPUs 126 must forgo a read of the weight RAM 124 so that the buffer 1704 can write the weight RAM 124. However, in an embodiment in which the weight RAM 124 is dual-ported, the NPUs 126 need not stall.
Referring now to Fig. 7, a block diagram is shown illustrating an NPU 126 of Fig. 1 according to an alternative embodiment. The NPU 126 of Fig. 7 is similar in many respects to the NPU 126 of Fig. 2. However, the NPU 126 of Fig. 7 additionally includes a second 2-input multiplexing register 705. The multiplexing register 705 selects one of its inputs 206 or 711 to store in its register and then provide on its output 203 in a subsequent clock cycle. Input 206 receives a weight word from the weight RAM 124. The other input 711 receives the output 203 of the second multiplexing register 705 of the adjacent NPU 126. Preferably, the input 711 of the multiplexing register 705 of NPU J receives the output 203 of the multiplexing register 705 of NPU 126 instance J-1, and the output of NPU J is provided to the input 711 of the multiplexing register 705 of NPU 126 instance J+1. In this manner, the multiplexing registers 705 of the N NPUs 126 collectively operate as an N-word rotator, in a manner similar to that described above with respect to Fig. 3, but for weight words rather than data words. A control input 713 controls which of the two inputs the multiplexing register 705 selects to store in its register and subsequently provide on the output 203.
Including the multiplexing registers 208 and/or multiplexing registers 705 (as well as the multiplexing registers of other embodiments, such as those of Figures 18 and 23) to effectively form a large rotator that rotates a row of data/weights received from the data RAM 122 and/or weight RAM 124 has the following advantage: the NNU 121 does not require an extremely large multiplexer that would otherwise be needed between the data RAM 122 and/or weight RAM 124 in order to provide the necessary data word/weight word to the appropriate NNU 121.
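The per-clock behavior of the second multiplexing register 705 can be sketched as follows. This is a hedged model, not the patent's circuitry: `step` stands in for one clock edge, with the `select_rotate` flag modeling the control input 713 choosing between a fresh load from the weight RAM (input 206) and the neighboring NPU's previous value (input 711).

```python
# Model of the weight-word rotator formed by the N multiplexing registers 705
# of Fig. 7. N = 4 keeps the example small; after N rotations a loaded row
# returns to its original alignment, mirroring the data rotator of Fig. 3.

N = 4

def step(regs, weight_row, select_rotate):
    if select_rotate:                       # control 713 = rotate: NPU j takes NPU j-1's output
        return [regs[(j - 1) % N] for j in range(N)]
    return list(weight_row)                 # control 713 = load from weight RAM (input 206)

regs = step(None, [10, 11, 12, 13], select_rotate=False)  # load a weight row
for _ in range(N):                                        # N rotations restore it
    regs = step(regs, None, select_rotate=True)
```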
Writing back accumulator values in addition to activation function results
In some applications, it is useful for the processor 100 to receive back (e.g., into the media registers 118 via the MFNN instruction of Figure 15) the raw accumulator 202 values 217, upon which instructions executing on other execution units 112 can perform computations. For example, in one embodiment, in order to reduce the complexity of the AFU 212, the AFU 212 is not configured to perform a softmax activation function. Consequently, the NNU 121 may output the raw accumulator 202 values 217, or a subset thereof, to the data RAM 122 or weight RAM 124, from which the architectural program subsequently reads the raw accumulator 202 values 217 (or a subset thereof) and performs computations on the raw values. However, use of the raw accumulator 202 values 217 is not limited to performance of the softmax operation; other uses are contemplated.
Referring now to Fig. 8, a block diagram is shown illustrating an NPU 126 of Fig. 1 according to an alternative embodiment. The NPU 126 of Fig. 8 is similar in many respects to the NPU 126 of Fig. 2. However, the NPU 126 of Fig. 8 includes a multiplexer (mux) 802 in its AFU 212, and the AFU 212 has a control input 803. The width (in bits) of the accumulator 202 is greater than the width of a data word. The multiplexer 802 has multiple inputs that receive data-word-width portions of the accumulator 202 output 217. In one embodiment, the width of the accumulator 202 is 41 bits, and the NPU 126 is configured to output a 16-bit result word 133; thus, for example, the multiplexer 802 (or the multiplexer 3032 and/or multiplexer 3037 of Figure 30) has three inputs that receive bits [15:0], bits [31:16] and bits [47:32], respectively, of the accumulator 202 output 217. Preferably, output bits not provided by the accumulator 202 (e.g., bits [47:41]) are forced to zero.
In response to a write ACC instruction (such as the write ACC instructions at addresses 3 through 5 of Fig. 9, described below), the sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the words (e.g., 16 bits) of the accumulator 202. Preferably, the multiplexer 802 also has one or more inputs that receive the outputs of activation function circuits (e.g., the outputs of elements 3022, 3024, 3026, 3018, 3014 and 3016 of Figure 30), which generate outputs that are the width of a data word. In response to an instruction such as the write AFU output instruction at address 4 of Fig. 4, the sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the activation function circuit outputs rather than one of the words of the accumulator 202.
Referring now to Fig. 9, a table is shown illustrating a program stored in the program memory 129 of the NNU 121 of Fig. 1 and executed by the NNU 121. The exemplary program of Fig. 9 is similar in many respects to the program of Fig. 4. Specifically, the instructions at addresses 0 through 2 are identical. However, the instructions at addresses 3 and 4 of Fig. 4 are replaced in Fig. 9 by write ACC instructions that instruct the 512 NPUs 126 to write their accumulator 202 outputs 217 back as results 133 to three rows of the data RAM 122, which in this example are rows 16 through 18. That is, the write ACC instructions instruct the sequencer 128 to output a data RAM address 123 value of 16 and a write command in the first clock cycle, a data RAM address 123 value of 17 and a write command in the second clock cycle, and a data RAM address 123 value of 18 and a write command in the third clock cycle. Preferably, execution of the write ACC instructions may be overlapped with the execution of other instructions, such that the write ACC instructions effectively execute in three clock cycles, one for each row written to the data RAM 122. In one embodiment, the user specifies values of the activation function 2934 and output command 2956 fields of the control register 127 (of Figure 29A) to accomplish the writing of the desired portions of the accumulator 202 to the data RAM 122 or weight RAM 124. Alternatively, rather than writing back the entire contents of the accumulator 202, the write ACC instructions may optionally write back a subset of the accumulator 202. In one embodiment, a canonical form of the accumulator 202 may be written back, as described in more detail below with respect to Figures 29 through 31.
Referring now to Figure 10, a timing diagram is shown illustrating the execution of the program of Fig. 9 by the NNU 121. The timing diagram of Figure 10 is similar to the timing diagram of Fig. 5, and clocks 0 through 512 are the same. However, at clocks 513-515, the AFU 212 of each of the 512 NPUs 126 executes one of the write ACC instructions at addresses 3 through 5 of Fig. 9. Specifically, at clock 513, each of the 512 NPUs 126 writes bits [15:0] of its accumulator 202 output 217 as its result 133 back to its corresponding word in row 16 of the data RAM 122; at clock 514, each of the 512 NPUs 126 writes bits [31:16] of its accumulator 202 output 217 as its result 133 back to its corresponding word in row 17 of the data RAM 122; and at clock 515, each of the 512 NPUs 126 writes bits [40:32] of its accumulator 202 output 217 as its result 133 back to its corresponding word in row 18 of the data RAM 122. Preferably, bits [47:41] are forced to zero.
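The three-clock write-back of Figures 8-10 amounts to slicing the 41-bit accumulator into 16-bit words. The following sketch models that slicing under the stated assumptions (41-bit accumulator, 16-bit words, bits [47:41] forced to zero); the test pattern is arbitrary.

```python
# Model of the write ACC sequence: the accumulator 202 output 217 is written
# back over three clocks as 16-bit slices -- bits [15:0], [31:16], then [47:32],
# with bits [47:41] (beyond the 41-bit accumulator) forced to zero.

ACC_BITS = 41

def acc_slices(acc_value):
    acc_value &= (1 << ACC_BITS) - 1          # model the 41-bit accumulator 202
    return [(acc_value >> shift) & 0xFFFF     # words written at clocks 513-515
            for shift in (0, 16, 32)]

slices = acc_slices(0x1_2345_6789_ABCD)       # arbitrary pattern; bits above 40 drop
```

Reassembling the three slices recovers exactly the low 41 accumulator bits, which is what the architectural program would do after reading rows 16 through 18 back via MFNN instructions.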
Shared AFU
Referring now to Figure 11, a block diagram is shown illustrating an embodiment of the NNU 121 of Fig. 1. In the embodiment of Figure 11, a neuron is split into two portions, the activation function unit portion and the ALU portion (which also includes the shift register portion), and each activation function unit portion is shared by multiple ALU portions. In Figure 11, the ALU portions are referred to as NPUs 126, and the shared activation function unit portions are referred to as AFUs 1112. This contrasts with the embodiment of Fig. 2, for example, in which each neuron includes its own AFU 212. Hence, for example, in one embodiment the NPUs 126 (ALU portions) of the Figure 11 embodiment include the accumulator 202, ALU 204, multiplexing register 208 and register 205 of Fig. 2, but not the AFU 212. In the embodiment of Figure 11, the NNU 121 includes 512 NPUs 126 as an example; however, other embodiments with other numbers of NPUs 126 are contemplated. In the example of Figure 11, the 512 NPUs 126 are grouped into 64 groups of 8 NPUs 126 each, referred to as groups 0 through 63 in Figure 11.
The NNU 121 also includes a row buffer 1104 and a plurality of shared AFUs 1112 coupled between the NPUs 126 and the row buffer 1104. The width (in bits) of the row buffer 1104 is the same as a row of the data RAM 122 or weight RAM 124, e.g., 512 words. There is one AFU 1112 for each group of NPUs 126, i.e., each AFU 1112 has a corresponding group of NPUs 126; thus, in the embodiment of Figure 11 there are 64 AFUs 1112 corresponding to the 64 groups of NPUs 126. Each of the 8 NPUs 126 in a group shares the corresponding AFU 1112. Other embodiments with different numbers of AFUs 1112 and different numbers of NPUs 126 per group are contemplated. For example, other embodiments are contemplated in which two, four or sixteen NPUs 126 in a group share an AFU 1112.
A motivation for sharing the AFUs 1112 is to reduce the size of the NNU 121. The size reduction is obtained at the cost of performance. That is, as illustrated in Figure 12 below, depending upon the sharing ratio, several additional clocks may be required to generate the results 133 for the entire array of NPUs 126; in this case, seven additional clock cycles are required because of the 8:1 sharing ratio. However, generally speaking, the additional number of clocks (e.g., 7) is relatively small compared with the number of clocks required to generate the accumulated sums (e.g., 512 clocks for a layer with 512 connections per neuron). Hence, the relatively small performance impact (e.g., an approximately one percent increase in computation time) may be a worthwhile trade-off for the reduced size of the NNU 121.
In one embodiment, each of the NPUs 126 includes an AFU 212 that performs relatively simple activation functions, such that these simple AFUs 212 can be relatively small and can therefore be included in each NPU 126, whereas the shared, or complex, AFUs 1112 perform relatively complex activation functions and are therefore relatively significantly larger than the simple AFUs 212. In such an embodiment, the additional clock cycles are required only when a complex activation function is specified that requires sharing of a complex AFU 1112, but not when an activation function is specified that the simple AFUs 212 are configured to perform.
Referring now to Figures 12 and 13, two timing diagrams are shown illustrating the execution of the program of Fig. 4 by the NNU 121 of Figure 11. The timing diagram of Figure 12 is similar to the timing diagram of Fig. 5, and clocks 0 through 512 are the same. However, at clock 513 the operation differs from that described in the timing diagram of Fig. 5 because the NPUs 126 of Figure 11 share the AFUs 1112; that is, the NPUs 126 of a group share the AFU 1112 associated with the group, and Figure 11 illustrates the sharing.
Each row of the timing diagram of Figure 13 corresponds to a successive clock cycle indicated in the first column. Each of the other columns corresponds to a different one of the 64 AFUs 1112 and indicates its operation. Only the operations of AFUs 0, 1 and 63 are shown for simplicity and clarity of illustration. The clock cycles of Figure 13 correspond to the clock cycles of Figure 12, but illustrate the sharing of the AFUs 1112 by the NPUs 126 in a different manner. As shown in Figure 13, at clocks 0 through 512 each of the 64 AFUs 1112 is inactive while the NPUs 126 execute the initialize NPU instruction, the multiply-accumulate instruction and the multiply-accumulate rotate instruction.
As shown in both Figures 12 and 13, at clock 513, AFU 0 (the AFU 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of NPU 0 (the first NPU 126 in group 0), and the output of AFU 0 will be stored to word 0 of the row buffer 1104. Also at clock 513, each of the AFUs 1112 begins to perform the specified activation function on the accumulator 202 of the first NPU 126 in its corresponding group of NPUs 126. Thus, as shown in Figure 13, at clock 513 AFU 0 begins to perform the specified activation function on the accumulator 202 of NPU 0 to generate a result that will be stored to word 0 of the row buffer 1104; AFU 1 begins to perform the specified activation function on the accumulator 202 of NPU 8 to generate a result that will be stored to word 8 of the row buffer 1104; and so forth, until AFU 63 begins to perform the specified activation function on the accumulator 202 of NPU 504 to generate a result that will be stored to word 504 of the row buffer 1104.
As shown, at clock 514, AFU 0 (the AFU 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of NPU 1 (the second NPU 126 in group 0), and the output of AFU 0 will be stored to word 1 of the row buffer 1104. Also at clock 514, each of the AFUs 1112 begins to perform the specified activation function on the accumulator 202 of the second NPU 126 in its corresponding group of NPUs 126. Thus, as shown in Figure 13, at clock 514 AFU 0 begins to perform the specified activation function on the accumulator 202 of NPU 1 to generate a result that will be stored to word 1 of the row buffer 1104; AFU 1 begins to perform the specified activation function on the accumulator 202 of NPU 9 to generate a result that will be stored to word 9 of the row buffer 1104; and so forth, until AFU 63 begins to perform the specified activation function on the accumulator 202 of NPU 505 to generate a result that will be stored to word 505 of the row buffer 1104. As shown, this pattern continues until clock cycle 520, at which AFU 0 (the AFU 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of NPU 7 (the eighth and last NPU 126 in group 0), and the output of AFU 0 will be stored to word 7 of the row buffer 1104. Also at clock 520, each of the AFUs 1112 begins to perform the specified activation function on the accumulator 202 of the eighth NPU 126 in its corresponding group of NPUs 126. Thus, as shown in Figure 13, at clock 520 AFU 0 begins to perform the specified activation function on the accumulator 202 of NPU 7 to generate a result that will be stored to word 7 of the row buffer 1104; AFU 1 begins to perform the specified activation function on the accumulator 202 of NPU 15 to generate a result that will be stored to word 15 of the row buffer 1104; and so forth, until AFU 63 begins to perform the specified activation function on the accumulator 202 of NPU 511 to generate a result that will be stored to word 511 of the row buffer 1104.
At clock 521, once all 512 results associated with the 512 NPUs 126 have been generated and written to the row buffer 1104, the row buffer 1104 begins to write its contents to the data RAM 122 or weight RAM 124. In this fashion, the AFU 1112 of each of the 64 groups of NPUs 126 performs a portion of the activation function instruction at address 3 of Fig. 4.
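The shared-AFU schedule of Figures 12-13 can be enumerated compactly: at clock 513 + k (k = 0 through 7), AFU g serves NPU g*8 + k, and the result lands in word g*8 + k of the row buffer. The sketch below simply generates that schedule; the tuple format is illustrative.

```python
# Enumeration of the shared-AFU schedule of Figs. 12-13: 512 NPUs in 64 groups
# of 8, one AFU 1112 per group, clocks 513..520, row buffer written at clock 521.

GROUPS, PER_GROUP = 64, 8
row_buffer = [None] * (GROUPS * PER_GROUP)
schedule = []                                  # (clock, afu, npu) triples

for k in range(PER_GROUP):                     # clocks 513..520
    clock = 513 + k
    for g in range(GROUPS):
        npu = g * PER_GROUP + k                # k-th NPU of group g
        schedule.append((clock, g, npu))
        row_buffer[npu] = ("afu_result", npu)  # lands in word npu of buffer 1104

# All 512 words are in place before the row buffer writes back at clock 521.
```

The seven extra clocks relative to the unshared design of Fig. 5 (clocks 514-520) are exactly the cost quoted in the 8:1-sharing trade-off discussion above.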
Embodiments such as that of Figure 11 that share AFUs 1112 among groups of ALUs 204 may be particularly advantageous in conjunction with integer ALUs 204, as described in more detail below, e.g., with respect to Figures 29A through 33.
MTNN and MFNN architectural instructions
Referring now to Figure 14, a block diagram is shown illustrating a move to neural network (MTNN) architectural instruction 1400 and its operation with respect to portions of the NNU 121 of Fig. 1. The MTNN instruction 1400 includes an opcode field 1402, a src1 field 1404, a src2 field 1406, a gpr field 1408 and an immediate field 1412. The MTNN instruction 1400 is an architectural instruction, i.e., it is included in the instruction set architecture of the processor 100. Preferably, the instruction set architecture associates a predetermined value of the opcode field 1402 with the MTNN instruction 1400 to distinguish it from other instructions in the instruction set architecture. The MTNN instruction 1400 opcode 1402 may or may not include prefixes, such as are common in the x86 architecture.
The immediate field 1412 provides a value that specifies a function 1432 to control logic 1434 of the NNU 121. Preferably, the function 1432 is provided as an immediate operand of a microinstruction 105 of Fig. 1. The functions 1432 that may be performed by the NNU 121 include, but are not limited to, writing to the data RAM 122, writing to the weight RAM 124, writing to the program memory 129, writing to the control register 127, starting execution of a program in the program memory 129, pausing execution of a program in the program memory 129, requesting notification (e.g., an interrupt) of completion of the execution of a program in the program memory 129, and resetting the NNU 121. Preferably, the NNU instruction set includes an instruction whose result indicates that the NNU program is complete. Alternatively, the NNU instruction set includes an explicit generate-interrupt instruction. Preferably, resetting the NNU 121 effectively forces the NNU 121 back to a reset state (e.g., the internal state machines are cleared and set to an idle state), except that the contents of the data RAM 122, weight RAM 124 and program memory 129 remain intact. Additionally, internal registers such as the accumulator 202 are not affected by the reset function and must be explicitly cleared, e.g., by the initialize NPU instruction at address 0 of Fig. 4. In one embodiment, the function 1432 may include a direct execution function in which the first source register contains a micro-operation (see, e.g., micro-operation 3418 of Figure 34). The direct execution function instructs the NNU 121 to directly execute the specified micro-operation. In this manner, an architectural program may directly control the NNU 121 to perform operations, rather than writing instructions to the program memory 129 and subsequently instructing the NNU 121 to execute the instructions in the program memory 129, or executing them by means of an MTNN instruction 1400 (or an MFNN instruction 1500 of Figure 15). Figure 14 illustrates an example of the function 1432 of writing to the data RAM 122.
The gpr field 1408 specifies one of the GPRs in the general purpose register file 116. In one embodiment, each GPR is 64 bits. The general purpose register file 116 provides the value from the selected GPR to the NNU 121, as shown, and the NNU 121 uses the value as an address 1422. The address 1422 selects a row of the memory specified in the function 1432. In the case of the data RAM 122 or weight RAM 124, the address 1422 additionally selects a chunk within the selected row that is twice the size of a media register (e.g., 512 bits). Preferably, the location is on a 512-bit boundary. In one embodiment, a multiplexer selects either the address 1422 (or the address 1522 in the case of the MFNN instruction 1500 described below) or the address 123/125/131 from the sequencer 128 for provision to the data RAM 122/weight RAM 124/program memory 129. In one embodiment, as described in more detail below, the data RAM 122 is dual-ported to allow the NPUs 126 to read/write the data RAM 122 concurrently with the media registers 118 reading/writing the data RAM 122. In one embodiment, the weight RAM 124 is also dual-ported for a similar purpose.
The src1 field 1404 and src2 field 1406 each specify a media register in the media register file 118. In one embodiment, each media register 118 is 256 bits. The media register file 118 provides the concatenated data (e.g., 512 bits) from the selected media registers to the data RAM 122 (or weight RAM 124 or program memory 129), as shown, for writing to the selected row 1428 specified by the address 1422 and to the location within the selected row 1428 specified by the address 1422. Advantageously, by executing a series of MTNN instructions 1400 (and MFNN instructions 1500 described below), an architectural program executing on the processor 100 can populate rows of the data RAM 122 and rows of the weight RAM 124 and write a program to the program memory 129, such as the programs described herein (e.g., of Fig. 4 and Fig. 9), to cause the NNU 121 to perform operations on the data and weights at extremely high speeds in order to accomplish an artificial neural network. In one embodiment, the architectural program directly controls the NNU 121 rather than writing a program into the program memory 129.
In one embodiment, rather than specifying two source registers (e.g., 1404 and 1406), the MTNN instruction 1400 specifies a start source register and a number of source registers, Q. This form of the MTNN instruction 1400 instructs the processor 100 to write the media register 118 specified as the start source register, as well as the next Q-1 sequential media registers 118, to the NNU 121, i.e., to write them to the specified data RAM 122 or weight RAM 124. Preferably, the instruction translator 104 translates the MTNN instruction 1400 into as many microinstructions as are needed to write all Q of the specified media registers 118. For example, in one embodiment, when the MTNN instruction 1400 specifies the start source register as MR4 and Q as 8, the instruction translator 104 translates the MTNN instruction 1400 into four microinstructions, the first of which writes MR4 and MR5, the second of which writes MR6 and MR7, the third of which writes MR8 and MR9, and the fourth of which writes MR10 and MR11. In an alternative embodiment in which the data path from the media registers 118 to the NNU 121 is 1024 bits rather than 512 bits, the instruction translator 104 translates the MTNN instruction 1400 into two microinstructions, the first of which writes MR4 through MR7, and the second of which writes MR8 through MR11. A similar embodiment is contemplated in which the MFNN instruction 1500 specifies a start destination register and a number of destination registers, in order to enable each MFNN instruction 1500 to read a chunk of a row of the data RAM 122 or weight RAM 124 that is larger than a single media register 118.
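The translation rule for the multi-register MTNN form can be sketched as follows. The helper name and the string register labels are illustrative; `regs_per_uop` models how many media registers one microinstruction carries, determined by the width of the data path from the media registers 118 to the NNU 121 (two 256-bit registers for a 512-bit path, four for a 1024-bit path).

```python
# Sketch of how the instruction translator 104 might split an MTNN instruction
# that names a start register plus a count Q into per-microinstruction groups.

def split_mtnn(start_reg, q, regs_per_uop):
    return [tuple(f"MR{start_reg + i + j}" for j in range(regs_per_uop))
            for i in range(0, q, regs_per_uop)]

uops_512 = split_mtnn(4, 8, 2)    # 512-bit path: four uops, pairs MR4/MR5 .. MR10/MR11
uops_1024 = split_mtnn(4, 8, 4)   # 1024-bit path: two uops, MR4-MR7 and MR8-MR11
```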
Referring now to Figure 15, a block diagram is shown illustrating a move from neural network (MFNN) architectural instruction 1500 and the operation of portions of the NNU 121 of Figure 1 with respect to it. The MFNN instruction 1500 includes an opcode field 1502, a dst field 1504, a gpr field 1508 and an immediate field 1512. The MFNN instruction 1500 is an architectural instruction, i.e., it is included in the instruction set architecture of the processor 100. Preferably, the instruction set architecture associates a predetermined value of the opcode field 1502 with the MFNN instruction 1500, to distinguish the MFNN instruction 1500 from other instructions in the instruction set architecture. The opcode 1502 of the MFNN instruction 1500 may or may not include a prefix, such as is common, for example, in the x86 architecture.
The immediate field 1512 provides a value that specifies a function 1532 to the control logic 1434 of the NNU 121. Preferably, the function 1532 is provided as an immediate operand of a microinstruction 105 of Figure 1. The functions 1532 that may be performed by the NNU 121 include, but are not limited to, reading the data RAM 122, reading the weight RAM 124, reading the program memory 129, and reading the status register 127. Figure 15 shows an example of the function 1532 for reading the data RAM 122.
The gpr field 1508 specifies a GPR in the general-purpose register file 116. As shown, the general-purpose register file 116 provides the value from the selected GPR to the NNU 121, which uses the value as an address 1522 that operates in a manner similar to the address 1422 of Figure 14 to select a row of the memory specified in the function 1532; additionally, in the case of the data RAM 122 or weight RAM 124, the address 1522 selects a block of data within the selected row that is the size in bits of a media register (e.g., 256 bits). Preferably, the location is on a 256-bit boundary.
The dst field 1504 specifies a media register in the media register file 118. As shown, the media register file 118 receives the data (e.g., 256 bits) from the data RAM 122 (or weight RAM 124 or program memory 129) into the selected media register, the data being read from the selected row 1528 specified by the address 1522 and from the location within the selected row 1528 specified by the address 1522.
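The row-plus-block addressing just described can be modeled behaviorally. The exact encoding of the address 1522 is not given in the text, so the sketch below makes the simplifying (hypothetical) assumption that the low-order portion of the address selects a 256-bit, naturally aligned block within the 8192-bit row and the remainder selects the row; the function name is illustrative only.

```python
# Minimal behavioral model of the MFNN read path: a GPR-supplied address
# selects a row of the data RAM and a 256-bit (media-register-sized) block
# within that row. Rows are modeled as 1024-byte Python bytes objects.

ROW_BITS = 8192
BLOCK_BITS = 256                            # one media register
BLOCKS_PER_ROW = ROW_BITS // BLOCK_BITS     # 32

def mfnn_read(ram_rows, address):
    """Return the 256-bit block (as bytes) selected by the assumed encoding."""
    row_index = address // BLOCKS_PER_ROW
    block_index = address % BLOCKS_PER_ROW
    row = ram_rows[row_index]               # 1024 bytes per 8192-bit row
    start = block_index * (BLOCK_BITS // 8)
    return row[start:start + BLOCK_BITS // 8]

# Two rows of data RAM; row 1 is filled with the byte 0x11.
ram = [bytes(1024), bytes([0x11]) * 1024]
block = mfnn_read(ram, 1 * BLOCKS_PER_ROW + 3)   # row 1, block 3
```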
NNU internal RAM port configurations
Referring now to Figure 16, a block diagram is shown illustrating an embodiment of the data RAM 122 of Figure 1. The data RAM 122 includes a memory array 1606, a read port 1602 and a write port 1604. The memory array 1606 holds data words and is preferably arranged, as described above, as D rows of N words each. In one embodiment, the memory array 1606 comprises an array of 64 horizontally arranged static RAM cells (each cell being 128 bits wide and 64 tall) to provide a 64 KB data RAM 122 that is 8192 bits wide and has 64 rows, and the data RAM 122 occupies approximately 0.2 square millimeters of die area. However, other embodiments are contemplated.
The read port 1602 is preferably coupled in a multiplexed fashion to the NPUs 126 and to the media registers 118. (More precisely, the media registers 118 may be coupled to the read port 1602 via a result bus, which may also provide data to a reorder buffer and/or result-forwarding buses to the other execution units 112.) The NPUs 126 and the media registers 118 share the read port 1602 to read from the data RAM 122. The write port 1604 is also preferably coupled in a multiplexed fashion to the NPUs 126 and to the media registers 118, which share the write port 1604 to write to the data RAM 122. Advantageously, the media registers 118 can therefore be writing to the data RAM 122 in parallel while the NPUs 126 are reading from it, or the NPUs 126 can be writing to the data RAM 122 in parallel while the media registers 118 are reading from it. This can advantageously provide improved performance. For example, the NPUs 126 can read the data RAM 122 (e.g., to continue performing computations) while the media registers 118 write more data words into it. For another example, the NPUs 126 can write computation results to the data RAM 122 while the media registers 118 read computation results from it. In one embodiment, the NPUs 126 can write a row of computation results to the data RAM 122 while also reading a row of data words from it. In one embodiment, the memory array 1606 is configured in banks. When the NPUs 126 access the data RAM 122, all the banks are activated to access an entire row of the memory array 1606; whereas when the media registers 118 access the data RAM 122, only the specified banks are activated. In one embodiment, each bank is 128 bits wide and the media registers 118 are 256 bits wide, so that, for example, two banks are activated each time a media register 118 accesses the data RAM 122. In one embodiment, one of the ports 1602/1604 is a read/write port. In one embodiment, both of the ports 1602/1604 are read/write ports.
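The bank-activation behavior described above can be sketched as a small model. This follows the stated embodiment (64 banks of 128 bits spanning the 8192-bit row; a 256-bit media-register access touching two adjacent banks; an NPU access activating every bank); the function and parameter names are illustrative assumptions.

```python
# Sketch of which banks of the data RAM memory array are activated for a
# full-row NPU access versus a narrow media-register access.

NUM_BANKS = 64
BANK_BITS = 128

def active_banks(accessor: str, bit_offset: int = 0):
    """Return the list of bank indices activated by the access."""
    if accessor == "npu":            # full-row access: every bank fires
        return list(range(NUM_BANKS))
    if accessor == "media":          # 256-bit access: two adjacent banks
        first = bit_offset // BANK_BITS
        return [first, first + 1]
    raise ValueError(accessor)

npu_banks = active_banks("npu")
media_banks = active_banks("media", bit_offset=512)   # third 256-bit block
```

Activating only two of 64 banks for a media-register access is what makes the concurrent architectural-program traffic cheap relative to the full-row NPU traffic.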
An advantage of the rotator capability of the NPUs 126 described herein is that it helps the memory array 1606 of the data RAM 122 have significantly fewer rows, and therefore be relatively much smaller, than the memory array that would otherwise be required in order to keep the NPUs 126 highly utilized if the architectural program were required (via the media registers 118) to continuously supply data to the data RAM 122 and retrieve results from it while the NPUs 126 are performing their computations.
Internal RAM buffer
Referring now to Figure 17, a block diagram is shown illustrating an embodiment of the weight RAM 124 and buffer 1704 of Figure 1. The weight RAM 124 includes a memory array 1706 and a port 1702. The memory array 1706 holds weight words and is preferably arranged, as described above, as W rows of N words each. In one embodiment, the memory array 1706 comprises an array of 128 horizontally arranged static RAM cells (each cell being 64 bits wide and 2048 tall) to provide a 2 MB weight RAM 124 that is 8192 bits wide and has 2048 rows, and the weight RAM 124 occupies approximately 2.4 square millimeters of die area. However, other embodiments are contemplated.
The port 1702 is preferably coupled in a multiplexed fashion to the NPUs 126 and to the buffer 1704. The NPUs 126 and the buffer 1704 read and write the weight RAM 124 via the port 1702. The buffer 1704 is further coupled to the media registers 118 of Figure 1, such that the media registers 118 read and write the weight RAM 124 through the buffer 1704. Advantageously, while the NPUs 126 are reading or writing the weight RAM 124, the media registers 118 can concurrently be writing to or reading from the buffer 1704 (although preferably the NPUs 126 are stalled, if they are currently executing, to avoid accessing the weight RAM 124 while the buffer 1704 is accessing it). This can advantageously improve performance, particularly because the reads and writes of the weight RAM 124 by the media registers 118 are relatively much smaller than the reads and writes of the weight RAM 124 by the NPUs 126. For example, in one embodiment, the NPUs 126 read/write 8192 bits (one row) at a time, whereas the media registers 118 are 256 bits wide and each MTNN instruction 1400 writes two media registers 118, i.e., 512 bits. Thus, in the case where the architectural program executes sixteen MTNN instructions 1400 to fill the buffer 1704, a conflict between the NPUs 126 and the architectural program for access to the weight RAM 124 occurs less than approximately six percent of the time. In another embodiment, the instruction translator 104 translates an MTNN instruction 1400 into two microinstructions 105, each of which writes a single media register 118 to the buffer 1704, in which case conflicts between the NPUs 126 and the architectural program for access to the weight RAM 124 are even less frequent.
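The "less than approximately six percent" figure can be checked with the numbers the paragraph gives. The reading below is an assumption about how the estimate is derived: of the sixteen buffer-filling MTNN instructions plus the one transfer of the buffer into the weight RAM, only the final transfer actually touches the weight RAM port, so at most 1 of 17 architectural-program steps can collide with NPU accesses.

```python
# Worked arithmetic behind the conflict estimate, under the stated numbers.

buffer_bits = 8192
bits_per_mtnn = 2 * 256                        # two 256-bit media registers
fill_instructions = buffer_bits // bits_per_mtnn   # 16 MTNNs fill the buffer

# 16 buffer fills + 1 buffer-to-weight-RAM write; only the last can conflict.
conflict_fraction = 1 / (fill_instructions + 1)    # 1/17 ~= 0.059 < 6%
```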
In an embodiment that includes the buffer 1704, writing to the weight RAM 124 by an architectural program requires multiple MTNN instructions 1400. One or more MTNN instructions 1400 specify a function 1432 to write to specified blocks of the buffer 1704, followed by an MTNN instruction 1400 that specifies a function 1432 instructing the NNU 121 to write the contents of the buffer 1704 to a specified row of the weight RAM 124, where the size of a block is twice the number of bits of a media register 118 and the blocks are naturally aligned within the buffer 1704. In one embodiment, each MTNN instruction 1400 that specifies a function 1432 to write to specified blocks of the buffer 1704 includes a bitmask having a bit corresponding to each block of the buffer 1704. The data from the two specified source registers 118 is written to each block of the buffer 1704 whose corresponding bit in the bitmask is set. This can be useful for replicating a data value within a row of the weight RAM 124. For example, to zero out the buffer 1704 (and subsequently a row of the weight RAM 124), the programmer can load the source registers with zero and set all bits of the bitmask. Additionally, the bitmask enables the programmer to write only selected blocks of the buffer 1704, thereby retaining the previous data in the other blocks.
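The bitmask-controlled buffer write can be sketched behaviorally. The buffer is modeled as sixteen 512-bit blocks (twice a media register, so 8192 bits total, matching the row width); this is an illustrative software model of the semantics, not the hardware interface, and the function name is hypothetical.

```python
# Behavioral sketch of the bitmask-controlled write into buffer 1704:
# blocks whose mask bit is set receive the source data; the rest keep
# their previous contents.

BLOCKS = 16
BLOCK_BYTES = 512 // 8    # 64 bytes = two 256-bit media registers

def mtnn_buffer_write(buffer_blocks, src_data: bytes, bitmask: int):
    """Write src_data into every block whose bit in bitmask is 1."""
    assert len(src_data) == BLOCK_BYTES
    return [
        src_data if (bitmask >> i) & 1 else old
        for i, old in enumerate(buffer_blocks)
    ]

buf = [bytes([0xAA]) * BLOCK_BYTES for _ in range(BLOCKS)]
# Zero the whole buffer: zero-valued source registers, all mask bits set.
zeroed = mtnn_buffer_write(buf, bytes(BLOCK_BYTES), (1 << BLOCKS) - 1)
# Update only block 5, retaining the previous data elsewhere.
patched = mtnn_buffer_write(buf, bytes([0x55]) * BLOCK_BYTES, 1 << 5)
```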
In an embodiment that includes the buffer 1704, reading the weight RAM 124 by an architectural program requires multiple MFNN instructions 1500. An initial MFNN instruction 1500 specifies a function 1532 to load the buffer 1704 from a specified row of the weight RAM 124, followed by one or more MFNN instructions 1500 that specify a function 1532 to read a specified block of the buffer 1704 into the destination register, where the size of a block is the number of bits of a media register 118 and the blocks are naturally aligned within the buffer 1704. Other embodiments are contemplated in which the weight RAM 124 includes multiple buffers 1704 to further reduce contention between the NPUs 126 and the architectural program for access to the weight RAM 124, by increasing the number of accesses the architectural program can make while the NPUs 126 are executing, which may increase the likelihood that the buffer accesses can be performed during clock cycles in which the NPUs 126 do not need to access the weight RAM 124.
Although Figure 16 describes a dual-ported data RAM 122, other embodiments are contemplated in which the weight RAM 124 is also dual-ported. Furthermore, although Figure 17 describes a buffer used with the weight RAM 124, other embodiments are contemplated in which the data RAM 122 also has an associated buffer similar to the buffer 1704.
Dynamically configurable NPU
Referring now to Figure 18, a block diagram is shown illustrating a dynamically configurable NPU 126 of Figure 1. The NPU 126 of Figure 18 is similar in many respects to the NPU 126 of Figure 2. However, the NPU 126 of Figure 18 is dynamically configurable to operate in one of two different configurations. In the first configuration, the NPU 126 of Figure 18 operates similarly to the NPU 126 of Figure 2. That is, in the first configuration, referred to herein as the "wide" configuration or "single" configuration, the ALU 204 of the NPU 126 performs operations on a single wide data word and a single wide weight word (e.g., 16 bits) to generate a single wide result. In contrast, in the second configuration, referred to herein as the "narrow" configuration or "dual" configuration, the NPU 126 performs operations on two narrow data words and two corresponding narrow weight words (e.g., 8 bits) to generate two corresponding narrow results. In one embodiment, the configuration of the NPU 126 (wide or narrow) is made by the initialize NPU instruction (e.g., the instruction at address 0 of Figure 20 below). Alternatively, the configuration may be made by an MTNN instruction whose function 1432 specifies configuring the NPU 126 to the configuration (wide or narrow). Preferably, a configuration register is populated by the program memory 129 instruction or the MTNN instruction that determines the configuration (wide or narrow). For example, the configuration register output is provided to the ALU 204, the AFU 212 and the logic that generates the mux-reg control signal 213. Generally speaking, the elements of the NPU 126 of Figure 18 perform similar functions to the like-numbered elements of Figure 2, and reference should be made thereto in order to understand Figure 18. However, the embodiment of Figure 18, including its differences from Figure 2, will now be described.
The NPU 126 of Figure 18 includes two registers 205A and 205B, two 3-input mux-regs 208A and 208B, an ALU 204, two accumulators 202A and 202B, and two AFUs 212A and 212B. Each of the registers 205A/205B is half the width (e.g., 8 bits) of the register 205 of Figure 2. Each of the registers 205A/205B receives a respective narrow weight word 206A/206B (e.g., 8 bits) from the weight RAM 124 and provides its output 203A/203B on a subsequent clock cycle to the operand selection logic 1898 of the ALU 204. When the NPU 126 is in the wide configuration, the registers 205A/205B effectively operate together to receive a wide weight word 206A/206B (e.g., 16 bits) from the weight RAM 124, similarly to the manner of the register 205 of the embodiment of Figure 2; and when the NPU 126 is in the narrow configuration, the registers 205A/205B effectively operate individually, each receiving a narrow weight word 206A/206B (e.g., 8 bits) from the weight RAM 124, such that the NPU 126 is effectively two separate narrow NPUs. Nevertheless, the same output bits of the weight RAM 124 are coupled to and provided to the registers 205A/205B, regardless of the configuration of the NPU 126. For example, the register 205A of NPU 0 receives byte 0, the register 205B of NPU 0 receives byte 1, the register 205A of NPU 1 receives byte 2, the register 205B of NPU 1 receives byte 3, and so forth to the register 205B of NPU 511, which receives byte 1023.
Each of the mux-regs 208A/208B is half the width (e.g., 8 bits) of the register 208 of Figure 2. The mux-reg 208A selects one of its inputs 207A, 211A and 1811A to store in its register and provide on its output 209A on a subsequent clock cycle, and the mux-reg 208B selects one of its inputs 207B, 211B and 1811B to store in its register and provide on its output 209B on a subsequent clock cycle to the operand selection logic 1898. The input 207A receives a narrow data word (e.g., 8 bits) from the data RAM 122, and the input 207B receives a narrow data word from the data RAM 122. When the NPU 126 is in the wide configuration, the mux-regs 208A/208B effectively operate together to receive a wide data word 207A/207B (e.g., 16 bits) from the data RAM 122, similarly to the manner of the mux-reg 208 of the embodiment of Figure 2; and when the NPU 126 is in the narrow configuration, the mux-regs 208A/208B effectively operate individually, each receiving a narrow data word 207A/207B (e.g., 8 bits) from the data RAM 122, such that the NPU 126 is effectively two separate narrow NPUs. Nevertheless, the same output bits of the data RAM 122 are coupled to and provided to the mux-regs 208A/208B, regardless of the configuration of the NPU 126. For example, the mux-reg 208A of NPU 0 receives byte 0, the mux-reg 208B of NPU 0 receives byte 1, the mux-reg 208A of NPU 1 receives byte 2, the mux-reg 208B of NPU 1 receives byte 3, and so forth to the mux-reg 208B of NPU 511, which receives byte 1023.
The input 211A receives the output 209A of the mux-reg 208A of the adjacent NPU 126, and the input 211B receives the output 209B of the mux-reg 208B of the adjacent NPU 126. As shown, the input 1811A receives the output 209B of the mux-reg 208B of the adjacent NPU 126, and the input 1811B receives the output 209A of the mux-reg 208A of the instant NPU 126. The NPU 126 shown in Figure 18 is denoted NPU J among the N NPUs 126 of Figure 1; that is, NPU J is a representative instance of the N NPUs. Preferably, the input 211A of the mux-reg 208A of NPU J receives the output 209A of the mux-reg 208A of NPU 126 instance J-1, the input 1811A of the mux-reg 208A of NPU J receives the output 209B of the mux-reg 208B of NPU 126 instance J-1, and the output 209A of the mux-reg 208A of NPU J is provided both to the input 211A of the mux-reg 208A of NPU 126 instance J+1 and to the input 1811B of the mux-reg 208B of NPU 126 instance J; and the input 211B of the mux-reg 208B of NPU J receives the output 209B of the mux-reg 208B of NPU 126 instance J-1, the input 1811B of the mux-reg 208B of NPU J receives the output 209A of the mux-reg 208A of NPU 126 instance J, and the output 209B of the mux-reg 208B of NPU J is provided both to the input 1811A of the mux-reg 208A of NPU 126 instance J+1 and to the input 211B of the mux-reg 208B of NPU 126 instance J+1.
The control input 213 controls which of the three inputs each of the mux-regs 208A/208B selects to store in its respective register and subsequently provide on its respective output 209A/209B. When the NPU 126 is instructed to load a row from the data RAM 122 (e.g., by the multiply-accumulate instruction at address 1 of Figure 20, as described below), regardless of whether the NPU 126 is in the wide or narrow configuration, the control input 213 controls each of the mux-regs 208A/208B to select a respective narrow data word 207A/207B (e.g., 8 bits) from the corresponding narrow word of the selected row of the data RAM 122.
When the NPU 126 is instructed to rotate the previously received data row values (e.g., by the multiply-accumulate rotate instruction at address 2 of Figure 20, as described below), if the NPU 126 is in the narrow configuration, the control input 213 controls each of the mux-regs 208A/208B to select the respective input 1811A/1811B. In this case, the mux-regs 208A/208B effectively operate individually, such that the NPU 126 is effectively two separate narrow NPUs. In this manner, the mux-regs 208A and 208B of the N NPUs 126 collectively operate as a 2N-narrow-word rotator, as described in more detail below with respect to Figure 19.
When the NPU 126 is instructed to rotate the previously received data row values, if the NPU 126 is in the wide configuration, the control input 213 controls each of the mux-regs 208A/208B to select the respective input 211A/211B. In this case, the mux-regs 208A/208B effectively operate together as a whole, as if the NPU 126 were a single wide NPU 126. In this manner, the mux-regs 208A and 208B of the N NPUs 126 collectively operate as an N-wide-word rotator, similarly to the manner described with respect to Figure 3.
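The two rotation modes can be illustrated with a toy list model. In the wide configuration each NPU's (A, B) pair moves as one wide word; in the narrow configuration every narrow word shifts by one position, forming a single 2N-word rotator. This is a behavioral sketch (rotation toward NPU J+1, i.e., each NPU takes from NPU J-1, per the wiring above), scaled down to 4 NPUs; it is not the hardware mux wiring itself.

```python
# Toy model of the mux-reg rotator: 4 NPUs x 2 narrow words = 8 narrow words.

def rotate_wide(words):
    """N-wide-word rotator: each NPU's (A, B) pair moves as a unit."""
    pairs = [tuple(words[i:i + 2]) for i in range(0, len(words), 2)]
    pairs = pairs[-1:] + pairs[:-1]      # NPU J takes the pair from NPU J-1
    return [w for p in pairs for w in p]

def rotate_narrow(words):
    """2N-narrow-word rotator: every narrow word moves by one position."""
    return words[-1:] + words[:-1]

words = list(range(8))
wide_result = rotate_wide(words)         # [6, 7, 0, 1, 2, 3, 4, 5]
narrow_result = rotate_narrow(words)     # [7, 0, 1, 2, 3, 4, 5, 6]
```

Note how the narrow mode interleaves across the A/B halves: mux-reg B takes from mux-reg A of the same NPU (input 1811B), while mux-reg A takes from mux-reg B of the previous NPU (input 1811A), which is exactly a shift by one narrow word.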
The ALU 204 includes the operand selection logic 1898, a wide multiplier 242A, a narrow multiplier 242B, a wide 2-input mux 1896A, a narrow 2-input mux 1896B, a wide adder 244A and a narrow adder 244B. Effectively, the ALU 204 comprises the operand selection logic 1898, a wide ALU 204A (comprising the wide multiplier 242A, the wide mux 1896A and the wide adder 244A) and a narrow ALU 204B (comprising the narrow multiplier 242B, the narrow mux 1896B and the narrow adder 244B). Preferably, the wide multiplier 242A multiplies two wide words and is similar to the multiplier 242 of Figure 2 (e.g., a 16-bit by 16-bit multiplier). The narrow multiplier 242B multiplies two narrow words (e.g., an 8-bit by 8-bit multiplier that generates a 16-bit result). When the NPU 126 is in the narrow configuration, the wide multiplier 242A, with the aid of the operand selection logic 1898, effectively serves as a narrow multiplier to multiply two narrow words, so that the NPU 126 effectively functions as two narrow NPUs. Preferably, the wide adder 244A adds the output of the wide mux 1896A and the output 217A of the wide accumulator 202A to generate a sum 215A for provision to the wide accumulator 202A, similarly to the adder 244 of Figure 2. The narrow adder 244B adds the output of the narrow mux 1896B and the output 217B of the narrow accumulator 202B to generate a sum 215B for provision to the narrow accumulator 202B. In one embodiment, the narrow accumulator 202B is 28 bits wide, to avoid loss of precision when accumulating up to 1024 16-bit products. When the NPU 126 is in the wide configuration, the narrow multiplier 242B, narrow mux 1896B, narrow adder 244B, narrow accumulator 202B and narrow AFU 212B are preferably inactive to reduce power consumption.
The operand selection logic 1898 selects operands from 209A, 209B, 203A and 203B to provide to the other elements of the ALU 204, as described in more detail below. Preferably, the operand selection logic 1898 also performs other functions, such as performing sign extension of signed data words and weight words. For example, if the NPU 126 is in the narrow configuration, the operand selection logic 1898 sign-extends a narrow data word and a narrow weight word to the width of a wide word before providing them to the wide multiplier 242A. Similarly, if the ALU 204 is instructed to pass through a narrow data/weight word (skipping the wide multiplier 242A via the wide mux 1896A), the operand selection logic 1898 sign-extends the narrow data/weight word to the width of a wide word before providing it to the wide adder 244A. Preferably, logic to perform the sign-extension function is also present in the ALU 204 of the NPU 126 of Figure 2.
The wide mux 1896A receives the output of the wide multiplier 242A and an operand from the operand selection logic 1898 and selects one of these inputs to provide to the wide adder 244A, and the narrow mux 1896B receives the output of the narrow multiplier 242B and an operand from the operand selection logic 1898 and selects one of these inputs to provide to the narrow adder 244B.
The operands provided by the operand selection logic 1898 depend upon the configuration of the NPU 126 and upon the arithmetic and/or logical operation being performed by the ALU 204 based on the function specified by the instruction being executed by the NPU 126. For example, if the instruction instructs the ALU 204 to perform a multiply-accumulate and the NPU 126 is in the wide configuration, the operand selection logic 1898 provides the wide word that is the concatenation of outputs 209A and 209B to one input of the wide multiplier 242A and the wide word that is the concatenation of outputs 203A and 203B to the other input, and the narrow multiplier 242B is inactive, so that the NPU 126 functions as a single wide NPU 126 similar to the NPU 126 of Figure 2. Whereas, if the instruction instructs the ALU 204 to perform a multiply-accumulate and the NPU 126 is in the narrow configuration, the operand selection logic 1898 provides an extended, or widened, version of the narrow data word 209A to one input of the wide multiplier 242A and an extended version of the narrow weight word 203A to the other input; additionally, the operand selection logic 1898 provides the narrow data word 209B to one input of the narrow multiplier 242B and the narrow weight word 203B to the other input. To extend, or widen, a narrow word, if the narrow word is signed, the operand selection logic 1898 sign-extends it; whereas if the narrow word is unsigned, the operand selection logic 1898 pads it with zero-valued upper bits.
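The widening step just described can be captured in a few lines. This is a minimal sketch of the sign-extend/zero-pad rule, assuming 8-bit narrow words widened to 16 bits as in the example widths above; the function name is illustrative.

```python
# Sketch of the operand-selection-logic extension rule: an 8-bit narrow word
# is widened to 16 bits by replicating the sign bit if signed, or by
# zero-filling the upper bits if unsigned.

def widen(narrow: int, is_signed: bool, narrow_bits: int = 8) -> int:
    """Return the 16-bit representation of a narrow word."""
    narrow &= (1 << narrow_bits) - 1
    if is_signed and narrow & (1 << (narrow_bits - 1)):
        return narrow | (0xFFFF ^ ((1 << narrow_bits) - 1))  # sign-extend
    return narrow                                            # zero-extend

signed_case = widen(0x80, is_signed=True)      # -128 -> 0xFF80
unsigned_case = widen(0x80, is_signed=False)   # 128  -> 0x0080
```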
For another example, if the NPU 126 is in the wide configuration and the instruction instructs the ALU 204 to perform an accumulate of the weight word, the wide multiplier 242A is skipped and the operand selection logic 1898 provides the concatenation of outputs 203A and 203B to the wide mux 1896A for provision to the wide adder 244A. Whereas, if the NPU 126 is in the narrow configuration and the instruction instructs the ALU 204 to perform an accumulate of the weight word, the wide multiplier 242A is skipped and the operand selection logic 1898 provides an extended version of the output 203A to the wide mux 1896A for provision to the wide adder 244A; and the narrow multiplier 242B is skipped and the operand selection logic 1898 provides an extended version of the output 203B to the narrow mux 1896B for provision to the narrow adder 244B.
For another example, if the NPU 126 is in the wide configuration and the instruction instructs the ALU 204 to perform an accumulate of the data word, the wide multiplier 242A is skipped and the operand selection logic 1898 provides the concatenation of outputs 209A and 209B to the wide mux 1896A for provision to the wide adder 244A. Whereas, if the NPU 126 is in the narrow configuration and the instruction instructs the ALU 204 to perform an accumulate of the data word, the wide multiplier 242A is skipped and the operand selection logic 1898 provides an extended version of the output 209A to the wide mux 1896A for provision to the wide adder 244A; and the narrow multiplier 242B is skipped and the operand selection logic 1898 provides an extended version of the output 209B to the narrow mux 1896B for provision to the narrow adder 244B. Accumulation of weight/data words can be helpful for performing averaging operations, such as are used in the pooling layers of certain artificial neural network applications, e.g., image processing.
Preferably, the NPU 126 also includes a second wide mux (not shown) for skipping the wide adder 244A, to facilitate loading the wide accumulator 202A with a wide data/weight word in the wide configuration or an extended narrow data/weight word in the narrow configuration, and a second narrow mux (not shown) for skipping the narrow adder 244B, to facilitate loading the narrow accumulator 202B with a narrow data/weight word in the narrow configuration. Preferably, the ALU 204 also includes wide and narrow comparator/mux combinations (not shown) that receive the respective accumulator values 217A/217B and the respective mux 1896A/1896B outputs to select the maximum value between the accumulator value 217A/217B and a data/weight word 209A/209B/203A/203B, an operation used in the pooling layers of certain artificial neural network applications, as described in more detail below, e.g., with respect to Figures 27 and 28. Additionally, the operand selection logic 1898 is configured to provide zero-valued operands (for addition with zero or for clearing the accumulators) and one-valued operands (for multiplication by one).
The narrow AFU 212B receives the output 217B of the narrow accumulator 202B and performs an activation function on it to generate a narrow result 133B, and the wide AFU 212A receives the output 217A of the wide accumulator 202A and performs an activation function on it to generate a wide result 133A. When the NPU 126 is in the narrow configuration, the wide AFU 212A considers the output 217A of the wide accumulator 202A accordingly and performs an activation function on it to generate a narrow result (e.g., 8 bits), as described in more detail below, e.g., with respect to Figures 29A through 30.
As may be observed from the above description, advantageously the single NPU 126 effectively operates as two narrow NPUs when in the narrow configuration, thus providing, for smaller words, approximately up to twice the throughput of the wide configuration. For example, assume a neural network layer having 1024 neurons, each receiving 1024 narrow inputs from the previous layer (and having narrow weight words), resulting in 1 Mega-connections. An NNU 121 with 512 narrow-configured NPUs 126 is capable of processing four times the number of connections (1 Mega-connections vs. 256K connections) in approximately twice the clocks (about 1026 clocks vs. 514 clocks), i.e., approximately half the time per connection, compared with an NNU 121 with 512 wide-configured NPUs 126, albeit on narrow words rather than wide words.
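The throughput comparison above can be worked through numerically. The figures follow the stated example (1024 neurons x 1024 narrow inputs; ~1026 clocks for a narrow-configuration pass, ~514 clocks for a wide-configuration pass covering 512 neurons x 512 connections); the interpretation of the wide pass as 512 x 512 = 256K connections is an assumption drawn from the parenthetical numbers in the text.

```python
# Worked arithmetic for the narrow-vs-wide throughput comparison.

neurons, inputs = 1024, 1024
connections = neurons * inputs                 # 1,048,576 (1 Mega-connections)

# Narrow configuration: 512 NPUs act as 1024 narrow NPUs; one pass of ~1024
# rotate steps (plus a couple of overhead clocks) covers all 1M connections.
narrow_clocks = 1026

# Wide configuration: one ~514-clock pass covers 512 x 512 = 256K connections.
wide_clocks = 514
wide_connections = 512 * 512                   # 262,144 (256K)

throughput_ratio = (connections / narrow_clocks) / (wide_connections / wide_clocks)
# throughput_ratio ~= 2.0: roughly double the connections per clock.
```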
In one embodiment, the dynamically configurable NPU 126 of Figure 18 includes 3-input mux-regs similar to the mux-regs 208A and 208B in place of the registers 205A and 205B, to implement a rotator for a row of weight words received from the weight RAM 124, somewhat similarly to the manner described with respect to the embodiment of Figure 7, but in the dynamically configurable fashion described with respect to Figure 18.
Referring now to Figure 19, a block diagram is shown illustrating an embodiment of the arrangement of the 2N mux-regs 208A/208B of the N NPUs 126 of the NNU 121 of Figure 1 according to the embodiment of Figure 18, to illustrate their operation as a rotator for a row of data words 207 received from the data RAM 122 of Figure 1. In the embodiment of Figure 19, N is 512, such that the NNU 121 has 1024 mux-regs 208A/208B, denoted 0 through 511, corresponding to the 512 NPUs 126 (effectively 1024 narrow NPUs), as shown. The two narrow NPUs within an NPU 126 are denoted A and B, and within each of the mux-regs 208 the designation of the corresponding narrow NPU is shown. More specifically, the mux-reg 208A of NPU 126 0 is designated 0-A, the mux-reg 208B of NPU 126 0 is designated 0-B, the mux-reg 208A of NPU 126 1 is designated 1-A, the mux-reg 208B of NPU 126 1 is designated 1-B, the mux-reg 208A of NPU 126 511 is designated 511-A, and the mux-reg 208B of NPU 126 511 is designated 511-B; these designations also correspond to the narrow NPUs of Figure 21 described below.
Each multiplexing register 208A receives its corresponding narrow data word 207A of one of the D rows of the data RAM 122, and each multiplexing register 208B receives its corresponding narrow data word 207B of one of the D rows of the data RAM 122. That is, multiplexing register 0A receives narrow data word 0 of the data RAM 122 row, multiplexing register 0B receives narrow data word 1, multiplexing register 1A receives narrow data word 2, multiplexing register 1B receives narrow data word 3, and so on, through multiplexing register 511A, which receives narrow data word 1022, and multiplexing register 511B, which receives narrow data word 1023. Additionally, multiplexing register 1A receives on its input 211A the output 209A of multiplexing register 0A, multiplexing register 1B receives on its input 211B the output 209B of multiplexing register 0B, and so on, through multiplexing register 511A, which receives on its input 211A the output 209A of multiplexing register 510A, and multiplexing register 511B, which receives on its input 211B the output 209B of multiplexing register 510B; and multiplexing register 0A receives on its input 211A the output 209A of multiplexing register 511A, and multiplexing register 0B receives on its input 211B the output 209B of multiplexing register 511B. Finally, multiplexing register 1A receives on its input 1811A the output 209B of multiplexing register 0B, multiplexing register 1B receives on its input 1811B the output 209A of multiplexing register 1A, and so on, through multiplexing register 511A, which receives on its input 1811A the output 209B of multiplexing register 510B, and multiplexing register 511B, which receives on its input 1811B the output 209A of multiplexing register 511A; and multiplexing register 0A receives on its input 1811A the output 209B of multiplexing register 511B, and multiplexing register 0B receives on its input 1811B the output 209A of multiplexing register 0A. Each of the multiplexing registers 208A/208B receives a control input 213 that controls whether to select the data word 207A/207B, the rotated input 211A/211B, or the rotated input 1811A/1811B. As described in more detail below, in one mode of operation, on a first clock cycle, the control input 213 controls each of the multiplexing registers 208A/208B to select the data word 207A/207B for storage in the register and subsequent provision to the ALU 204; and on subsequent clock cycles (e.g., M-1 clock cycles as described above), the control input 213 controls each of the multiplexing registers 208A/208B to select the rotated input 1811A/1811B for storage in the register and subsequent provision to the ALU 204.
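For illustration only, the three selections made by the control input 213 may be modeled in software as follows. This is a minimal sketch, not the described hardware; the function name, the mode strings, and the reduced scale of N = 4 NPUs (8 narrow registers) are assumptions introduced for the example.

```python
# Illustrative model of the 2N narrow multiplexing registers of Figure 19.
# Each register can load its data RAM narrow word (213 selects 207A/207B),
# take the rotated input 211A/211B (a shift by two narrow words, as in the
# wide configuration), or the rotated input 1811A/1811B (a shift by one
# narrow word, as in the narrow configuration).

def step(regs, data_row=None, select="ROTATE_NARROW"):
    """Return the next state of the 2N narrow registers after one clock."""
    n2 = len(regs)                       # 2N narrow registers: 0A,0B,1A,1B,...
    if select == "LOAD":                 # select data word 207A/207B
        return list(data_row)
    if select == "ROTATE_WIDE":          # select input 211A/211B
        return [regs[(i - 2) % n2] for i in range(n2)]
    if select == "ROTATE_NARROW":        # select input 1811A/1811B
        return [regs[(i - 1) % n2] for i in range(n2)]
    raise ValueError(select)

# Example with N = 4 NPUs (8 narrow registers) instead of 512:
regs = step([0] * 8, data_row=list(range(8)), select="LOAD")
regs = step(regs)   # narrow rotate: register 1-A now holds 0-B's prior word
```

Rotating by one narrow position models the 1811A/1811B wiring (e.g., register 0A receiving the output of register 511B), while rotating by two narrow positions models the 211A/211B wiring used when the NPUs operate as N wide registers.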
Referring now to Figure 20, a table is shown illustrating a program stored in the program memory 129 of the NNU 121 of Fig. 1 and executed by the NNU 121, where the NNU 121 includes NPUs 126 according to the embodiment of Figure 18. The exemplary program of Figure 20 is similar in many respects to the program of Fig. 4. However, differences are described below. The initialize NPU instruction at address 0 specifies that the NPUs 126 are to be in the narrow configuration. Additionally, as shown, the multiply-accumulate rotate instruction at address 2 specifies a count of 1023 and requires 1023 clock cycles. This is because the example of Figure 20 assumes a layer of effectively 1024 narrow (e.g., 8-bit) neurons (NPUs), each narrow neuron having 1024 connection inputs from a previous layer of 1024 neurons, for a total of 1024K connections. Each neuron receives an 8-bit data value from each connection input and multiplies the 8-bit data value by an appropriate 8-bit weight value.
Referring now to Figure 21, a timing diagram is shown illustrating the execution by the NNU 121 of the program of Figure 20, where the NNU 121 includes the NPUs 126 of Figure 18 operating in the narrow configuration. The timing diagram of Figure 21 is similar in many respects to the timing diagram of Fig. 5; however, differences are described below.
In the timing diagram of Figure 21, the NPUs 126 are in the narrow configuration because the initialize NPU instruction at address 0 initializes them to the narrow configuration. Consequently, the 512 NPUs 126 effectively operate as 1024 narrow NPUs (or neurons), which are designated in the columns as NPU 0-A and NPU 0-B (the two narrow NPUs of NPU 126 0), NPU 1-A and NPU 1-B (the two narrow NPUs of NPU 126 1), and so on through NPU 511-A and NPU 511-B (the two narrow NPUs of NPU 126 511). For simplicity and clarity of illustration, only the operations of narrow NPUs 0-A, 0-B, and 511-B are shown. Because the multiply-accumulate rotate instruction at address 2 specifies a count of 1023, which requires 1023 clock cycles, the rows of the timing diagram of Figure 21 include up to clock cycle 1026.
At clock 0, each of the 1024 NPUs executes the initialization instruction of Fig. 4, which is illustrated in Fig. 5 by the assignment of zero to the accumulator 202.
At clock 1, each of the 1024 narrow NPUs executes the multiply-accumulate instruction at address 1 of Figure 20. As shown, narrow NPU 0-A accumulates the product of narrow word 0 of row 17 of the data RAM 122 and narrow word 0 of row 0 of the weight RAM 124 with the accumulator 202A value (i.e., zero); narrow NPU 0-B accumulates the product of narrow word 1 of row 17 of the data RAM 122 and narrow word 1 of row 0 of the weight RAM 124 with the accumulator 202B value (i.e., zero); and so on, through narrow NPU 511-B, which accumulates the product of narrow word 1023 of row 17 of the data RAM 122 and narrow word 1023 of row 0 of the weight RAM 124 with the accumulator 202B value (i.e., zero).
At clock 2, each of the 1024 narrow NPUs executes the first iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow NPU 0-A accumulates with the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the output 209B of the multiplexing register 208B of narrow NPU 511-B (i.e., narrow data word 1023 received from the data RAM 122) and narrow word 0 of row 1 of the weight RAM 124; narrow NPU 0-B accumulates with the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the output 209A of the multiplexing register 208A of narrow NPU 0-A (i.e., narrow data word 0 received from the data RAM 122) and narrow word 1 of row 1 of the weight RAM 124; and so on, through narrow NPU 511-B, which accumulates with the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the output 209A of the multiplexing register 208A of narrow NPU 511-A (i.e., narrow data word 1022 received from the data RAM 122) and narrow word 1023 of row 1 of the weight RAM 124.
At clock 3, each of the 1024 narrow NPUs executes the second iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow NPU 0-A accumulates with the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the output 209B of the multiplexing register 208B of narrow NPU 511-B (i.e., narrow data word 1022 received from the data RAM 122) and narrow word 0 of row 2 of the weight RAM 124; narrow NPU 0-B accumulates with the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the output 209A of the multiplexing register 208A of narrow NPU 0-A (i.e., narrow data word 1023 received from the data RAM 122) and narrow word 1 of row 2 of the weight RAM 124; and so on, through narrow NPU 511-B, which accumulates with the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the output 209A of the multiplexing register 208A of narrow NPU 511-A (i.e., narrow data word 1021 received from the data RAM 122) and narrow word 1023 of row 2 of the weight RAM 124. As indicated by the ellipsis of Figure 21, this continues for each of the following 1021 clock cycles, through clock 1024.
At clock 1024, each of the 1024 narrow NPUs executes the 1023rd iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow NPU 0-A accumulates with the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the output 209B of the multiplexing register 208B of narrow NPU 511-B (i.e., narrow data word 1 received from the data RAM 122) and narrow word 0 of row 1023 of the weight RAM 124; narrow NPU 0-B accumulates with the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the output 209A of the multiplexing register 208A of NPU 0-A (i.e., narrow data word 2 received from the data RAM 122) and narrow word 1 of row 1023 of the weight RAM 124; and so on, through narrow NPU 511-B, which accumulates with the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the output 209A of the multiplexing register 208A of NPU 511-A (i.e., narrow data word 0 received from the data RAM 122) and narrow word 1023 of row 1023 of the weight RAM 124.
At clock 1025, the AFU 212A/212B of each of the 1024 narrow NPUs executes the activation function instruction at address 3 of Figure 20. Finally, at clock 1026, each of the 1024 narrow NPUs executes the write AFU output instruction at address 4 of Figure 20 by writing back its narrow result 133A/133B to its corresponding narrow word of row 16 of the data RAM 122; that is, the narrow result 133A of NPU 0-A is written to narrow word 0 of the data RAM 122, the narrow result 133B of NPU 0-B is written to narrow word 1 of the data RAM 122, and so on, through the narrow result 133 of NPU 511-B, which is written to narrow word 1023 of the data RAM 122. The operation described above with respect to Figure 21 is also shown in block diagram form in Figure 22.
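The multiply-accumulate and rotate behavior traced clock by clock above can be restated as a reduced-scale software model. This is an illustrative sketch under stated assumptions (the function name is invented, and W narrow NPUs stand in for 1024); it is not the patent's hardware or microcode.

```python
# Illustrative model of the Figure 20/21 program: clock 1 performs the first
# multiply-accumulate on the loaded data RAM row; each subsequent clock rotates
# the data words by one narrow NPU position and accumulates them against the
# next weight RAM row, so every narrow NPU sees every data word exactly once.

def run_narrow_layer(data_row, weight_rows):
    W = len(data_row)
    acc = [0] * W                  # accumulators 202A/202B, cleared at clock 0
    regs = list(data_row)          # multiplexing registers load the data words
    for w_row in weight_rows:
        for j in range(W):
            acc[j] += regs[j] * w_row[j]
        regs = [regs[(j - 1) % W] for j in range(W)]  # rotate one narrow word
    return acc

# With W = 4 and all-ones weights, each accumulator sums all four data words:
print(run_narrow_layer([1, 2, 3, 4], [[1, 1, 1, 1]] * 4))  # → [10, 10, 10, 10]
```

After W weight rows, accumulator j holds the sum over rows r of data[(j - r) mod W] times weights[r][j], mirroring the way narrow NPU 0-A receives word 1023 at clock 2, word 1022 at clock 3, and so on.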
Referring now to Figure 22, a block diagram is shown illustrating the NNU 121 of Fig. 1, including the NPUs 126 of Figure 18, executing the program of Figure 20. The NNU 121 includes the 512 NPUs 126, i.e., 1024 narrow NPUs, the data RAM 122 receiving its address input 123, and the weight RAM 124 receiving its address input 125. Although not shown, at clock 0 the 1024 narrow NPUs execute the initialization instruction of Figure 20. As shown, at clock 1, the 1024 8-bit data words of row 17 are read out of the data RAM 122 and provided to the 1024 narrow NPUs. At clocks 1 through 1024, the 1024 8-bit weight words of rows 0 through 1023, respectively, are read out of the weight RAM 124 and provided to the 1024 narrow NPUs. Although not shown, at clock 1 the 1024 narrow NPUs perform their respective multiply-accumulate operations on the loaded data words and weight words. At clocks 2 through 1024, the multiplexing registers 208A/208B of the 1024 narrow NPUs operate as a 1024 8-bit word rotator to rotate the previously loaded data words of row 17 of the data RAM 122 to the adjacent narrow NPU, and the narrow NPUs perform the multiply-accumulate operation on the respective rotated data word and the respective narrow weight word loaded from the weight RAM 124. Although not shown, at clock 1025 the 1024 narrow AFUs 212A/212B execute the activation instruction. At clock 1026, the 1024 narrow NPUs write back their respective 1024 8-bit results 133A/133B to row 16 of the data RAM 122.
As may be observed, the embodiment of Figure 18 may be advantageous over the embodiment of Fig. 2, for example, because it provides flexibility for the programmer to perform computations using wide data and weight words (e.g., 16 bits) when the particular application being modeled demands that amount of precision, and using narrow data and weight words (e.g., 8 bits) when that amount of precision suffices for the application. From one perspective, for narrow-data applications, the embodiment of Figure 18 may provide twice the throughput of the embodiment of Fig. 2 at the cost of the additional narrow elements (e.g., multiplexing register 208B, register 205B, narrow ALU 204B, narrow accumulator 202B, narrow AFU 212B), which increase the area of the NPU 126 by approximately 50%.
Tri-Mode NPU
Referring now to Figure 23, a block diagram is shown illustrating the dynamically configurable NPU 126 of Fig. 1 according to an alternate embodiment. The NPU 126 of Figure 23 is configurable not only in the wide and narrow configurations, but also in a third configuration, referred to herein as the "funnel" configuration. The NPU 126 of Figure 23 is similar in many respects to the NPU 126 of Figure 18. However, the wide adder 244A of Figure 18 is replaced in the NPU 126 of Figure 23 by a 3-input wide adder 2344A that receives a third addend 2399, which is an extended version of the output of the narrow multiplexer 1896B. A program for operating an NNU 121 having the NPUs 126 of Figure 23 is similar in many respects to the program of Figure 20. However, the initialize NPU instruction at address 0 initializes the NPUs 126 to the funnel configuration rather than the narrow configuration. Additionally, the count of the multiply-accumulate rotate instruction at address 2 is 511 rather than 1023.
In the funnel configuration, the NPU 126 operates similarly to when in the narrow configuration executing a multiply-accumulate instruction such as at address 1 of Figure 20 in the following respects: it receives two narrow data words 207A/207B and two narrow weight words 206A/206B; the wide multiplier 242A multiplies data word 209A and weight word 203A to generate the product 246A selected by the wide multiplexer 1896A; and the narrow multiplier 242B multiplies data word 209B and weight word 203B to generate the product 246B selected by the narrow multiplexer 1896B. However, the wide adder 2344A adds both the product 246A (selected by the wide multiplexer 1896A) and the product 246B/2399 (selected by the wide multiplexer 1896B) to the wide accumulator 202A value 217A, while the narrow adder 244B and narrow accumulator 202B are inactive. Furthermore, when in the funnel configuration executing a multiply-accumulate rotate instruction such as at address 2 of Figure 20, the control input 213 causes the multiplexing registers 208A/208B to rotate by two narrow words (e.g., 16 bits); that is, the multiplexing registers 208A/208B select their respective inputs 211A/211B, just as in the wide configuration. However, the wide multiplier 242A multiplies data word 209A and weight word 203A to generate the product 246A selected by the wide multiplexer 1896A; the narrow multiplier 242B multiplies data word 209B and weight word 203B to generate the product 246B selected by the narrow multiplexer 1896B; and the wide adder 2344A adds both the product 246A (selected by the wide multiplexer 1896A) and the product 246B/2399 (selected by the wide multiplexer 1896B) to the wide accumulator 202A value 217A, while the narrow adder 244B and narrow accumulator 202B are inactive, as described above. Finally, when in the funnel configuration executing an activation function instruction such as at address 3 of Figure 20, the wide AFU 212A performs the activation function on the resulting sum 215A to generate a narrow result 133A, while the narrow AFU 212B is inactive. Hence, only the narrow NPUs denoted A generate a narrow result 133A, and the narrow results 133B generated by the narrow NPUs denoted B are invalid. Consequently, the written-back row of results (e.g., row 16, as indicated by the instruction at address 4 of Figure 20) contains holes, since only the narrow results 133A are valid and the narrow results 133B are invalid. Thus, conceptually, each clock cycle each neuron (NPU 126 of Figure 23) processes two connection data inputs, i.e., multiplies two narrow data words by their respective weights and accumulates the two products, in contrast to the embodiments of Fig. 2 and Figure 18, in which each neuron processes a single connection data input per clock cycle.
As may be observed with respect to the embodiment of Figure 23, the number of result words (neuron outputs) produced and written back to the data RAM 122 or weight RAM 124 is half the square root of the number of data inputs (connections) received, and the written-back row of results has holes, i.e., every other narrow word result is invalid; more specifically, the narrow NPU results denoted B are not meaningful. Thus, the embodiment of Figure 23 may be particularly efficient for neural networks having two successive layers in which, for example, the first layer has twice as many neurons as the second layer (e.g., a first layer of 1024 neurons fully connected to a second layer of 512 neurons). Furthermore, the other execution units 112 (e.g., media units, such as x86 AVX units) may, if necessary, perform a pack operation on a disperse (i.e., having holes) row of results to make it compact (i.e., without holes) for use in subsequent computations, while the NNU 121 performs other computations associated with other rows of the data RAM 122 and/or the weight RAM 124.
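The funnel-configuration behavior, and the pack operation that removes the resulting holes, can be illustrated with a reduced-scale software sketch. The names and the scale (W narrow words instead of 1024) are assumptions for illustration only, not the patent's hardware.

```python
# Illustrative model of the funnel configuration of Figure 23: each wide NPU
# adds two narrow products to its single wide accumulator per step via the
# 3-input wide adder 2344A, the registers rotate by two narrow words, and only
# the "A" result of each NPU is valid, leaving holes a pack step can remove.

def run_funnel_layer(data_row, weight_rows):
    W = len(data_row)                  # 2N narrow words; N = W // 2 wide NPUs
    acc = [0] * (W // 2)               # wide accumulators 202A only
    regs = list(data_row)
    for w_row in weight_rows:
        for npu in range(W // 2):
            a, b = 2 * npu, 2 * npu + 1
            acc[npu] += regs[a] * w_row[a] + regs[b] * w_row[b]
        regs = [regs[(j - 2) % W] for j in range(W)]  # rotate two narrow words
    return acc

def pack(row_with_holes):
    """Model of a pack operation discarding the invalid 'B' result slots."""
    return row_with_holes[0::2]

# Two wide NPUs each accumulate all four data words over two weight rows:
print(run_funnel_layer([1, 2, 3, 4], [[1, 1, 1, 1]] * 2))  # → [10, 10]
```

Note that only half as many weight rows (the count of 511 rather than 1023) are needed, since each step consumes two connections per neuron.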
Hybrid NNU Operation: Convolution and Pooling Capabilities
An advantage of the NNU 121 according to the embodiments described herein is that the NNU 121 is capable of concurrently operating in a fashion similar to a coprocessor, in that it executes its own internal program, and in a fashion similar to an execution unit of a processor, in that it executes architectural instructions (or microinstructions translated therefrom) issued to it. The architectural instructions are of an architectural program performed by the processor that includes the NNU 121. In this manner, the NNU 121 operates in a hybrid fashion, which is advantageous because it provides the ability to sustain high utilization of the NNU 121. For example, Figures 24 through 26 illustrate an operation in which the NNU 121 performs a convolution operation, in which the NNU is highly utilized, and Figures 27 through 28 illustrate an operation in which the NNU 121 performs a pooling operation. Convolution operations and pooling operations are required by convolution layers, pooling layers, and other digital data computing applications, such as image processing (e.g., edge detection, sharpening, blurring, recognition/classification). However, the hybrid operation of the NNU 121 is not limited to performing convolution or pooling operations; rather, the hybrid feature may be used to perform other operations, such as the classic neural network multiply-accumulate and activation function operations described above with respect to Figs. 4 through 13. That is, the processor 100 (more specifically, the reservation stations 108) issues MTNN instructions 1400 and MFNN instructions 1500 to the NNU 121, in response to which the NNU 121 writes data to the memories 122/124/129 and reads results from the memories 122/124 written to by the NNU 121, while concurrently the NNU 121 reads and writes the memories 122/124/129 in response to executing the program written to the program memory 129 by the processor 100 (via MTNN instructions 1400).
Referring now to Figure 24, a block diagram is shown illustrating an example of data structures used by the NNU 121 of Fig. 1 to perform a convolution operation. The block diagram includes a convolution kernel 2402, a data array 2404, and the data RAM 122 and weight RAM 124 of Fig. 1. Preferably, the data array 2404 (e.g., of image pixels) is held in system memory (not shown) attached to the processor 100 and is loaded into the weight RAM 124 of the NNU 121 by the processor 100 executing MTNN instructions 1400. A convolution operation is an operation that convolves a first matrix with a second matrix, where the second matrix is referred to herein as the convolution kernel. As understood in the context of the present disclosure, a convolution kernel is a matrix of coefficients, which may also be referred to as weights, parameters, elements, or values. Preferably, the convolution kernel 2402 is static data of the architectural program being executed by the processor 100.
The data array 2404 is a two-dimensional array of data values, and each data value (e.g., an image pixel value) is the size of a word of the data RAM 122 or weight RAM 124 (e.g., 16 bits or 8 bits). In this example, the data values are 16-bit words, and the NNU 121 is configured as 512 wide-configuration NPUs 126. Additionally, in an embodiment, the NPUs 126 include multiplexing registers (such as multiplexing register 705 of Fig. 7) for receiving the weight words 206 from the weight RAM 124, in order to collectively operate as a rotator on a row of data values received from the weight RAM 124, as described in more detail below. In this example, the data array 2404 is a 2560-column × 1600-row pixel array. As shown, when the architectural program convolves the data array 2404 with the convolution kernel 2402, it breaks the data array 2404 into 20 chunks, each chunk being a 512 × 400 data matrix 2406.
In this example, the convolution kernel 2402 is a 3 × 3 matrix of coefficients, weights, parameters, or elements. The first row of coefficients is denoted C0,0, C0,1, and C0,2; the second row is denoted C1,0, C1,1, and C1,2; and the third row is denoted C2,0, C2,1, and C2,2. For example, a convolution kernel that may be used to perform edge detection has the following coefficients: 0, 1, 0, 1, -4, 1, 0, 1, 0. For another example, a convolution kernel that may be used to Gaussian-blur an image has the following coefficients: 1, 2, 1, 2, 4, 2, 1, 2, 1. In this case, a divide is typically performed on the final accumulated value, where the divisor is the sum of the absolute values of the elements of the convolution kernel 2402, which in this example is 16. For another example, the divisor is the number of elements of the convolution kernel 2402. For another example, the divisor is a value that compresses the convolution back into a desired range of values, and the divisor is determined from the values of the elements of the convolution kernel 2402, the desired range, and the range of the input values of the matrix being convolved.
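For illustration, the convolution-plus-divisor normalization described above may be rendered as plain software. This is an assumed sketch (function name and the tiny constant image are invented for the example), not the NNU program itself, and it ignores the edge handling discussed later.

```python
# Illustrative 3x3 convolution with divisor normalization, as described for
# the example kernels (edge detection and Gaussian blur).

def convolve3x3(image, kernel, divisor):
    rows, cols = len(image), len(image[0])
    out = [[0] * (cols - 2) for _ in range(rows - 2)]
    for x in range(rows - 2):
        for y in range(cols - 2):
            s = sum(kernel[i][j] * image[x + i][y + j]
                    for i in range(3) for j in range(3))
            out[x][y] = s // divisor   # scale the accumulated sum back into range
    return out

gaussian = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]   # sum of |elements| = 16
flat = [[8] * 4 for _ in range(4)]             # a constant 4x4 sample "image"
print(convolve3x3(flat, gaussian, 16))         # → [[8, 8], [8, 8]]
```

Dividing by 16 (the sum of the absolute values of the Gaussian kernel's elements) leaves a constant image unchanged, as expected of a blur.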
As shown in Figure 24 and described in more detail with respect to Figure 25, the architectural program writes the coefficients of the convolution kernel 2402 to the data RAM 122. Preferably, all the words of each of nine consecutive rows (the number of elements in the convolution kernel 2402) of the data RAM 122 are written with a different element of the convolution kernel 2402 in row-major order. That is, as shown, each word of one row is written with the first coefficient C0,0; the next row is written with the second coefficient C0,1; the next row is written with the third coefficient C0,2; the next row is written with the fourth coefficient C1,0; and so on, until each word of the ninth row is written with the ninth coefficient C2,2. To convolve a data matrix 2406 of a chunk of the data array 2404, the NPUs 126 repeatedly read, in order, the nine rows of the data RAM 122 that hold the coefficients of the convolution kernel 2402, as described in more detail below, particularly with respect to Figure 26.
As shown in Figure 24 and described in more detail with respect to Figure 25, the architectural program writes the values of a data matrix 2406 to the weight RAM 124. As the NNU program performs the convolution, it writes the result matrix back to the weight RAM 124. Preferably, as described in more detail with respect to Figure 25, the architectural program writes a first data matrix 2406 to the weight RAM 124 and starts the NNU 121, and while the NNU 121 is convolving the first data matrix 2406 with the convolution kernel 2402, the architectural program writes a second data matrix 2406 to the weight RAM 124, so that as soon as the NNU 121 completes the convolution of the first data matrix 2406, it can begin convolving the second data matrix 2406. In this manner, the architectural program ping-pongs back and forth between the two regions of the weight RAM 124 in order to keep the NNU 121 fully utilized. Thus, the example of Figure 24 shows a first data matrix 2406A, corresponding to a first chunk occupying rows 0 through 399 of the weight RAM 124, and a second data matrix 2406B, corresponding to a second chunk occupying rows 500 through 899 of the weight RAM 124. Furthermore, as shown, the NNU 121 writes the convolution results back to rows 900 through 1299 and 1300 through 1699 of the weight RAM 124, and the architectural program subsequently reads these results out of the weight RAM 124. The data values of a data matrix 2406 held in the weight RAM 124 are denoted "Dx,y", where "x" is the weight RAM 124 row number and "y" is the word, or column, number of the weight RAM 124. Thus, for example, data word 511 in row 399 is denoted D399,511 in Figure 24, and is received by the multiplexing register 705 of NPU 511.
Referring now to Figure 25, a flowchart is shown illustrating operation of the processor 100 of Fig. 1 to perform an architectural program that uses the NNU 121 to perform a convolution of the convolution kernel 2402 with the data array 2404 of Figure 24. Flow begins at block 2502.
At block 2502, the processor 100, i.e., the architectural program running on the processor 100, writes the convolution kernel 2402 of Figure 24 to the data RAM 122 in the manner shown and described with respect to Figure 24. Additionally, the architectural program initializes a variable N to a value of 1. The variable N denotes the current chunk of the data array 2404 being processed by the NNU 121. Additionally, the architectural program initializes a variable NUM_CHUNKS to a value of 20. Flow proceeds to block 2504.
At block 2504, the processor 100 writes the data matrix 2406 for chunk 1 to the weight RAM 124, as shown in Figure 24 (e.g., the data matrix 2406A of chunk 1). Flow proceeds to block 2506.
At block 2506, the processor 100 writes a convolution program to the program memory 129 of the NNU 121 using MTNN instructions 1400 that specify a function 1432 to write the program memory 129. The processor 100 then starts the NNU convolution program using an MTNN instruction 1400 that specifies a function 1432 to start execution of the program. An example of the NNU convolution program is described in more detail with respect to Figure 26A. Flow proceeds to decision block 2508.
At decision block 2508, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to block 2512; otherwise, flow proceeds to block 2514.
At block 2512, the processor 100 writes the data matrix 2406 for chunk N+1 to the weight RAM 124, as shown in Figure 24 (e.g., the data matrix 2406B of chunk 2). Thus, advantageously, the architectural program writes the data matrix 2406 for the next chunk to the weight RAM 124 while the NNU 121 is performing the convolution on the current chunk, so that the NNU 121 can immediately begin performing the convolution on the next chunk once the convolution of the current chunk is complete, i.e., written to the weight RAM 124. Flow proceeds to block 2514.
At block 2514, the processor 100 determines that the currently running NNU program (started at block 2506 in the case of chunk 1, and at block 2518 in the case of chunks 2 through 20) has completed. Preferably, the processor 100 makes this determination by executing an MFNN instruction 1500 to read the status register 127 of the NNU 121. In an alternate embodiment, the NNU 121 generates an interrupt to indicate that it has completed the convolution program. Flow proceeds to decision block 2516.
At decision block 2516, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to block 2518; otherwise, flow proceeds to block 2522.
At block 2518, the processor 100 updates the convolution program so that it can convolve chunk N+1. More specifically, the processor 100 updates the weight RAM 124 row value of the initialize NPU instruction at address 0 to the first row of the data matrix 2406 (e.g., to row 0 for data matrix 2406A or to row 500 for data matrix 2406B) and updates the output row (e.g., to row 900 or 1300). The processor 100 then starts the updated NNU convolution program. Flow proceeds to block 2522.
At block 2522, the processor 100 reads the results of the NNU convolution program for chunk N from the weight RAM 124. Flow proceeds to decision block 2524.
At decision block 2524, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to block 2526; otherwise, flow ends.
At block 2526, the architectural program increments N by one. Flow returns to decision block 2508.
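The flow of blocks 2502 through 2526 may be restated schematically in software to highlight the ping-pong overlap between writing the next chunk and convolving the current one. All helper names here are invented for illustration; the stub device merely records the order of operations and is not the patent's API.

```python
# Schematic restatement of the Figure 25 architectural-program flow using a
# stub device: chunk N+1 is written to the weight RAM before waiting for the
# program convolving chunk N to finish, keeping the NNU utilized.

NUM_CHUNKS = 4   # 20 in the example of Figure 24; reduced here

class StubNNU:
    """Stand-in device that records the order of operations performed on it."""
    def __init__(self):
        self.log = []
    def write_data_ram(self, what):    self.log.append(("data_ram", what))
    def write_weight_ram(self, chunk): self.log.append(("write", chunk))
    def start(self):                   self.log.append(("start",))
    def wait_until_done(self):         self.log.append(("wait",))
    def read_results(self, chunk):     self.log.append(("read", chunk)); return chunk

def convolve_array(nnu, kernel, chunks):
    nnu.write_data_ram(kernel)           # block 2502: kernel into data RAM
    nnu.write_weight_ram(chunks[0])      # block 2504: chunk 1 into weight RAM
    nnu.start()                          # block 2506: start convolution program
    results = []
    for n in range(1, NUM_CHUNKS + 1):   # blocks 2508-2526
        if n < NUM_CHUNKS:
            nnu.write_weight_ram(chunks[n])  # block 2512: overlap with convolution
        nnu.wait_until_done()                # block 2514: poll status register
        if n < NUM_CHUNKS:
            nnu.start()                      # block 2518: updated row addresses
        results.append(nnu.read_results(n))  # block 2522: read chunk N results
    return results

nnu = StubNNU()
results = convolve_array(nnu, "kernel2402", [1, 2, 3, 4])
# The log shows chunk 2 being written before the wait for chunk 1's program.
```

Reading chunk N's results after starting the next program is safe in this flow because the result regions alternate (e.g., rows 900 through 1299 versus 1300 through 1699 of the weight RAM 124).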
Referring now to Figure 26A, a program listing of an NNU program is shown that performs a convolution of a data matrix 2406 with the convolution kernel 2402 of Figure 24 and writes it back to the weight RAM 124. The program loops a number of times through a loop body of instructions at addresses 1 through 9. An initialize NPU instruction at address 0 specifies the number of times each NPU 126 executes the loop body, which in the example of Figure 26A has a loop count value of 400, corresponding to the number of rows in a data matrix 2406 of Figure 24, and a loop instruction at the end of the loop (at address 10) decrements the current loop count value and, if the result is non-zero, causes control to return to the top of the loop body (i.e., to the instruction at address 1). The initialize NPU instruction also clears the accumulator 202 to zero. Preferably, the loop instruction at address 10 also clears the accumulator 202 to zero. Alternatively, as described above, the multiply-accumulate instruction at address 1 may specify that the accumulator 202 is to be cleared to zero.
For each execution of the loop body of the program, the 512 NPUs 126 concurrently perform 512 convolutions of the 3 × 3 convolution kernel 2402 and 512 respective 3 × 3 sub-matrices of a data matrix 2406. The convolution is the sum of the nine products of an element of the convolution kernel 2402 and its corresponding element of the respective sub-matrix. In the embodiment of Figure 26A, the origin (center element) of each of the 512 respective 3 × 3 sub-matrices is the data word Dx+1,y+1 of Figure 24, where y (the column number) is the NPU 126 number, and x (the row number) is the current weight RAM 124 row number read by the multiply-accumulate instruction at address 1 of the program of Figure 26A (this row number is also initialized by the initialize NPU instruction at address 0, incremented at each of the multiply-accumulate instructions at addresses 3 and 5, and updated by the decrement instruction at address 9). Thus, for each loop of the program, the 512 NPUs 126 compute 512 convolutions and write the 512 convolution results back to the specified row of the weight RAM 124. In this description, edge handling is omitted for simplicity, although it should be noted that the use of the collective rotating feature of the NPUs 126 will cause wrapping for two of the columns of the data matrix 2406 from one vertical edge to the other (e.g., from the left edge to the right edge, or vice versa, in the case of image processing). The loop body will now be described.
Address 1 is a multiply-accumulate instruction that specifies row 0 of the data RAM 122 and implicitly uses the current weight RAM 124 row, which is preferably held in the sequencer 128 (and initialized to zero by the instruction at address 0 for the first pass through the loop body). That is, the instruction at address 1 causes each of the NPUs 126 to read its corresponding word from row 0 of the data RAM 122, to read its corresponding word from the current weight RAM 124 row, and to perform a multiply-accumulate operation on the two words. Thus, for example, NPU 5 multiplies C0,0 and Dx,5 (where "x" is the current weight RAM 124 row), adds the result to the accumulator 202 value 217, and writes the sum back to the accumulator 202.
Address 2 is a multiply-accumulate instruction that specifies that the data RAM 122 row is to be incremented (i.e., to row 1) and that the row is then to be read at the post-incremented address of the data RAM 122. The instruction also specifies that the value in the multiplexing register 705 of each NPU 126 is to be rotated to the adjacent NPU 126, which in this case is the row of data matrix 2406 values just read from the weight RAM 124 in response to the instruction at address 1. In the embodiment of Figures 24 through 26, the NPUs 126 are configured to rotate the multiplexing register 705 values to the left, i.e., from NPU J to NPU J-1, rather than from NPU J to NPU J+1 as described above with respect to Figs. 3, 7, and 19. It should be understood that, in an embodiment in which the NPUs 126 are configured to rotate to the right, the architectural program may write the coefficient values of the convolution kernel 2402 to the data RAM 122 in a different order (e.g., rotated about its center column) in order to accomplish a similar convolution result. Furthermore, the architectural program may perform additional pre-processing of the convolution kernel 2402 (e.g., transposition) as needed. Additionally, the instruction specifies a count value of 2. Thus, the instruction at address 2 causes each of the NPUs 126 to read its corresponding word from row 1 of the data RAM 122, to receive the rotated word into the multiplexing register 705, and to perform a multiply-accumulate operation on the two words. Because the count value is 2, the instruction also causes each of the NPUs 126 to repeat the foregoing operation. That is, the sequencer 128 increments the data RAM 122 row address 123 (i.e., to row 2), and each NPU 126 reads its corresponding word from row 2 of the data RAM 122, receives the rotated word into the multiplexing register 705, and performs a multiply-accumulate operation on the two words.
Thus, for example, assuming the current weight RAM 124 row is 27, after executing the instruction at address 2, NPU 5 will have accumulated into its accumulator 202 the product of C0,1 and D27,6 and the product of C0,2 and D27,7. Thus, after the completion of the instructions at addresses 1 and 2, the product of C0,0 and D27,5, the product of C0,1 and D27,6, and the product of C0,2 and D27,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body.
The instructions at addresses 3 and 4 perform operations similar to those of the instructions at addresses 1 and 2; however, by virtue of the weight RAM 124 row increment indicator, they operate on the next row of the weight RAM 124 and on the next three rows of the data RAM 122 (i.e., rows 3 through 5). That is, with respect to NPU 5, for example, after the completion of the instructions at addresses 1 through 4, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, and the product of C1,2 and D28,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body.
The instructions at addresses 5 and 6 perform operations similar to those of the instructions at addresses 3 and 4; however, they operate on the next row of the weight RAM 124 and on the next three rows of the data RAM 122 (i.e., rows 6 through 8). That is, with respect to NPU 5, for example, after the completion of the instructions at addresses 1 through 6, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, the product of C1,2 and D28,7, the product of C2,0 and D29,5, the product of C2,1 and D29,6, and the product of C2,2 and D29,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body. That is, after the completion of the instructions at addresses 1 through 6, and assuming the weight RAM 124 row at the beginning of the loop body was 27, NPU 5, for example, will have used the convolution kernel 2402 to convolve the following 3×3 submatrix:
More generally, after completing the instruction at address 1 to 6, each NPU 126 in 512 NPU 126 uses volume Product core 2402 has carried out convolution to following 3 × 3 submatrix:
where r is the weight RAM 124 row address value at the beginning of the loop body, and n is the number of the NPU 126.
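The accumulation performed by the loop body (addresses 1 through 6) can be sketched in software as follows. This is purely an illustrative model, not the hardware: the function and variable names are invented, and the rotation of the multiplexing registers 705 (which wraps around the 512-NPU array) is modeled here as simple adjacent-column indexing, ignoring edge wraparound.

```python
# Illustrative model of the 3x3 convolution accumulated by NPU n over one
# pass of the loop body. C is the 3x3 convolution kernel (data RAM rows
# 0-2); D is the data matrix in the weight RAM; r is the weight RAM row
# at the start of the loop body.
def npu_accumulator(C, D, r, n):
    acc = 0
    for i in range(3):          # weight RAM rows r, r+1, r+2
        for j in range(3):      # word n, then rotated words n+1, n+2
            acc += C[i][j] * D[r + i][n + j]
    return acc

# Tiny example: an all-ones 3x3 kernel over a 4x5 data matrix.
C = [[1, 1, 1] for _ in range(3)]
D = [[c + 10 * r for c in range(5)] for r in range(4)]
print(npu_accumulator(C, D, r=0, n=1))  # sum of D[0..2][1..3] = 108
```

With an all-ones kernel the accumulator is simply the sum of the 3×3 submatrix, which makes the row/column indexing of the model easy to check against the product list above.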
The instruction at address 7 passes the accumulator 202 value 217 through the AFU 212. The pass-through function passes a word that is the size (in bits, i.e., 16 in this example) of the words read from the data RAM 122 and the weight RAM 124. Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail below. Alternatively, rather than specifying a pass-through activation function, the instruction specifies a divide activation function that divides the accumulator 202 value 217 by a divisor, e.g., using one of the "dividers" 3014/3016 of Figure 30, as described herein with respect to Figures 29A and 30. For example, in the case of a convolution kernel 2402 having a coefficient, such as the one-sixteenth coefficient of the Gaussian blur kernel described above, the activation function instruction at address 7 may specify a divide activation function (e.g., divide by 16), rather than a pass-through function. Alternatively, the architectural program may perform the divide-by-16 on the convolution kernel 2402 coefficients before writing them to the data RAM 122 and adjust the location of the binary point for the convolution kernel 2402 values accordingly, e.g., using the data binary point 2922 of Figure 29A, described below.
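The two alternatives in the paragraph above produce the same result, which the following sketch illustrates with a hypothetical 3×3 Gaussian blur kernel whose coefficients sum to 16 (the kernel values and patch values here are invented for illustration; the hardware would perform the option-1 divide in the AFU 212):

```python
# Option 1: accumulate with integer coefficients, then apply a divide
# activation function (divide by 16) to the accumulator value.
GAUSS = [[1, 2, 1],
         [2, 4, 2],
         [1, 2, 1]]                      # coefficients sum to 16
patch = [[8, 8, 8], [8, 8, 8], [8, 8, 8]]

acc = sum(GAUSS[i][j] * patch[i][j] for i in range(3) for j in range(3))
out1 = acc / 16

# Option 2: the architectural program pre-divides the coefficients by 16
# (adjusting the binary point) before writing them to the data RAM.
scaled = [[c / 16 for c in row] for row in GAUSS]
out2 = sum(scaled[i][j] * patch[i][j] for i in range(3) for j in range(3))

print(out1, out2)  # both 8.0 for a constant patch of 8s
```

Option 2 trades a one-time pre-processing pass by the architectural program for a simpler pass-through activation function in the NNU program.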
The instruction at address 8 writes the AFU 212 output to the row of the weight RAM 124 specified by the current value of the output row register, which was initialized by the instruction at address 0 and which is incremented on each pass through the loop by virtue of the increment indicator in the instruction.
As may be determined from the example of Figures 24 through 26 with a 3×3 convolution kernel 2402, the NPUs 126 read the weight RAM 124 approximately every three clock cycles to read a row of the data matrix 2406, and write the convolution result matrix to the weight RAM 124 approximately every 12 clock cycles. Additionally, assuming an embodiment that includes write and read buffers such as the buffer 1704 of Figure 17, concurrently with the NPU 126 reads and writes, the processor 100 reads and writes the weight RAM 124 such that the buffer 1704 performs one write and one read of the weight RAM 124 approximately every 16 clock cycles, to write the data matrices 2406 and to read the convolution result matrices, respectively. Thus, approximately half of the bandwidth of the weight RAM 124 is consumed by the hybrid manner in which the NNU 121 performs the convolution kernel operation. Although this example includes a 3×3 convolution kernel 2402, other size convolution kernels may be employed, such as 2×2, 4×4, 5×5, 6×6, 7×7 or 8×8 matrices, in which case the NNU program will vary. In the case of a larger convolution kernel, the NPUs 126 read the weight RAM 124 a smaller percentage of the time, and therefore a smaller percentage of the bandwidth of the weight RAM 124 is consumed, because the counts of the rotating versions of the multiply-accumulate instructions are larger (e.g., at addresses 2, 4 and 6 of the program of Figure 26A, together with the additional instructions a larger convolution kernel would require).
Alternatively, rather than writing the convolution results back to different rows of the weight RAM 124 (e.g., rows 900-1299 and 1300-1699), the architectural program configures the NNU program to overwrite rows of the input data matrix 2406 once those rows are no longer needed. For example, in the case of a 3×3 convolution kernel, the architectural program writes the data matrix 2406 to rows 2 through 401 of the weight RAM 124, rather than to rows 0 through 399, and the NPU program is configured to write the convolution results to the weight RAM 124 beginning at row 0 and incrementing on each pass through the loop body. In this manner, the NNU program overwrites only rows that are no longer needed. For example, after the first pass through the loop body (or more specifically, after the execution of the instruction at address 1 that loads row 0 of the weight RAM 124), the data in row 0 may be overwritten, although the data in rows 1 through 3 is needed for the second pass through the loop body and is therefore not overwritten by the first pass; similarly, after the second pass through the loop body, the data in row 1 may be overwritten, although the data in rows 2 through 4 is needed for the third pass and is therefore not overwritten by the second pass; and so forth. In such an embodiment, the height of each data matrix 2406 (chunk) may be larger (e.g., 800 rows), resulting in fewer chunks.
Alternatively, rather than writing the convolution results back to the weight RAM 124, the architectural program configures the NNU program to write the convolution results back to rows of the data RAM 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program reads the results from the data RAM 122 as the NNU 121 writes them (e.g., using the address of the most recently written data RAM 122 row 2606 of Figure 26B, described below). This alternative may be advantageous in an embodiment in which the weight RAM 124 is single-ported and the data RAM 122 is dual-ported.
As may be observed from the operation of the NNU 121 according to the embodiment of Figures 24 through 26A, each execution of the program of Figure 26A takes approximately 5000 clock cycles, and consequently the convolution of the entire 2560×1600 data array 2404 of Figure 24 takes approximately 100,000 clock cycles, considerably fewer than the number of clock cycles required to perform the same task by conventional methods.
Referring now to Figure 26B, a block diagram illustrating certain fields of the status register 127 of the NNU 121 of Figure 1 according to one embodiment is shown. The status register 127 includes: a field 2602 that indicates the address of the row of the weight RAM 124 most recently written by the NPUs 126; a field 2606 that indicates the address of the row of the data RAM 122 most recently written by the NPUs 126; a field 2604 that indicates the address of the row of the weight RAM 124 most recently read by the NPUs 126; and a field 2608 that indicates the address of the row of the data RAM 122 most recently read by the NPUs 126. This enables the architectural program executing on the processor 100 to determine the progress of the NNU 121 as it reads from and/or writes to the data RAM 122 and/or the weight RAM 124. Employing this capability, along with the choice to overwrite the input data matrix as described above (or to write the results to the data RAM 122 as described above), the data array 2404 of Figure 24 may be processed as 5 chunks of 512×1600 rather than 20 chunks of 512×400, for example, as follows. The processor 100 writes the first 512×1600 chunk into the weight RAM 124 beginning at row 2 and starts the NNU program (which has a loop count of 1600 and an initialized weight RAM 124 output row of 0). As the NNU 121 executes the NNU program, the processor 100 monitors the location/address of the weight RAM 124 output in order to (1) read (using MFNN instructions 1500) the rows of the weight RAM 124 that have valid convolution results written by the NNU 121 (beginning at row 0), and (2) write the second 512×1600 data matrix 2406 over the valid convolution results once they have been read (beginning at row 2), so that when the NNU 121 completes the NNU program on the first 512×1600 chunk, the processor 100 can immediately update the NNU program as needed and start it again to process the second 512×1600 chunk. This process is repeated three more times for the remaining three 512×1600 chunks to achieve high utilization of the NNU 121.
Advantageously, in one embodiment, the AFU 212 includes the ability to efficiently perform an effective division of the accumulator 202 value 217, as described in more detail below, particularly with respect to Figures 29A, 29B and 30. For example, an activation function NNU instruction that divides the accumulator 202 value 217 by 16 may be used for the Gaussian blur matrix described above.
Although the convolution kernel 2402 used in the example of Figure 24 is a small static convolution kernel applied to the entire data array 2404, in other embodiments the convolution kernel may be a large matrix that has unique weights associated with the different data values of the data array 2404, such as is commonly found in convolutional neural networks. When the NNU 121 is used in this manner, the architectural program may swap the locations of the data matrix and the convolution kernel, i.e., place the data matrix in the data RAM 122 and the convolution kernel in the weight RAM 124, and a relatively smaller number of rows may be processed by a given execution of the NNU program.
Referring now to Figure 27, a block diagram illustrating an example of the weight RAM 124 of Figure 1 populated with input data upon which a pooling operation is performed by the NNU 121 of Figure 1 is shown. The pooling operation, performed by a pooling layer of an artificial neural network, reduces the dimensions of a matrix of input data (e.g., an image or a convolved image) by taking subregions, or submatrices, of the input matrix and computing the maximum or average value of the submatrices, and the maximum or average values become a result, or pooled, matrix. In the example of Figures 27 and 28, the pooling operation computes the maximum value of each submatrix. Pooling operations are particularly useful in artificial neural networks that perform object classification or detection, for example. Generally, a pooling operation effectively reduces the size of the input matrix by a factor of the number of elements in the submatrix examined, and in particular reduces the input matrix in each dimension by the number of elements in the corresponding dimension of the submatrix. In the example of Figure 27, the input data is a 512×1600 matrix of wide words (e.g., 16 bits) stored in rows 0 through 1599 of the weight RAM 124. In Figure 27, the words are denoted by their row and column location; e.g., the word in row 0 and column 0 is denoted D0,0; the word in row 0 and column 1 is denoted D0,1; the word in row 0 and column 2 is denoted D0,2; and so forth to the word in row 0 and column 511, which is denoted D0,511. Similarly, the word in row 1 and column 0 is denoted D1,0; the word in row 1 and column 1 is denoted D1,1; the word in row 1 and column 2 is denoted D1,2; and so forth to the word in row 1 and column 511, which is denoted D1,511; and so forth to the word in row 1599 and column 0, which is denoted D1599,0; the word in row 1599 and column 1 is denoted D1599,1; the word in row 1599 and column 2 is denoted D1599,2; and so forth to the word in row 1599 and column 511, which is denoted D1599,511.
Referring now to Figure 28, a program listing of an NNU program that performs a pooling operation on the input data matrix of Figure 27 and writes it back to the weight RAM 124 is shown. In the example of Figure 28, the pooling operation computes the maximum value of each 4×4 submatrix of the input data matrix. The program loops a number of times through the loop body of the instructions at addresses 1 through 10. The initialize NPU instruction at address 0 specifies the number of times each NPU 126 executes the loop body, which in the example of Figure 28 has a loop count value of 400, and the loop instruction at the end of the loop (at address 11) decrements the current loop count value and, if the result is non-zero, causes control to return to the top of the loop body (i.e., to the instruction at address 1). The input data matrix in the weight RAM 124 is effectively treated by the NNU program as 400 mutually exclusive groups of four adjacent rows, namely rows 0-3, rows 4-7, rows 8-11, and so forth to rows 1596-1599. Each group of four adjacent rows includes 128 4×4 submatrices, namely the 4×4 submatrices of elements formed by the intersection of the four rows of a group and four adjacent columns, namely columns 0-3, columns 4-7, columns 8-11, and so forth to columns 508-511. Of the 512 NPUs 126, every fourth NPU 126 (i.e., 128 of them) performs a pooling operation on a corresponding 4×4 submatrix, and the other three-fourths of the NPUs 126 are unused.
More specifically, NPUs 0, 4, 8, and so forth to NPU 508 each perform a pooling operation on their corresponding 4×4 submatrix, whose leftmost column number corresponds to the NPU number and whose lower row corresponds to the current weight RAM 124 row value, which is initialized to zero by the initialize instruction at address 0 and incremented by four upon each iteration of the loop body, as described in more detail below. The 400 iterations of the loop body correspond to the number of groups of 4×4 submatrices of the input data matrix of Figure 27 (i.e., the 1600 rows of the input data matrix divided by 4). The initialize NPU instruction also clears the accumulator 202 to zero. Preferably, the loop instruction at address 11 also clears the accumulator 202 to zero. Alternatively, the maxwacc instruction at address 1 specifies that the accumulator 202 be cleared to zero.
For each iteration of the loop body of the program, the 128 used NPUs 126 concurrently perform 128 pooling operations on the 128 respective 4×4 submatrices of the current four-row group of the input data matrix. More specifically, the pooling operation determines the maximum-valued element of the 16 elements of the 4×4 submatrix. In the embodiment of Figure 28, for each NPU y of the 128 used NPUs 126, the lower-left element of its 4×4 submatrix is element Dx,y of Figure 27, where x is the current weight RAM 124 row number at the beginning of the loop body, which is read by the maxwacc instruction at address 1 of the program of Figure 28 (the row number is also initialized by the initialize NPU instruction at address 0, and is incremented upon each execution of the maxwacc instructions at addresses 3, 5 and 7). Thus, for each loop of the program, the 128 used NPUs 126 write back to the specified row of the weight RAM 124 the respective maximum-valued elements of the respective 128 4×4 submatrices of the current row group. The loop body is described below.
At address 1 is a maxwacc instruction that implicitly uses the current weight RAM 124 row, which is preferably maintained in the sequencer 128 (and initialized to zero by the instruction at address 0 for the first pass through the loop body). The instruction at address 1 causes each NPU 126 to read its corresponding word from the current row of the weight RAM 124, to compare the word to the accumulator 202 value 217, and to store in the accumulator 202 the maximum of the two values. Thus, for example, NPU 8 determines the maximum of the accumulator 202 value 217 and data word Dx,8 (where "x" is the current weight RAM 124 row) and writes the maximum value back to the accumulator 202.
At address 2 is a maxwacc instruction that specifies that the value in the multiplexing register 705 of each NPU 126 be rotated to the adjacent NPU 126, which in this case is the row of input data matrix values just read from the weight RAM 124 in response to the instruction at address 1. In the embodiment of Figures 27 and 28, the NPUs 126 are configured to rotate the multiplexing register 705 values to the left, i.e., from NPU J to NPU J-1, as described above with respect to Figures 24 through 26. Additionally, the instruction specifies a count value of 3. Thus, the instruction at address 2 causes each NPU 126 to receive the rotated word into the multiplexing register 705 and to determine the maximum of the rotated word and the accumulator 202 value 217, and then to repeat this operation two more times. That is, each NPU 126 three times receives the rotated word into the multiplexing register 705 and determines the maximum of the rotated word and the accumulator 202 value 217. Thus, for example, assuming the current weight RAM 124 row at the beginning of the loop body is 36, taking NPU 8 as an example, after the execution of the instructions at addresses 1 and 2, NPU 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the beginning of the loop body and of the four weight RAM 124 words D36,8, D36,9, D36,10 and D36,11.
The maxwacc instructions at addresses 3 and 4 perform operations similar to those of the instructions at addresses 1 and 2; however, by virtue of the weight RAM 124 row increment indicator, they operate on the next row of the weight RAM 124. That is, assuming the current weight RAM 124 row at the beginning of the loop body is 36, taking NPU 8 as an example, after the completion of the instructions at addresses 1 through 4, NPU 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the beginning of the loop body and of the eight weight RAM 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10 and D37,11.
The maxwacc instructions at addresses 5 through 8 perform operations similar to those of the instructions at addresses 3 and 4; however, they operate on the next two rows of the weight RAM 124. That is, assuming the current weight RAM 124 row at the beginning of the loop body is 36, taking NPU 8 as an example, after the completion of the instructions at addresses 1 through 8, NPU 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the beginning of the loop body and of the sixteen weight RAM 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10, D37,11, D38,8, D38,9, D38,10, D38,11, D39,8, D39,9, D39,10 and D39,11. That is, under the same assumption, after the completion of the instructions at addresses 1 through 8, NPU 8 will have determined the maximum value of the following 4×4 submatrix:
More specifically, after completing the instruction at address 1 to 8, each NPU 126 in used 128 NPU 126 will be true The maximum value of fixed following 4 × 4 submatrixs:
where r is the weight RAM 124 row address value at the beginning of the loop body, and n is the number of the NPU 126.
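The effect of the program of Figure 28 can be sketched as follows. This is an illustrative software model under simplifying assumptions (invented names; the rotation of the multiplexing registers 705 is modeled as direct adjacent-column indexing, and all 400 row groups are processed by a plain loop rather than by 128 NPUs in parallel):

```python
# Illustrative model: every fourth NPU n computes the maximum of the 4x4
# submatrix whose leftmost column is n and whose first row is r (the
# weight RAM row at the start of the loop body).
def npu_max_pool(D, r, n):
    return max(D[r + i][n + j] for i in range(4) for j in range(4))

def pool_matrix(D):
    # One output row per group of four input rows; one output value per
    # used NPU, i.e., per group of four adjacent columns.
    return [[npu_max_pool(D, r, n) for n in range(0, len(D[0]), 4)]
            for r in range(0, len(D), 4)]

D = [[(r * 7 + c * 3) % 23 for c in range(8)] for r in range(8)]
P = pool_matrix(D)
print(len(P), len(P[0]))  # 2 2 -- each dimension reduced by a factor of 4
```

The model makes the dimension reduction explicit: an 8×8 input yields a 2×2 pooled result, just as the 512×1600 input of Figure 27 yields a 128×400 result.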
The instruction at address 9 passes the accumulator 202 value 217 through the AFU 212. The pass-through function passes a word that is the size (in bits, i.e., 16 in this example) of the words read from the weight RAM 124. Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail below.
The instruction at address 10 writes the accumulator 202 value 217 to the row of the weight RAM 124 specified by the current value of the output row register, which was initialized by the instruction at address 0 and which is incremented on each pass through the loop body by virtue of the increment indicator in the instruction. More specifically, the instruction at address 10 writes a wide word (e.g., 16 bits) of the accumulator 202 to the weight RAM 124. Preferably, the instruction writes the 16 bits as specified by the output binary point 2916, as described in more detail below with respect to Figures 29A and 29B.
As may be observed, each row written to the weight RAM 124 by an iteration of the loop body includes holes that contain invalid data. That is, wide words 1 through 3, 5 through 7, 9 through 11, and so forth to wide words 509 through 511 of the result 133 are invalid, or unused. In one embodiment, the AFU 212 includes a multiplexer that enables packing of the results into adjacent words of a row buffer, such as the row buffer 1104 of Figure 11, for writing back to the output weight RAM 124 row. Preferably, the activation function instruction specifies the number of words in each hole, and the number of words in the hole is used to control the multiplexer to pack the results. In one embodiment, the number of holes may be specified as a value from 2 to 6 in order to pack the output of pooling of 3×3, 4×4, 5×5, 6×6 or 7×7 submatrices. Alternatively, the architectural program executing on the processor 100 reads the resulting sparse (i.e., having holes) result rows from the weight RAM 124 and performs the packing function using other execution units 112, such as a media unit that uses architectural pack instructions, e.g., x86 SSE instructions. Advantageously, in a concurrent manner similar to those described above and exploiting the hybrid nature of the NNU 121, the architectural program executing on the processor 100 may read the status register 127 to monitor the most recently written row of the weight RAM 124 (e.g., field 2602 of Figure 26B) in order to read a resulting sparse row, pack it, and write it back to the same row of the weight RAM 124, such that it is ready to be used as an input data matrix for a next layer of the neural network, such as a convolution layer or a classic neural network layer (i.e., a multiply-accumulate layer). Furthermore, although the embodiment described herein performs pooling operations on 4×4 submatrices, the NNU program of Figure 28 may be modified to perform pooling operations on submatrices of other sizes, such as 3×3, 5×5, 6×6 or 7×7 submatrices.
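The packing step can be sketched as a simple selection of every (hole + 1)-th word of a sparse result row; this is a software illustration of what the multiplexer (or an architectural pack instruction) accomplishes, with invented names and a toy row length:

```python
# Pack a sparse result row: keep only the valid words, which occur every
# (hole + 1) positions; hole = 3 for 4x4 pooling, since three of every
# four NPUs are unused.
def pack_row(row, hole):
    return [w for i, w in enumerate(row) if i % (hole + 1) == 0]

sparse = [9, 0, 0, 0, 7, 0, 0, 0, 5, 0, 0, 0]  # holes of 3 invalid words
print(pack_row(sparse, hole=3))  # [9, 7, 5]
```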
As may also be observed, the number of result rows written to the weight RAM 124 is one-fourth the number of rows of the input data matrix. Finally, in this example, the data RAM 122 is not used. However, alternatively, the data RAM 122, rather than the weight RAM 124, may be used to perform the pooling operation.
In the example of Figures 27 and 28, the pooling operation computes the maximum value of the subregion. However, the program of Figure 28 may be modified to compute the average value of the subregion by, for example, replacing the maxwacc instructions with sumwacc instructions (which sum the weight word with the accumulator 202 value 217) and changing the activation function instruction at address 9 to divide the accumulated results (preferably via reciprocal multiplication, as described below) by the number of elements of each subregion, which is 16 in this example.
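The average-pooling variant described above can be sketched as follows; the function name and data are invented, and the division is shown as a reciprocal multiply (multiply by 1/16), as the text suggests the hardware preferably performs it:

```python
# Average pooling: sumwacc-style accumulation over the 4x4 submatrix,
# followed by a reciprocal multiply by 1/(size*size) in place of the
# pass-through activation function.
def npu_avg_pool(D, r, n, size=4):
    total = sum(D[r + i][n + j] for i in range(size) for j in range(size))
    return total * (1.0 / (size * size))   # reciprocal multiply by 1/16

D = [[4] * 8 for _ in range(8)]
print(npu_avg_pool(D, 0, 0))  # 4.0 for a constant submatrix
```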
As may be observed from the operation of the NNU 121 according to the embodiment of Figures 27 and 28, each execution of the program of Figure 28 takes approximately 6000 clock cycles to perform a pooling operation on the entire 512×1600 data matrix of Figure 27, which may be considerably fewer than the number of clock cycles required to perform a similar task by conventional methods.
Alternatively, rather than writing the results back to the weight RAM 124, the architectural program configures the NNU program to write the results of the pooling operation back to rows of the data RAM 122, and the architectural program reads the results from the data RAM 122 as the NNU 121 writes them (e.g., using the address of the most recently written data RAM 122 row 2606 of Figure 26B). This alternative may be advantageous in an embodiment in which the weight RAM 124 is single-ported and the data RAM 122 is dual-ported.
Fixed-Point Arithmetic with User-Supplied Binary Point, Full-Precision Fixed-Point Accumulation, User-Specified Reciprocal Value, Stochastic Rounding of Accumulator Value, and Selectable Activation/Output Functions
Generally speaking, hardware units that perform arithmetic in digital computing systems are commonly divided into "integer" units and "floating-point" units, because they perform arithmetic operations on integer and floating-point numbers, respectively. A floating-point number has a magnitude (or mantissa) and an exponent, and typically a sign. The exponent is an indication of the location of the radix point (typically the binary point) with respect to the magnitude. By contrast, an integer number has no exponent, but only a magnitude, and typically a sign. An advantage of a floating-point unit is that it enables a programmer to work with numbers that may take on different values within an enormously large range, and the hardware takes care of adjusting the exponent values of the numbers as needed, without the programmer having to do so. For example, assume the two floating-point numbers 0.111×10^29 and 0.81×10^31 are multiplied. (A decimal, or base-10, example is used here, although floating-point units most commonly work with base-2 floating-point numbers.) The floating-point unit automatically takes care of multiplying the mantissas, adding the exponents, and then normalizing the result back to a value of .8991×10^59. For another example, assume the same two floating-point numbers are added. The floating-point unit automatically takes care of aligning the binary points of the mantissas before adding them, to generate a resulting sum with a value of .81111×10^31.
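The decimal multiplication example above can be checked step by step; the sketch below simply mirrors the mantissa-multiply, exponent-add, and normalize steps the floating-point unit performs automatically:

```python
# Multiply 0.111 x 10^29 by 0.81 x 10^31 the way a (decimal) floating-
# point unit would: multiply mantissas, add exponents, then normalize
# back to the form 0.xyz x 10^e.
m = 0.111 * 0.81          # mantissa product: 0.08991
e = 29 + 31               # exponent sum: 60
m, e = m * 10, e - 1      # normalize: .8991 x 10^59
print(round(m, 4), e)     # 0.8991 59
```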
However, the complexity, and the consequent increase in size, power consumption, clocks-per-instruction and/or lengthened cycle times, associated with floating-point units is well known. Indeed, for this reason many devices (e.g., embedded processors, microcontrollers, and relatively low-cost and/or low-power microprocessors) do not include a floating-point unit. As may be observed from the example above, some of the complexities of floating-point units include: logic that performs the exponent calculations associated with floating-point addition and multiplication/division (i.e., adders that perform addition/subtraction operations on the exponents of the operands to produce the resulting exponent value for a floating-point multiplication/division, and subtracters that subtract the exponents of the operands to determine the binary point alignment shift amount for a floating-point addition); shifters that accomplish the binary point alignment of the mantissas for a floating-point addition; and shifters that normalize floating-point results. Additionally, floating-point units typically require logic to perform rounding of floating-point results, logic to convert between integer and floating-point formats and between different floating-point precision formats (e.g., extended precision, double precision, single precision, half precision), leading-zero and leading-one detectors, and logic to deal with special floating-point numbers, such as denormal numbers, NaNs and infinity.
Additionally, there is the disadvantage that the correctness verification of a floating-point unit greatly adds to its complexity due to the increased numerical space that must be verified in the design, which may lengthen the product development cycle and the time to market. Still further, as described above, floating-point arithmetic implies the storage and use of separate mantissa and exponent fields for each floating-point number involved in the computation, which may increase the amount of storage required and/or reduce precision given an equal amount of storage for storing integer numbers. Many of these disadvantages are avoided by the use of integer units that perform arithmetic operations on integer numbers.
Frequently, programmers write programs that process fractional numbers, i.e., numbers that are not whole numbers. The programs may run on processors that do not have a floating-point unit or, if they do, the integer instructions executed by the integer units of the processor may be faster. To take advantage of the potential performance advantages associated with integer units, the programmer employs what is commonly known as fixed-point arithmetic on fixed-point numbers. Such programs include instructions that execute on integer units to process integer numbers, or integer data. The software is aware that the data is fractional and includes instructions that perform operations on the integer data to deal with the fact that the data is actually fractional, e.g., alignment shifts. Essentially, the fixed-point software manually performs some or all of the functions that a floating-point unit performs.
As used herein, a "fixed-point" number (or value or operand or input or output) is a number whose bits of storage are understood to include bits that represent a fractional portion of the fixed-point number, referred to herein as "fractional bits." The bits of storage of the fixed-point number are comprised in a memory or register, e.g., an 8-bit or 16-bit word in a memory or register. Furthermore, the bits of storage of the fixed-point number are all used to represent a magnitude, and in some cases one of the bits is used to represent a sign, but the fixed-point number has no bits of storage used to represent an exponent of the number. Furthermore, the number of fractional bits, or binary point location, of the fixed-point number is specified in storage that is distinct from the bits of storage of the fixed-point number, and that indicates the number of fractional bits, or binary point location, in a shared, or global, fashion for a set of fixed-point numbers to which the fixed-point number belongs, such as the set of input operands, the set of accumulated values, or the set of output results of an array of processing units, for example.
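The representation described above can be sketched as follows; this is a minimal software illustration, not the NNU hardware, and the 8-fractional-bit choice and helper names are invented for the example:

```python
# Sketch of a fixed-point set with a shared, global binary point: each
# number's storage bits hold only a (signed) magnitude, while the count
# of fractional bits is held once for the entire set.
FRAC_BITS = 8                              # shared for the whole set

def to_fixed(x):
    # Encode a fractional value into integer storage bits.
    return round(x * (1 << FRAC_BITS))

def to_float(q):
    # Interpret the integer storage bits using the shared binary point.
    return q / (1 << FRAC_BITS)

a, b = to_fixed(1.5), to_fixed(0.25)       # 384 and 64
# A plain integer add suffices: no exponent logic and no alignment shift
# is needed, because all members of the set share one binary point.
print(to_float(a + b))  # 1.75
```

Note that the add is an ordinary integer add; the only "floating-point-like" work (scaling by the binary point) happens at encode and decode time, once per set rather than per operation.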
Advantageously, in embodiments described herein, the ALUs are integer units, but the activation function units include fixed-point arithmetic hardware assist, or acceleration. This enables the ALU portions to be smaller and faster, which facilitates having more ALUs within a given space on the die. This implies more neurons per unit of die space, which is particularly advantageous in a neural network unit.
Furthermore, advantageously, in contrast to floating-point numbers, which require exponent storage bits for each floating-point number, embodiments are described in which fixed-point numbers are represented with an indication of the number of storage bits that are fractional bits for an entire set of numbers; however, the indication is located in a single, shared storage space that globally indicates the number of fractional bits for all the numbers of the entire set, e.g., the set of inputs of a series of operations, the set of accumulated values of a series of operations, the set of outputs. Preferably, the user of the NNU is able to specify the number of fractional storage bits for the set of numbers. Thus, it should be understood that although in many contexts (e.g., common mathematics) the term "integer" refers to a signed whole number, i.e., a number not having a fractional portion, in the present context the term "integer" may refer to numbers having a fractional portion. Furthermore, in the present context, the term "integer" is intended to distinguish from floating-point numbers for which a portion of the bits of their individual storage space are used to represent an exponent of the floating-point number. Similarly, an integer arithmetic operation, such as an integer multiply or add or compare performed by an integer unit, assumes the operands do not have an exponent and, therefore, the integer elements of the integer unit, e.g., integer multiplier, integer adder, integer comparator, do not include logic to deal with exponents, e.g., do not shift mantissas to align binary points for addition or compare operations, do not add exponents for multiply operations.
Additionally, embodiments described herein include a large hardware integer accumulator to accumulate a large series of integer operations (e.g., on the order of 1000 multiply-accumulates) without loss of accuracy. This enables the NNU to avoid dealing with floating-point numbers while at the same time retaining full precision in the accumulated values without saturating them or incurring inaccurate results due to overflows. Once the series of integer operations has accumulated a result into the full-precision accumulator, as described in more detail below, the fixed-point hardware assist performs the necessary scaling and saturating to convert the full-precision accumulated value to an output value, using the user-specified indications of the number of fractional bits of the accumulated value and the desired number of fractional bits of the output value.
As described in more detail below, preferably the activation function units may selectively perform stochastic rounding on the accumulator value when compressing it from its full-precision form for use as an input to an activation function or for being passed through. Finally, the NPUs may selectively receive indications to apply different activation functions and/or to output any of a variety of different forms of the accumulator value, as dictated by the different needs of a given layer of a neural network.
Referring now to Figure 29A, a block diagram illustrating an embodiment of the control register 127 of Figure 1 is shown. The control register 127 may comprise a plurality of control registers 127. The control register 127 includes the following fields, as shown: configuration 2902, signed data 2912, signed weight 2914, data binary point 2922, weight binary point 2924, ALU function 2926, round control 2932, activation function 2934, reciprocal 2942, shift amount 2944, output RAM 2952, output binary point 2954, and output command 2956. The control register 127 values may be written by both an MTNN instruction 1400 and an instruction of an NNU program, such as an initialize instruction.
The configuration 2902 value specifies whether the NNU 121 is in a narrow, wide, or funnel configuration, as described above. The configuration 2902 implies the size of the input words received from the data RAM 122 and the weight RAM 124. In the narrow and funnel configurations, the size of the input words is narrow (e.g., 8 bits or 9 bits), whereas in the wide configuration, the size of the input words is wide (e.g., 12 bits or 16 bits). Furthermore, the configuration 2902 implies the size of the output result 133, which is the same as the input word size.
The signed data value 2912, if true, indicates the data words received from the data RAM 122 are signed values and, if false, indicates they are unsigned values. The signed weight value 2914, if true, indicates the weight words received from the weight RAM 124 are signed values and, if false, indicates they are unsigned values.
The data binary point 2922 value indicates the location of the binary point for the data words received from the data RAM 122. Preferably, the data binary point 2922 value indicates the number of bit positions from the right for the location of the binary point. Stated alternatively, the data binary point 2922 indicates how many of the least significant bits of the data word are fractional bits, i.e., to the right of the binary point. Similarly, the weight binary point 2924 value indicates the location of the binary point for the weight words received from the weight RAM 124. Preferably, when the ALU function 2926 is a multiply-accumulate or output accumulator, the NPU 126 determines the number of bits to the right of the binary point for the value held in the accumulator 202 as the sum of the data binary point 2922 and the weight binary point 2924. Thus, for example, if the value of the data binary point 2922 is 5 and the value of the weight binary point 2924 is 3, then the value in the accumulator 202 has 8 bits to the right of the binary point. When the ALU function 2926 is a sum of accumulator and data/weight word, a maximum of accumulator and data/weight word, or a pass-through of the data/weight word, the NPU 126 determines the number of bits to the right of the binary point for the value held in the accumulator 202 as the data binary point 2922 or the weight binary point 2924, respectively. In an alternate embodiment, described below with respect to Figure 29B, a single accumulator binary point 2923 is specified, rather than individual data binary point 2922 and weight binary point 2924 values.
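The worked example in the paragraph above can be sketched as follows (an illustrative sketch only; the raw values chosen here are not from the specification). It shows why the product of two fixed-point operands carries a number of fractional bits equal to the sum of the operands' fractional-bit counts:

```python
# Sketch: multiplying fixed-point operands sums their fractional-bit counts,
# which is how the accumulator binary point is determined for multiply-accumulate.

DATA_FRAC = 5    # data binary point 2922
WEIGHT_FRAC = 3  # weight binary point 2924

data_raw = 48    # represents 48 / 2**5 = 1.5
weight_raw = 20  # represents 20 / 2**3 = 2.5

product_raw = data_raw * weight_raw   # 960, the raw integer product
ACC_FRAC = DATA_FRAC + WEIGHT_FRAC    # 8 fractional bits in the accumulator
# 960 / 2**8 = 3.75, which equals 1.5 * 2.5
```

Because 2^-5 × 2^-3 = 2^-8, scaling the raw integer product by 2^-8 recovers the real-valued product exactly.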
The ALU function 2926 specifies the function performed by the ALU 204 of the NPU 126. As described above, the ALU functions 2926 may include, but are not limited to: multiply the data word 209 and the weight word 203 and accumulate the product with the accumulator 202; sum the accumulator 202 and the weight word 203; sum the accumulator 202 and the data word 209; maximum of the accumulator 202 and the data word 209; maximum of the accumulator 202 and the weight word 203; output the accumulator 202; pass through the data word 209; pass through the weight word 203; output zero. In one embodiment, the ALU function 2926 is specified by an NNU initialize instruction and used by the ALU 204 in response to an execute instruction (not shown). In one embodiment, the ALU function 2926 is specified by individual NNU instructions, such as the multiply-accumulate and maxwacc instructions described above.
The round control 2932 specifies which form of rounding is to be used by the rounder 3004 (of Figure 30). In one embodiment, the rounding modes that may be specified include, but are not limited to: no rounding, round to nearest, and stochastic rounding. Preferably, the processor 100 includes a random bit source 3003 (of Figure 30) that generates random bits 3005 that are sampled and used to perform the stochastic rounding to reduce the likelihood of a rounding bias. In one embodiment, when the round bit 3005 is one and the sticky bit is zero, the NPU 126 rounds up if the sampled random bit 3005 is true and does not round up if the random bit 3005 is false. In one embodiment, the random bit source 3003 generates the random bits 3005 based on a sampling of random electrical characteristics of the processor 100, such as thermal noise across a semiconductor diode or resistor, although other embodiments are contemplated.
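The round-bit/sticky-bit tie-break described above can be sketched in software as follows. This is a hedged illustration, not the hardware's implementation: the exact tie-break policy beyond the stated round-bit-1/sticky-0 case is assumed to be round-to-nearest, and `random.getrandbits` stands in for the thermal-noise-based random bit source 3003.

```python
import random

# Sketch of stochastic rounding when discarding `shift` low-order bits:
# above the midpoint round up, below it truncate, and on an exact tie
# (round bit 1, sticky 0) let a sampled random bit decide, which
# reduces systematic rounding bias over many accumulations.

def stochastic_round_shift(value: int, shift: int, rng=random) -> int:
    truncated = value >> shift
    round_bit = (value >> (shift - 1)) & 1          # first discarded bit
    sticky = value & ((1 << (shift - 1)) - 1)       # OR of remaining discarded bits
    if round_bit and sticky:
        return truncated + 1                        # above midpoint: round up
    if round_bit and not sticky:
        return truncated + rng.getrandbits(1)       # exact tie: random up/down
    return truncated                                # below midpoint: truncate
```

For example, discarding two bits of `0b1011` (2.75 in units of 4) always rounds up to 3, while `0b1010` (exactly 2.5) rounds to 2 or 3 depending on the sampled random bit.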
The activation function 2934 specifies the function applied to the accumulator 202 value 217 to generate the output 133 of the NPU 126. As described above and in more detail below, the activation functions 2934 include, but are not limited to: sigmoid; hyperbolic tangent; softplus; rectify; divide by a specified power of two; multiply by a user-specified reciprocal value to accomplish an effective division; pass through the full accumulator; and pass through the accumulator as a canonical size, which is described in more detail below. In one embodiment, the activation function is specified by an NNU activation function instruction. Alternatively, the activation function is specified by the initialize instruction and applied in response to an output instruction, e.g., the write AFU output instruction at address 4 of Figure 4, in which embodiment the activation function instruction at address 3 of Figure 4 is subsumed by the output instruction.
The reciprocal 2942 value specifies a value that is multiplied by the accumulator 202 value 217 to accomplish a divide of the accumulator 202 value 217. That is, the user specifies the reciprocal 2942 value as the reciprocal of the actual desired divisor. This is useful, for example, in conjunction with convolution or pooling operations, as described herein. Preferably, the user specifies the reciprocal 2942 value in two parts, as described in more detail with respect to Figure 29C below. In one embodiment, the control register 127 includes a field (not shown) that enables the user to specify division by one of a plurality of built-in divisor values that are the size of commonly used convolution kernels, e.g., 9, 25, 36 or 49. In such an embodiment, the AFU 212 may store reciprocals of the built-in divisors for multiplication by the accumulator 202 value 217.
The shift amount 2944 specifies a number of bits by which a shifter of the AFU 212 shifts the accumulator 202 value 217 right to accomplish a divide by a power of two. This may also be useful in conjunction with convolution kernels whose size is a power of two.
The output RAM 2952 value specifies which of the data RAM 122 and the weight RAM 124 is to receive the output result 133.
The output binary point 2954 value indicates the location of the binary point for the output result 133. Preferably, the output binary point 2954 value indicates the number of bit positions from the right for the location of the binary point of the output result 133. Stated alternatively, the output binary point 2954 indicates how many of the least significant bits of the output result 133 are fractional bits, i.e., to the right of the binary point. The AFU 212 performs rounding, compression, saturation and size conversion based on the value of the output binary point 2954 (as well as, in most cases, based on the value of the data binary point 2922, the weight binary point 2924, the activation function 2934, and/or the configuration 2902).
The output command 2956 controls various aspects of the output result 133. In one embodiment, the AFU 212 employs the notion of a canonical size, which is twice the size (in bits) of the width specified by the configuration 2902. Thus, for example, if the configuration 2902 implies the size of the input words received from the data RAM 122 and the weight RAM 124 is 8 bits, then the canonical size is 16 bits; for another example, if the configuration 2902 implies the size of the input words received from the data RAM 122 and the weight RAM 124 is 16 bits, then the canonical size is 32 bits. As described herein, the size of the accumulator 202 is large (e.g., the narrow accumulator 202B is 28 bits and the wide accumulator 202A is 41 bits) in order to preserve full precision of the intermediate computations, e.g., 1024 and 512 NNU multiply-accumulate instructions, respectively. Consequently, the accumulator 202 value 217 is larger (in bits) than the canonical size, and the AFU 212 (e.g., the CCS 3008 described below with respect to Figure 30), for most values of the activation function 2934 (except the pass-through full accumulator), compresses the accumulator 202 value 217 down to a value that is the canonical size. A first predetermined value of the output command 2956 instructs the AFU 212 to perform the specified activation function 2934 to generate an internal result that is the same size as the original input words, i.e., half the canonical size, and to output the internal result as the output result 133. A second predetermined value of the output command 2956 instructs the AFU 212 to perform the specified activation function 2934 to generate an internal result that is twice the size of the original input words, i.e., the canonical size, and to output the lower half of the internal result as the output result 133; and a third predetermined value of the output command 2956 instructs the AFU 212 to output the upper half of the canonical-size internal result as the output result 133. A fourth predetermined value of the output command 2956 instructs the AFU 212 to output the raw least-significant word (whose size is specified by the configuration 2902) of the accumulator 202 as the output result 133, as described above with respect to Figures 8 through 10; a fifth predetermined value instructs the AFU 212 to output the raw middle-significant word of the accumulator 202 as the output result 133; and a sixth predetermined value instructs the AFU 212 to output the raw most-significant word of the accumulator 202 as the output result 133. As described above, outputting the full accumulator 202 size or the canonical-size internal result may be advantageous, for example, for enabling other execution units 112 of the processor 100 to perform activation functions, such as the softmax activation function.
Although the fields of Figure 29A (and Figures 29B and 29C) are described as residing in the control register 127, in other embodiments one or more of the fields may reside in other parts of the NNU 121. Preferably, many of the fields may be included in the NNU instructions themselves and decoded by the sequencer 128 to generate a micro-operation 3416 (of Figure 34) that controls the ALUs 204 and/or AFUs 212. Additionally, the fields may be included in a micro-operation 3414 (of Figure 34) stored in a media register 118 that controls the ALUs 204 and/or AFUs 212. In such embodiments, the use of the initialize NNU instruction is minimized, and in other embodiments the initialize NNU instruction is eliminated.
As described above, an NNU instruction may specify performing an ALU operation on memory operands (e.g., words from the data RAM 122 and/or the weight RAM 124) or on a rotated operand (e.g., from the mux-regs 208/705). In one embodiment, an NNU instruction may also specify an operand as the registered output of an activation function (e.g., the register output 3038 of Figure 30). Additionally, as described above, an NNU instruction may specify incrementing a current row address of the data RAM 122 or the weight RAM 124. In one embodiment, the NNU instruction may specify an immediate signed integer delta value that is added to the current row to accomplish incrementing or decrementing by a value other than one.
Referring now to Figure 29B, a block diagram illustrating an embodiment of the control register 127 of Figure 1 according to an alternate embodiment is shown. The control register 127 of Figure 29B is similar to the control register 127 of Figure 29A; however, the control register 127 of Figure 29B includes an accumulator binary point 2923. The accumulator binary point 2923 indicates the location of the binary point for the accumulator 202. Preferably, the accumulator binary point 2923 value indicates the number of bit positions from the right for the location of the binary point. Stated alternatively, the accumulator binary point 2923 indicates how many of the least significant bits of the accumulator 202 are fractional bits, i.e., to the right of the binary point. In this embodiment, the accumulator binary point 2923 is specified explicitly, rather than being determined implicitly as described above with respect to the embodiment of Figure 29A.
Referring now to Figure 29C, a block diagram illustrating an embodiment of the reciprocal 2942 of Figure 29A stored as two parts according to one embodiment is shown. The first part 2962 is a shift value that indicates the number 2962 of suppressed leading zeroes of the true reciprocal value that the user desires to be multiplied by the accumulator 202 value 217. The number of leading zeroes is the number of consecutive zeroes immediately to the right of the binary point. The second part 2964 is the leading-zero-suppressed reciprocal 2964 value, i.e., the true reciprocal value with all leading zeroes removed. In one embodiment, the suppressed leading zeroes 2962 number is stored as four bits and the leading-zero-suppressed reciprocal 2964 value is stored as an 8-bit unsigned value.
To illustrate by example, assume the user desires to multiply the accumulator 202 value 217 by the reciprocal of 49. The binary representation of the reciprocal of 49 represented with 13 fractional bits is 0.0000010100111, which has five leading zeroes. In this case, the user populates the suppressed leading zeroes 2962 number with a value of 5 and the leading-zero-suppressed reciprocal 2964 with a value of 10100111. After the reciprocal multiplier "divider A" 3014 (of Figure 30) multiplies the accumulator 202 value 217 and the leading-zero-suppressed reciprocal 2964 value, it right-shifts the resulting product by the suppressed leading zeroes 2962 number. Such an embodiment may advantageously accomplish high precision with a relatively small number of bits used to represent the reciprocal 2942 value.
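The 1/49 example above can be checked numerically with the following sketch. It is illustrative only: it folds the 8 fractional bits of the stored mantissa and the 5 suppressed zeroes into a single right shift of 13, which is arithmetically equivalent to the multiply-then-shift-by-5 sequence described for the hardware, and the rounding step is an assumption added for the demonstration.

```python
# Sketch of the two-part reciprocal example: 1/49 with 13 fractional bits is
# 0.0000010100111b, stored as 5 suppressed leading zeroes plus the 8-bit
# mantissa 10100111b. Multiplying by the mantissa and shifting right by
# (8 mantissa bits + 5 suppressed zeroes) approximates division by 49.

SUPPRESSED_ZEROES = 5                 # field 2962
MANTISSA = 0b10100111                 # field 2964 (decimal 167)
TOTAL_SHIFT = 8 + SUPPRESSED_ZEROES   # 13 fractional bits in the true reciprocal

acc = 490                             # accumulator value to divide by 49
product = acc * MANTISSA              # 81830
quotient = (product + (1 << (TOTAL_SHIFT - 1))) >> TOTAL_SHIFT  # round to nearest
# quotient == 10, i.e., 490 / 49
```

Note that 167/2^13 ≈ 0.020386, slightly below the true 1/49 ≈ 0.020408; the small error is the price of the 8-bit mantissa, which the leading-zero suppression keeps as accurate as possible for the given width.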
Referring now to Figure 30, a block diagram illustrating an embodiment of the AFU 212 of Figure 2 in more detail is shown. The AFU 212 includes: the control register 127 of Figure 1; a positive form converter (PFC) and output binary point aligner (OBPA) 3002 that receives the accumulator 202 value 217; a rounder 3004 that receives the accumulator 202 value 217 and an indication of the number of bits shifted out by the OBPA 3002; a random bit source 3003 that generates random bits 3005, as described above; a first mux 3006 that receives the output of the PFC and OBPA 3002 and the output of the rounder 3004; a compressor to canonical size (CCS) and saturator 3008 that receives the output of the first mux 3006; a bit selector and saturator 3012 that receives the output of the CCS and saturator 3008; a rectifier 3018 that receives the output of the CCS and saturator 3008; a reciprocal multiplier 3014 that receives the output of the CCS and saturator 3008; a right shifter 3016 that receives the output of the CCS and saturator 3008; a hyperbolic tangent (tanh) module 3022 that receives the output of the bit selector and saturator 3012; a sigmoid module 3024 that receives the output of the bit selector and saturator 3012; a softplus module 3026 that receives the output of the bit selector and saturator 3012; a second mux 3032 that receives the outputs of the tanh module 3022, the sigmoid module 3024, the softplus module 3026, the rectifier 3018, the reciprocal multiplier 3014 and the right shifter 3016, as well as the passed-through canonical-size output 3028 of the CCS and saturator 3008; a sign restorer 3034 that receives the output of the second mux 3032; a size converter and saturator 3036 that receives the output of the sign restorer 3034; a third mux 3037 that receives the output of the size converter and saturator 3036 and the accumulator output 217; and an output register 3038 that receives the output of the mux 3037 and whose output is the result 133 of Figure 1.
The PFC and OBPA 3002 receive the accumulator 202 value 217. Preferably, the accumulator 202 value 217 is a full-precision value, as described above. That is, the accumulator 202 has a sufficient number of storage bits to hold an accumulated value that is the sum, generated by the integer adder 244, of a series of products generated by the integer multiplier 242, without discarding any of the bits of the individual products of the multiplier 242 or of the sums of the adder, so that no precision is lost. Preferably, the accumulator 202 has at least a sufficient number of bits to hold the maximum number of accumulations of products that the NNU 121 is programmable to perform. For example, with reference to the program of Figure 4, the maximum number of product accumulations the NNU 121 is programmable to perform when in a wide configuration is 512, and the accumulator 202 bit width is 41 bits. For another example, with reference to the program of Figure 20, the maximum number of product accumulations the NNU 121 is programmable to perform when in a narrow configuration is 1024, and the accumulator 202 bit width is 28 bits. Generally, the full-precision accumulator 202 has at least Q bits, where Q is the sum of M and log2(P), where M is the bit width of the integer product of the multiplier 242 (e.g., 16 bits for the narrow multiplier 242, or 32 bits for the wide multiplier 242) and P is the maximum permissible number of the integer products that may be accumulated into the accumulator 202. Preferably, the maximum number of product accumulations is specified via a programming specification to the programmer of the NNU 121. In one embodiment, the sequencer 128 enforces a maximum value of the count of a multiply-accumulate NNU instruction (e.g., the instruction at address 2 of Figure 4), e.g., 511, with the assumption of one previous multiply-accumulate instruction that loads the row of data/weight words 206/207 from the data/weight RAM 122/124 (e.g., the instruction at address 1 of Figure 4).
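The accumulator-sizing rule just stated can be sketched as a one-line computation. The function below is illustrative, not part of the specification; it simply evaluates Q = M + log2(P) for the two configurations given in the text.

```python
import math

# Sketch of the full-precision accumulator sizing rule: Q = M + log2(P),
# where M is the product bit width and P the maximum number of products
# accumulated. The figures match the widths given in the text.

def min_accumulator_bits(product_width_m: int, max_products_p: int) -> int:
    return product_width_m + math.ceil(math.log2(max_products_p))

wide_q = min_accumulator_bits(32, 512)     # 32 + 9 = 41, the wide accumulator width
narrow_q = min_accumulator_bits(16, 1024)  # 16 + 10 = 26; the 28-bit narrow
                                           # accumulator satisfies "at least Q bits"
```

Note that the wide configuration exactly meets the bound (41 bits), while the narrow 28-bit accumulator exceeds its bound of 26 bits, consistent with the "at least Q bits" phrasing.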
Advantageously, including an accumulator 202 that has a large enough bit width to accumulate a full-precision value of the maximum allowable number of accumulations simplifies the design of the ALU 204 portion of the NPU 126. In particular, it alleviates the need for logic to saturate sums generated by the integer adder 244 that would overflow a smaller accumulator, and for logic to keep track of the binary point location of the accumulator to determine whether an overflow has occurred in order to know whether saturation is needed. To illustrate by example a problem with a design that included a non-full-precision accumulator and instead included saturating logic to handle overflows of the non-full-precision accumulator, assume the following.
(1) The range of the data word values is between 0 and 1 and all of the storage bits are used to store fractional bits. The range of the weight word values is between -8 and +8 and all but three of the storage bits are used to store fractional bits. And the range of the accumulated values for input to a hyperbolic tangent activation function is between -8 and +8 and all but three of the storage bits are used to store fractional bits.
(2) bit width of accumulator is non-full precision (for example, the only bit width of product).
(3) The final accumulated value would be somewhere between -8 and +8 (e.g., +4.2), assuming the accumulator were full-precision; however, the products before some "point A" in the series tend much more frequently to be positive, whereas the products after point A tend much more frequently to be negative.
In such a situation, an inaccurate result (i.e., a result other than +4.2) might be obtained. This is because at some point before point A, the accumulator may be saturated to the maximum value of +8 when it should have held a larger value, e.g., +8.2, causing loss of the remaining +0.2. The accumulator could even remain at its saturated value through further product accumulations, resulting in the loss of even more positive value. Thus, the final value of the accumulator could be a smaller number (i.e., less than +4.2) than it would have been if the accumulator had a full-precision bit width.
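The saturation-loss scenario above can be reproduced with a short sketch. The particular product sequence is invented for the demonstration (values are scaled by 10 to keep the arithmetic in integers), but it exhibits exactly the failure mode described: positive products first, negative products after, and a clamp at +8.

```python
# Sketch of the scenario above: four products of +2.1 followed by four of -1.0,
# and a saturating accumulator clamped at +8.0 (80 in tenths). The saturating
# design loses the +0.4 that exceeded the clamp, so its final value falls
# short of the full-precision result.

SAT_MAX = 80                        # +8.0, scaled by 10
products = [21] * 4 + [-10] * 4     # +2.1 x4 (before point A), then -1.0 x4

full = sum(products)                # 84 - 40 = 44, i.e., +4.4

sat = 0
for p in products:
    sat = min(sat + p, SAT_MAX)     # clamp on every accumulation
# sat == 40, i.e., +4.0: smaller than the full-precision +4.4
```

Once the running sum of 84 is clamped to 80, the lost +0.4 can never be recovered, which is precisely why the full-precision accumulator design avoids saturation entirely.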
The PFC 3002 converts the accumulator 202 value 217 to a positive form if the value is negative, and generates an additional bit that indicates whether the original value was positive or negative, which is passed down the AFU 212 pipeline along with the value. Converting to a positive form simplifies subsequent operations by the AFU 212. For example, it enables only positive values to be input to the tanh module 3022 and the sigmoid module 3024, thus simplifying them. Additionally, it simplifies the rounder 3004 and the saturator 3008.
The OBPA 3002 shifts, or scales, the positive-form value right to align it with the output binary point 2954 specified in the control register 127. Preferably, the OBPA 3002 calculates the shift amount as a difference that is the number of fractional bits of the output (e.g., specified by the output binary point 2954) subtracted from the number of fractional bits of the accumulator 202 value 217 (e.g., specified by the accumulator binary point 2923, or by the sum of the data binary point 2922 and the weight binary point 2924). Thus, for example, if the accumulator 202 binary point 2923 is 8 (as in the embodiment above) and the output binary point 2954 is 3, then the OBPA 3002 shifts the positive-form value right 5 bits to generate a result provided to the mux 3006 and to the rounder 3004.
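The alignment arithmetic above amounts to a subtraction and a right shift, sketched here with an illustrative raw accumulator value (the value 1384 is invented for the example):

```python
# Sketch of the OBPA alignment: the right-shift amount is the accumulator's
# fractional-bit count minus the output's fractional-bit count.

ACC_FRAC = 8   # accumulator binary point (e.g., 2922 + 2924 = 5 + 3)
OUT_FRAC = 3   # output binary point 2954

shift = ACC_FRAC - OUT_FRAC   # 5, as in the example above
acc_raw = 1384                # 1384/256 = 5.40625 with 8 fractional bits
aligned = acc_raw >> shift    # now 3 fractional bits; the rounder sees the
                              # 5 shifted-out bits separately
# aligned == 43, i.e., 43/8 = 5.375, the truncated alignment of 5.40625
```

The truncation from 5.40625 to 5.375 is exactly the precision the shifted-out bits carried, which is why those bits are forwarded to the rounder 3004 rather than silently discarded.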
The rounder 3004 rounds the accumulator 202 value 217. Preferably, the rounder 3004 generates a rounded version of the positive-form value generated by the PFC and OBPA 3002 and provides the rounded version to the mux 3006. The rounder 3004 rounds according to the round control 2932 described above, which, as described above and below, may include stochastic rounding using the random bit 3005. The mux 3006 selects one of its inputs, i.e., either the positive-form value from the PFC and OBPA 3002 or the rounded version thereof from the rounder 3004, based on the round control 2932 (which may specify stochastic rounding as described herein), and provides the selected value to the CCS and saturator 3008. Preferably, if the round control 2932 specifies no rounding, the mux 3006 selects the output of the PFC and OBPA 3002, and otherwise selects the output of the rounder 3004. Other embodiments are contemplated in which the AFU 212 performs additional rounding. For example, in one embodiment, the bit selector 3012 rounds based on the lost low-order bits when it compresses the CCS and saturator 3008 output bits (described below). For another example, in one embodiment, the product of the reciprocal multiplier 3014 (described below) is rounded. For another example, in one embodiment, the size converter 3036 rounds when it converts to the proper output size (described below), which may involve losing low-order bits used in the rounding determination.
The CCS 3008 compresses the mux 3006 output value to the canonical size. Thus, for example, if the NPU 126 is in a narrow or funnel configuration 2902, then the CCS 3008 compresses the 28-bit mux 3006 output value to 16 bits; and if the NPU 126 is in a wide configuration 2902, then the CCS 3008 compresses the 41-bit mux 3006 output value to 32 bits. However, before compressing to the canonical size, if the pre-compressed value is greater than the maximum value expressible in the canonical form, the saturator 3008 saturates the pre-compressed value to the maximum value expressible in the canonical form. For example, if any of the bits of the pre-compressed value left of the most-significant canonical-form bit has a 1 value, then the saturator 3008 saturates to the maximum value (e.g., to all 1's).
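The compress-with-saturation behavior just described can be sketched as a simple clamp (an illustrative sketch; the function name and the test values are invented, and the 16-bit canonical size corresponds to the narrow/funnel case above):

```python
# Sketch of compress-to-canonical-size with saturation: a wider positive-form
# value is reduced to the 16-bit canonical size; if any bit above the
# canonical field is set, the output saturates to all ones.

def compress_to_canonical(value: int, canonical_bits: int = 16) -> int:
    max_canonical = (1 << canonical_bits) - 1
    return max_canonical if value > max_canonical else value

in_range = compress_to_canonical(0x1234)    # fits: passes through unchanged
overflow = compress_to_canonical(0x123456)  # bits above bit 15 are set
# in_range == 0x1234; overflow == 0xFFFF (saturated to all 1's)
```

Saturating to the maximum expressible value, rather than wrapping, ensures an out-of-range intermediate produces the closest representable result instead of a wildly wrong one.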
Preferably, the tanh module 3022, the sigmoid module 3024 and the softplus module 3026 all comprise lookup tables, e.g., programmable logic arrays (PLA), read-only memories (ROM), combinational logic gates, and so forth. In one embodiment, in order to simplify and reduce the size of the modules 3022/3024/3026, they are provided an input value that has a 3.4 form, i.e., three whole bits and four fractional bits; that is, the input value has four bits to the right of the binary point and three bits to the left of the binary point. These values are chosen because at the extremes of the input value range (-8, +8) of the 3.4 form, the output values asymptotically approach their minimum/maximum values. However, other embodiments are contemplated that place the binary point at a different location, e.g., in a 4.3 form or a 2.5 form. The bit selector 3012 selects the bits of the CCS and saturator 3008 output that satisfy the 3.4 form criteria, which involves compression, i.e., some bits are lost, since the canonical form has a larger number of bits. However, prior to selecting/compressing the CCS and saturator 3008 output value, if the pre-compressed value is greater than the maximum value expressible in the 3.4 form, the saturator 3012 saturates the pre-compressed value to the maximum value expressible in the 3.4 form. For example, if any of the bits of the pre-compressed value left of the most-significant 3.4-form bit has a 1 value, then the saturator 3012 saturates to the maximum value (e.g., to all 1's).
The tanh module 3022, the sigmoid module 3024 and the softplus module 3026 perform their respective activation functions (described above) on the 3.4-form value output by the CCS and saturator 3008 to generate a result. Preferably, the result of the tanh module 3022 and the sigmoid module 3024 is a 7-bit result in a 0.7 form, i.e., zero whole bits and seven fractional bits; that is, the result value has seven bits to the right of the binary point. Preferably, the result of the softplus module 3026 is a 7-bit result in a 3.4 form, i.e., in the same form as the input to the module 3026. Preferably, the outputs of the tanh module 3022, the sigmoid module 3024 and the softplus module 3026 are extended to canonical form (e.g., with leading zeroes added as necessary) and aligned to have the binary point specified by the output binary point 2954 value.
The rectifier 3018 generates a rectified version of the CCS and saturator 3008 output value. That is, if the CCS and saturator 3008 output value (whose sign is piped down with it, as described above) is negative, the rectifier 3018 outputs a value of zero; otherwise, the rectifier 3018 outputs its input value. Preferably, the output of the rectifier 3018 is in the canonical form and has the binary point specified by the output binary point 2954 value.
The reciprocal multiplier 3014 multiplies the output of the CCS and saturator 3008 by the user-specified reciprocal value specified in the reciprocal value 2942 to generate its canonical-size product, which is effectively the quotient of the output of the CCS and saturator 3008 and the divisor that is the reciprocal of the 2942 value. Preferably, the output of the reciprocal multiplier 3014 is in canonical form and has the binary point specified by the output binary point 2954 value.
The right shifter 3016 shifts the output of the CCS and saturator 3008 by the user-specified number of bits specified in the shift amount value 2944 to generate its canonical-size quotient. Preferably, the output of the right shifter 3016 is in canonical form and has the binary point specified by the output binary point 2954 value.
The mux 3032 selects the appropriate input as specified by the activation function 2934 value and provides the selection to the sign restorer 3034, which converts the positive-form output of the mux 3032 to a negative form, e.g., to two's-complement form, if the original accumulator 202 value 217 was a negative value.
The size converter 3036 converts the output of the sign restorer 3034 to the proper size based on the value of the output command 2956, which is described above with respect to Figure 29A. Preferably, the output of the sign restorer 3034 has a binary point specified by the output binary point 2954 value. For the first predetermined value of the output command 2956, the size converter 3036 discards the bits of the upper half of the sign restorer 3034 output. Furthermore, if the output of the sign restorer 3034 is positive and exceeds the maximum value expressible in the word size specified by the configuration 2902, or is negative and is less than the minimum value expressible in the word size, the saturator 3036 saturates its output to the respective maximum/minimum value expressible in the word size. For the second and third predetermined values, the size converter 3036 passes the sign restorer 3034 output through.
The mux 3037 selects either the output of the size converter and saturator 3036 or the accumulator 202 output 217, based on the output command 2956, for provision to the output register 3038. More specifically, for the first and second predetermined values of the output command 2956, the mux 3037 selects the lower word of the output of the size converter and saturator 3036 (whose size is specified by the configuration 2902). For the third predetermined value, the mux 3037 selects the upper word of the output of the size converter and saturator 3036. For the fourth predetermined value, the mux 3037 selects the lower word of the raw accumulator 202 value 217; for the fifth predetermined value, the mux 3037 selects the middle word of the raw accumulator 202 value 217; and for the sixth predetermined value, the mux 3037 selects the upper word of the raw accumulator 202 value 217. Preferably, the AFU 212 pads the upper bits of the upper word of the raw accumulator 202 value 217 with zeroes.
Referring now to Figure 31, an example of the operation of the AFU 212 of Figure 30 is shown. As shown, the configuration 2902 is set to the narrow configuration of the NPUs 126. Additionally, the signed data 2912 and signed weight 2914 values are true. Additionally, the data binary point 2922 value indicates that the binary point for the data RAM 122 words is located such that there are 7 bits to the right of the binary point, and an example value of the first data word received by one of the NPUs 126 is shown as 0.1001110. Additionally, the weight binary point 2924 value indicates that the binary point for the weight RAM 124 words is located such that there are 3 bits to the right of the binary point, and an example value of the first weight word received by the NPU 126 is shown as 00001.010.
The 16-bit product of the first data word and the first weight word (which product is accumulated with the initial zero value of the accumulator 202) is shown as 000000.1100001100. Because the data binary point 2912 is 7 and the weight binary point 2914 is 3, the implied accumulator 202 binary point is located such that there are 10 bits to the right of the binary point. In the case of the narrow configuration, the accumulator 202 is 28 bits wide in the exemplary embodiment. In the example, the value 217 of the accumulator 202 after all the ALU operations are performed (e.g., all 1024 of the multiply-accumulates of Figure 20) is shown as 000000000000000001.1101010100.
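The key point above is that no alignment shifting is needed: the binary point of a fixed-point product is simply the sum of the operands' fractional bit counts. A minimal sketch reproducing the first multiply of the Figure 31 example (the bit patterns and point locations are taken from the text; the variable names are ours):

```python
# Data binary point 2922 gives 7 fractional bits; weight binary
# point 2924 gives 3. The implied accumulator point is their sum.
DATA_FRAC = 7
WEIGHT_FRAC = 3
ACC_FRAC = DATA_FRAC + WEIGHT_FRAC     # 10 bits right of the point

data = 0b0_1001110      # 0.1001110  (7 fractional bits)
weight = 0b00001_010    # 00001.010  (3 fractional bits)

# A plain integer multiply yields a product with ACC_FRAC fractional
# bits; no floating-point exponent handling or addend alignment occurs.
product = data * weight
assert product == 0b000000_1100001100  # 000000.1100001100, as in Fig. 31
```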
The output binary point 2954 value indicates that the binary point for the output is located such that there are 7 bits to the right of the binary point. Therefore, after passing through the OBPA 3002 and the CCS 3008, the accumulator 202 value 217 is scaled, rounded and compressed to the canonical form value, namely 000000001.1101011. In the example, the output binary point location indicates 7 fractional bits, and the accumulator 202 binary point location indicates 10 fractional bits. Therefore, the OBPA 3002 computes a difference of 3 and scales the accumulator 202 value 217 by shifting it right 3 bits. This is indicated in Figure 31 by the loss of the 3 least significant bits (binary 100) of the accumulator 202 value 217. Further in the example, the round control 2932 value indicates that stochastic rounding is to be used, and in the example it is assumed that the sampled random bit 3005 is true. Consequently, per the above description, the least significant bit was rounded up, because the round bit of the accumulator 202 value 217 (the most significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) was 1 and the sticky bit (the Boolean OR of the 2 least significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) was 0.
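The round/sticky mechanics above can be sketched as follows. This is a behavioral model under stated assumptions (round-to-nearest with a random tie-break when the round bit is 1 and the sticky bit is 0), not the gate-level design; the function name is hypothetical.

```python
def scale_with_stochastic_round(acc, shift, random_bit):
    """Right-shift `acc` by `shift` bits, rounding to nearest and
    breaking exact ties with `random_bit` (a sketch of the OBPA 3002
    scaling combined with the stochastic rounding that round control
    2932 selects)."""
    if shift == 0:
        return acc
    kept = acc >> shift
    dropped = acc & ((1 << shift) - 1)
    round_bit = (dropped >> (shift - 1)) & 1            # MSB shifted out
    sticky = (dropped & ((1 << (shift - 1)) - 1)) != 0  # OR of the rest
    if round_bit and (sticky or random_bit):
        kept += 1                                        # round up
    return kept

# Figure 31 example: the 3 shifted-out bits are 100, so round bit = 1
# and sticky = 0, an exact tie; sampled random bit 3005 is true.
assert scale_with_stochastic_round(0b1_1101010100, 3, True) == 0b1_1101011
```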
In the example, the activation function 2934 value indicates that a sigmoid function is to be used. Consequently, the bit selector 3012 selects the bits of the canonical form value such that the input to the sigmoid module 3024 has three whole bits and four fractional bits, as described above, i.e., the value 001.1101 as shown. The sigmoid module 3024 outputs a value that is put in canonical form, namely the value 000000000.1101110 as shown.
The output command 2956 of the example specifies the first predetermined value, i.e., to output the word size indicated by the configuration 2902, which in this case is a narrow word (8 bits). Consequently, the size converter 3036 converts the canonical sigmoid output value to an 8-bit quantity having an implied binary point located such that there are 7 bits to the right of the binary point, yielding an output value of 01101110, as shown.
Referring now to Figure 32, a second example of the operation of the AFU 212 of Figure 30 is shown. The example of Figure 32 illustrates the operation of the AFU 212 when the activation function 2934 indicates that the accumulator 202 value 217 is to be passed through in the canonical size. As shown, the configuration 2902 is set to the narrow configuration of the NPUs 126.
In the example, the accumulator 202 is 28 bits wide, and the accumulator 202 binary point is located such that there are 10 bits to the right of the binary point (either because the sum of the data binary point 2912 and the weight binary point 2914 is 10 according to one embodiment, or because the accumulator binary point 2923 is explicitly specified as having a value of 10 according to an alternate embodiment, as described above). In the example, Figure 32 shows the value 217 of the accumulator 202 after all the ALU operations are performed, which is 000001100000011011.1101111010.
In the example, the output binary point 2954 value indicates that the binary point for the output is located such that there are 4 bits to the right of the binary point. Therefore, after passing through the OBPA 3002 and the CCS 3008, the accumulator 202 value 217 is saturated and compressed to the canonical form value 111111111111.1111, as shown, which is received by the mux 3032 as the canonical size pass-through value 3028.
In the example, two output commands 2956 are shown. The first output command 2956 specifies the second predetermined value, i.e., to output the lower word of the canonical form size. Since the size indicated by the configuration 2902 is a narrow word (8 bits), which implies a canonical size of 16 bits, the size converter 3036 selects the lower 8 bits of the canonical size pass-through value 3028 to yield the 8-bit value 11111111, as shown. The second output command 2956 specifies the third predetermined value, i.e., to output the upper word of the canonical form size. Consequently, the size converter 3036 selects the upper 8 bits of the canonical size pass-through value 3028 to yield the 8-bit value 11111111, as shown.
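The lower-word/upper-word selection just described amounts to plain bit slicing of the 16-bit canonical value. A minimal sketch using the saturated value of the Figure 32 example (the masking idiom is ours; the hardware is a mux, not an ALU):

```python
# Canonical size pass-through value 3028 from the Figure 32 example:
# 111111111111.1111 held as a 16-bit pattern.
canonical = 0b111111111111_1111

lower = canonical & 0xFF          # second predetermined value: lower 8 bits
upper = (canonical >> 8) & 0xFF   # third predetermined value: upper 8 bits

assert lower == 0b11111111
assert upper == 0b11111111
```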
Referring now to Figure 33, a third example of the operation of the AFU 212 of Figure 30 is shown. The example of Figure 33 illustrates the operation of the AFU 212 when the activation function 2934 indicates that the entire raw accumulator 202 value 217 is to be passed through. As shown, the configuration 2902 is set to the wide configuration of the NPUs 126 (e.g., 16-bit input words).
In the example, the accumulator 202 is 41 bits wide, and the accumulator 202 binary point is located such that there are 8 bits to the right of the binary point (either because the sum of the data binary point 2912 and the weight binary point 2914 is 8 according to one embodiment, or because the accumulator binary point 2923 is explicitly specified as having a value of 8 according to an alternate embodiment, as described above). In the example, Figure 33 shows the value 217 of the accumulator 202 after all the ALU operations are performed, which is 001000000000000000001100000011011.11011110.
In the example, three output commands 2956 are shown. The first output command 2956 specifies the fourth predetermined value, i.e., to output the lower word of the raw accumulator 202 value; the second output command 2956 specifies the fifth predetermined value, i.e., to output the middle word of the raw accumulator 202 value; and the third output command 2956 specifies the sixth predetermined value, i.e., to output the upper word of the raw accumulator 202 value. Since the size indicated by the configuration 2902 is a wide word (16 bits), Figure 33 shows that in response to the first output command 2956, the mux 3037 selects the 16-bit value 0001101111011110; in response to the second output command 2956, the mux 3037 selects the 16-bit value 0000000000011000; and in response to the third output command 2956, the mux 3037 selects the 16-bit value 0000000001000000.
As described above, advantageously the NNU 121 performs operations on integer data rather than floating-point data. This simplifies each NPU 126, or at least the ALU 204 portion. For example, the ALU 204 need not include the adder that would be needed in a floating-point implementation to add the exponents of the multiplicands of the multiplier 242. Similarly, the ALU 204 need not include the shifter that would be needed in a floating-point implementation to align the binary points of the addends of the adder 234. As one skilled in the art will appreciate, floating-point units are generally very complex; thus, these are only examples of simplifications to the ALU 204, and other simplifications are enjoyed by the instant integer embodiments with hardware fixed-point assist that enables the user to specify the relevant binary points. The fact that the ALUs 204 are integer units rather than floating-point units may advantageously result in a smaller (and faster) NPU 126, which further advantageously facilitates the incorporation of a large array of NPUs 126 into the NNU 121. The AFU 212 portion handles the scaling and saturating of the accumulator 202 value 217 based on the (preferably user-specified) number of fractional bits desired in the accumulated value and the number of fractional bits desired in the output value. Advantageously, any additional complexity, and the accompanying increase in size, power and/or time, in the fixed-point hardware assist of the AFUs 212 may be amortized by sharing the AFUs 212 among the ALU 204 portions, as described with respect to the embodiment of Figure 11, for example, since the number of AFUs 1112 may be reduced in a shared embodiment.
Advantageously, embodiment as described herein is enjoyed many associated with the complexity of reduction of hardware integer arithmetic unit Benefit (compared to floating point arithmetic unit is used), while still providing the calculation for the decimal number of binary point (i.e. with) Art operation.The advantages of floating-point arithmetic, is:It may fall that (the value range is actually in a very wide value range for single value Be limited only in the size of index range, and the size may be very big) in any position data, arithmetical operation is provided.? That is each floating number has its potential unique exponential quantity.However, embodiment as described herein is recognized and is utilized as follows The fact, i.e.,:There are certain applications, wherein in such applications, input data height is parallel, and its value is in relative narrower In the range of so that " index " of all parallel values can be identical.Therefore, these embodiments allow users to once be directed to institute Some input values and/or accumulated value specify binary point position.Similarly, by recognizing and utilizing the class of parallel output Like range property, these embodiments allow users to once be directed to the specified binary point position of all output valves.People Artificial neural networks are an examples of this application, but the embodiment of the present invention can also be used for executing the calculating of other application. It is specified by being directed to the primary specified binary point position of input rather than carrying out this for each individual input number, compared to floating Point realizes that embodiment can efficiently use memory space (for example, it is desired to less memory), and/or make With precision is promoted in the case of the memory of similar quantity, this is because the position for index in floating-point realization can be used to specify The bigger precision of amplitude.
Further advantageously, the embodiments recognize the potential loss of precision that may be experienced during the accumulation of a large series of integer operations (e.g., overflow, or loss of the less significant fractional bits), and provide a solution, primarily in the form of a sufficiently large accumulator to avoid the loss of precision.
Direct Execution of NNU Micro-operations
Referring now to Figure 34, a block diagram is shown illustrating the processor 100 of Figure 1 and portions of the NNU 121 in more detail. The NNU 121 includes pipeline stages 3401 of the NPUs 126. The pipeline stages 3401, separated by staging registers, include combinational logic that accomplishes the operation of the NPUs 126 as described herein, such as Boolean logic gates, adders, multipliers, comparators, muxes, and so forth. The pipeline stages 3401 receive a micro-operation 3418 from a mux 3402. The micro-operation 3418 flows down the pipeline stages 3401 and controls their combinational logic. The micro-operation 3418 is a collection of bits. Preferably, the micro-operation 3418 includes the bits of the data RAM 122 memory address 123, the bits of the weight RAM 124 memory address 125, the bits of the program memory 129 memory address 131, the mux-reg 208/705 control signals 213/713, the mux 802 control signals 803, and many of the fields of the control register 127 (of Figures 29A through 29C, for example). In one embodiment, the micro-operation 3418 comprises approximately 120 bits. The mux 3402 receives micro-operations from three different sources and selects one of them as the micro-operation 3418 for provision to the pipeline stages 3401.
One micro-operation source to the mux 3402 is the sequencer 128 of Figure 1. The sequencer 128 decodes the NNU instructions received from the program memory 129 and, in response, generates a micro-operation 3416 provided to a first input of the mux 3402.
A second micro-operation source to the mux 3402 is a decoder 3404 that receives microinstructions 105 from the reservation stations 108 of Figure 1, along with operands from the GPR 116 and the media registers 118. Preferably, the microinstructions 105 are generated by the instruction translator 104 in response to translating MTNN instructions 1400 and MFNN instructions 1500, as described above. The microinstructions 105 may include an immediate field that specifies a particular function (as specified by an MTNN instruction 1400 or an MFNN instruction 1500), such as starting and stopping execution of a program in the program memory 129, directly executing a micro-operation from the media registers 118, or reading/writing a memory of the NNU, as described above. The decoder 3404 decodes the microinstructions 105 and, in response, generates a micro-operation 3412 provided to a second input of the mux 3402. Preferably, in response to some of the functions 1432/1532 of an MTNN 1400/MFNN 1500 instruction, it is not necessary for the decoder 3404 to generate a micro-operation 3412 to send down the pipeline 3401, for example: writing the control register 127, starting execution of a program in the program memory 129, pausing execution of a program in the program memory 129, waiting for completion of a program in the program memory 129, reading from the status register 127, and resetting the NNU 121.
A third micro-operation source to the mux 3402 is the media registers 118 themselves. Preferably, as described above with respect to Figure 14, an MTNN instruction 1400 may specify a function that instructs the NNU 121 to directly execute a micro-operation 3414 provided from the media registers 118 to a third input of the mux 3402. The direct execution of a micro-operation 3414 provided by the architectural media registers 118 may be particularly useful for testing, e.g., built-in self test (BIST), and for debugging of the NNU 121.
Preferably, the decoder 3404 generates a mode indicator 3422 that controls the mux 3402 selection. When an MTNN instruction 1400 specifies a function to start running a program from the program memory 129, the decoder 3404 generates a mode indicator 3422 value that causes the mux 3402 to select the micro-operation 3416 from the sequencer 128, until either an error occurs or the decoder 3404 encounters an MTNN instruction 1400 that specifies a function to stop running the program from the program memory 129. When an MTNN instruction 1400 specifies a function that instructs the NNU 121 to directly execute a micro-operation 3414 provided from a media register 118, the decoder 3404 generates a mode indicator 3422 value that causes the mux 3402 to select the micro-operation 3414 from the specified media register 118. Otherwise, the decoder 3404 generates a mode indicator 3422 value that causes the mux 3402 to select the micro-operation 3412 from the decoder 3404.
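The priority among the three micro-operation sources can be summarized with a small behavioral sketch. The function and the source encodings are hypothetical, invented for illustration; the hardware realizes this as the mode indicator 3422 driving the select input of the mux 3402.

```python
# Hypothetical encodings for the three sources the mux 3402 can forward
# as micro-operation 3418 (the numbers echo the reference numerals).
SEQUENCER, DECODER, MEDIA_REG = 3416, 3412, 3414

def select_micro_op_source(running_program, direct_execute):
    """Model of the decoder 3404 mode selection."""
    if running_program:      # MTNN start-program function was seen and
        return SEQUENCER     # neither an error nor a stop function yet
    if direct_execute:       # MTNN direct-execute function
        return MEDIA_REG
    return DECODER           # otherwise: decoder-generated micro-ops

assert select_micro_op_source(True, False) == SEQUENCER
assert select_micro_op_source(False, True) == MEDIA_REG
assert select_micro_op_source(False, False) == DECODER
```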
Variable Rate Neural Network Unit
There may be situations in which the NNU 121 runs a program and then sits idle, waiting for the processor 100 to do something it needs before it can run its next program. For example, assume a situation similar to that described with respect to Figures 3 through 6A, in which the NNU 121 runs two or more successive instances of a multiply-accumulate-activation-function program (which may also be referred to as a feed-forward neural network layer program). It may take the processor 100 significantly longer to write 512KB worth of weight values into the weight RAM 124 for use by the next NNU program run than it takes the NNU 121 to run the program. Stated alternatively, the NNU 121 may run the program in a relatively short amount of time and then sit idle while the processor 100 finishes writing the next weight values into the weight RAM 124 for the next program run. This situation is visually illustrated in Figure 36A, which is described in more detail below. In such situations, it may be advantageous to run the NNU 121 at a slower rate and take longer to execute the program, thereby spreading the energy consumption required for the NNU 121 to run the program over a longer time, which may tend to keep the temperature of the NNU 121, and perhaps of the processor 100 generally, lower. This situation is referred to as relaxed mode and is visually illustrated in Figure 36B, which is described in more detail below.
Referring now to Figure 35, a block diagram is shown illustrating a processor 100 with a variable rate NNU 121. The processor 100 is similar in many respects to the processor 100 of Figure 1, and like-numbered elements are similar. The processor 100 of Figure 35 also includes clock generation logic 3502 coupled to the functional units of the processor 100, namely, the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106, the reservation stations 108, the NNU 121, the other execution units 112, the memory subsystem 114, the general purpose registers 116 and the media registers 118. The clock generation logic 3502 includes a clock generator, such as a phase-locked loop (PLL), that generates a clock signal having a primary clock rate, or clock frequency. For example, the primary clock rate may be 1 GHz, 1.5 GHz, 2 GHz, and so forth. The clock rate indicates the number of cycles of the clock signal per second, e.g., the number of oscillations between a high state and a low state. Preferably, the clock signal has a balanced duty cycle, i.e., it is high for half the cycle and low for the other half; alternatively, the clock signal has an unbalanced duty cycle in which it is in the high state longer than it is in the low state, or vice versa. Preferably, the PLL is configurable to generate the primary clock signal at multiple clock rates. Preferably, the processor 100 includes a power management module that automatically adjusts the primary clock rate based on various factors, including the dynamically sensed operating temperature of the processor 100, its utilization, and commands from system software (e.g., operating system, BIOS) indicating desired performance and/or power-savings indicators. In one embodiment, the power management module includes microcode of the processor 100.
The clock generation logic 3502 also includes a clock distribution network, or clock tree. The clock tree distributes the primary clock signal to the functional units of the processor 100, indicated in Figure 35 as clock signal 3506-1 to the instruction fetch unit 101, clock signal 3506-2 to the instruction cache 102, clock signal 3506-10 to the instruction translator 104, clock signal 3506-9 to the rename unit 106, clock signal 3506-8 to the reservation stations 108, clock signal 3506-7 to the NNU 121, clock signal 3506-4 to the other execution units 112, clock signal 3506-3 to the memory subsystem 114, clock signal 3506-5 to the general purpose registers 116, and clock signal 3506-6 to the media registers 118; these signals are referred to collectively as clock signals 3506. The clock tree includes nodes, or wires, that transmit the primary clock signals 3506 to their respective functional units. Additionally, the clock generation logic 3502 preferably includes clock buffers that regenerate the primary clock signal, particularly for more distant nodes, when needed to provide cleaner clock signals and/or to boost the voltage level of the primary clock signal. Additionally, each functional unit may also include its own sub-clock tree, as needed, that regenerates and/or boosts the respective primary clock signal 3506 it receives.
The NNU 121 includes clock reduction logic 3504 that receives a relax indicator 3512, receives the primary clock signal 3506-7, and, in response, generates a secondary clock signal. The secondary clock signal has a clock rate that is either the same as the primary clock rate or, in the case of relaxed mode, is reduced relative to the primary clock rate by an amount programmed into the relax indicator 3512, which potentially provides thermal benefits. The clock reduction logic 3504 is similar in many respects to the clock generation logic 3502 in that it includes a clock distribution network, or clock tree, that distributes the secondary clock signal to the various blocks of the NNU 121, indicated as clock signal 3508-1 to the array of NPUs 126, clock signal 3508-2 to the sequencer 128 and clock signal 3508-3 to the interface logic 3514; these signals are referred to collectively or individually as secondary clock signals 3508. Preferably, as described with respect to Figure 34, the NPUs 126 include a plurality of pipeline stages 3401 that include pipeline staging registers that receive the secondary clock signal 3508-1 from the clock reduction logic 3504.
The NNU 121 also includes interface logic 3514 that receives the primary clock signal 3506-7 and the secondary clock signal 3508-3. The interface logic 3514 is coupled between the lower portions of the front end of the processor 100 (e.g., the reservation stations 108, media registers 118 and general purpose registers 116) and the various blocks of the NNU 121, namely the clock reduction logic 3504, the data RAM 122, the weight RAM 124, the program memory 129 and the sequencer 128. The interface logic 3514 includes a data RAM buffer 3522, a weight RAM buffer 3524, the decoder 3404 of Figure 34, and the relax indicator 3512. The relax indicator 3512 holds a value that specifies how slowly the array of NPUs 126 will execute the NNU program instructions. Preferably, the relax indicator 3512 specifies a divisor value N by which the clock reduction logic 3504 divides the primary clock signal 3506-7 to generate the secondary clock signal 3508, such that the rate of the secondary clock signal is 1/N. Preferably, the value of N may be programmed to any one of a plurality of different predetermined values to cause the clock reduction logic 3504 to generate the secondary clock signal 3508 at a corresponding plurality of different rates, each of which is less than the primary clock rate.
In one embodiment, the clock reduction logic 3504 includes a clock divider circuit that divides the primary clock signal 3506-7 by the relax indicator 3512 value. In one embodiment, the clock reduction logic 3504 includes clock gates (e.g., AND gates) that gate the primary clock signal 3506-7 with an enable signal that is true only once every N cycles of the primary clock signal 3506-7. For example, a circuit that includes a counter that counts up to N may be used to generate the enable signal. When accompanying logic detects that the output of the counter matches N, the logic generates a true pulse on the secondary clock signal 3508 and resets the counter. Preferably, the relax indicator 3512 value is programmable by an architectural instruction, such as the MTNN instruction 1400 of Figure 14. Preferably, as described in more detail with respect to Figure 37, the architectural program running on the processor 100 programs the relax value into the relax indicator 3512 just prior to instructing the NNU 121 to start running the NNU program.
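The counter-based clock-gating variant described above can be sketched behaviorally. This is a cycle-level software model under stated assumptions (the function name is ours, and the counter resets on match exactly as the text describes), not the actual gate-level design.

```python
def secondary_clock_enables(n, cycles):
    """Produce the enable stream for the clock-gating variant of the
    clock reduction logic 3504: the enable is true once every N primary
    clock cycles, generated by a counter that is reset on match."""
    enables, counter = [], 0
    for _ in range(cycles):
        counter += 1
        if counter == n:
            enables.append(1)   # true pulse gated onto secondary clock 3508
            counter = 0         # accompanying logic resets the counter
        else:
            enables.append(0)
    return enables

# N = 4: the secondary clock runs at 1/4 the primary clock rate.
assert secondary_clock_enables(4, 8) == [0, 0, 0, 1, 0, 0, 0, 1]
```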
The weight RAM buffer 3524 is coupled between the weight RAM 124 and the media registers 118 for buffering transfers of data between them. Preferably, the weight RAM buffer 3524 is similar to one or more of the embodiments of the buffer 1704 of Figure 17. Preferably, the portion of the weight RAM buffer 3524 that receives data from the media registers 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion of the weight RAM buffer 3524 that receives data from the weight RAM 124 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced relative to the primary clock rate depending upon the value programmed into the relax indicator 3512, i.e., depending upon whether the NNU 121 is operating in relaxed or normal mode. In one embodiment, the weight RAM 124 is single-ported, as described above with respect to Figure 17, and is accessible both by the media registers 118 via the weight RAM buffer 3524 and by the NPUs 126 or the row buffer 1104 of Figure 11, in an arbitrated fashion. In an alternate embodiment, the weight RAM 124 is dual-ported, as described above with respect to Figure 16, and each port is accessible both by the media registers 118 via the weight RAM buffer 3524 and by the NPUs 126 or the row buffer 1104, in a concurrent fashion.
Similarly, the data RAM buffer 3522 is coupled between the data RAM 122 and the media registers 118 for buffering transfers of data between them. Preferably, the data RAM buffer 3522 is similar to one or more of the embodiments of the buffer 1704 of Figure 17. Preferably, the portion of the data RAM buffer 3522 that receives data from the media registers 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion of the data RAM buffer 3522 that receives data from the data RAM 122 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced relative to the primary clock rate depending upon the value programmed into the relax indicator 3512, i.e., depending upon whether the NNU 121 is operating in relaxed or normal mode. In one embodiment, the data RAM 122 is single-ported, as described above with respect to Figure 17, and is accessible both by the media registers 118 via the data RAM buffer 3522 and by the NPUs 126 or the row buffer 1104 of Figure 11, in an arbitrated fashion. In an alternate embodiment, the data RAM 122 is dual-ported, and each port is accessible both by the media registers 118 via the data RAM buffer 3522 and by the NPUs 126 or the row buffer 1104, in a concurrent fashion.
Preferably, regardless of whether the data RAM 122 and/or the weight RAM 124 are single-ported or dual-ported, the interface logic 3514 includes the data RAM buffer 3522 and the weight RAM buffer 3524 in order to synchronize between the primary clock domain and the secondary clock domain. Preferably, the data RAM 122, the weight RAM 124 and the program memory 129 each comprise a static RAM (SRAM) that includes respective read enable, write enable and memory select signals.
As described above, the NNU 121 is an execution unit of the processor 100. An execution unit is a functional unit of a processor that executes the microinstructions into which architectural instructions are translated (such as the microinstructions 105 into which the architectural instructions 103 of Figure 1 are translated) or that executes the architectural instructions 103 themselves. An execution unit receives operands from the general registers of the processor, such as the GPR 116 and the media registers 118. An execution unit generates results, in response to executing microinstructions or architectural instructions, that may be written to the general registers. Examples of the architectural instructions 103 are the MTNN instruction 1400 and the MFNN instruction 1500, described with respect to Figures 14 and 15, respectively. The microinstructions implement the architectural instructions. More specifically, the collective execution by the execution unit of the one or more microinstructions into which an architectural instruction is translated performs the operation specified by the architectural instruction on the inputs specified by the architectural instruction to produce the result defined by the architectural instruction.
Referring now to Figure 36A, a timing diagram is shown illustrating an example of operation of the processor 100 with the NNU 121 operating in normal mode, i.e., at the primary clock rate. In the timing diagram, time progresses from left to right. The processor 100 is running an architectural program at the primary clock rate. More specifically, the front end of the processor 100 (e.g., the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106 and the reservation stations 108) fetches, decodes and issues architectural instructions to the NNU 121 and to the other execution units 112 at the primary clock rate.
Initially, the architectural program executes an architectural instruction (e.g., an MTNN instruction 1400) that the processor 100 front end issues to the NNU 121 to instruct the NNU 121 to begin running the NNU program in its program memory 129. Prior to that, the architectural program executed an architectural instruction to write the relax indicator 3512 with a value that specifies the primary clock rate, i.e., to put the NNU 121 in normal mode. More specifically, the value programmed into the relax indicator 3512 causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at the primary clock rate of the primary clock signal 3506. Preferably, in this case, clock buffers of the clock reduction logic 3504 simply boost the primary clock signal 3506. Additionally prior to that, the architectural program executed architectural instructions to write the data RAM 122 and the weight RAM 124 and to write the NNU program into the program memory 129. In response to the MTNN instruction 1400 that starts the NNU program, the NNU 121 begins running the NNU program at the primary clock rate, since the relax indicator 3512 was programmed with the primary-rate value. After starting the NNU 121 running, the architectural program continues executing architectural instructions at the primary clock rate, including and predominantly MTNN instructions 1400 to write and/or read the data RAM 122 and the weight RAM 124 in preparation for the next instance, or invocation, or run of the NNU program.
As shown in the example of Figure 36A, the NNU 121 completes the run of the NNU program in significantly less time (e.g., one-fourth the time) than the architectural program takes to complete its writes/reads of the data RAM 122 and the weight RAM 124. For example, each at the primary clock rate, the NNU 121 may take approximately 1000 clock cycles to run the NNU program, whereas the architectural program takes approximately 4000 clock cycles to run. Consequently, the NNU 121 sits idle for the remainder of the time, which is a considerably long time in this example, e.g., approximately 3000 primary-clock-rate cycles. As shown in the example of Figure 36A, this pattern continues another time, and may continue several more times, depending upon the size and configuration of the neural network. Because the NNU 121 may be a relatively large and transistor-dense functional unit of the processor 100, it may generate a significant amount of heat, particularly when running at the primary clock rate.
Referring now to Figure 36B, a timing diagram is shown illustrating an example of operation of the processor 100 with the NNU 121 operating in relaxed mode, i.e., at a rate less than the primary clock rate. The timing diagram of Figure 36B is similar in many respects to the timing diagram of Figure 36A, namely, the processor 100 runs an architectural program at the primary clock rate. It is assumed in this example that the architectural program and the NNU program of Figure 36B are the same as those of Figure 36A. However, prior to starting the NNU program, the architectural program executes an MTNN instruction 1400 that programs the relax indicator 3512 with a value that causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at a secondary clock rate less than the primary clock rate. That is, the architectural program puts the NNU 121 in the relaxed mode of Figure 36B rather than the normal mode of Figure 36A. Consequently, the NPUs 126 execute the NNU program at the secondary clock rate, which in relaxed mode is less than the primary clock rate. In this example, it is assumed that the relax indicator 3512 is programmed with a value that specifies the secondary clock rate as one-fourth the primary clock rate. As a result, as may be seen by comparing Figures 36A and 36B, the NNU 121 takes four times longer to run the NNU program in relaxed mode than in normal mode, so that the amount of time the NNU 121 sits idle is relatively short. Consequently, the NNU 121 of Figure 36B consumes the energy used to run the NNU program over a period that is approximately four times as long as the period over which the NNU 121 of Figure 36A runs the program in normal mode. Hence, the NNU 121 of Figure 36B generates heat while running the NNU program at approximately one-fourth the rate of Figure 36A, with the attendant thermal benefits described herein.
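The busy/idle arithmetic above can be sketched as a small model. The cycle counts (a 1000-cycle NNU program, a 4000-cycle architectural phase) are the illustrative figures from the text, and the function itself is an assumption-laden sketch, not a property of the hardware:

```python
# Model of NNU busy/idle time in normal vs. relaxed mode (Figures 36A/36B).
def nnu_idle_cycles(nnu_cycles, arch_cycles, clock_divisor):
    """Return (nnu_wall_cycles, idle_cycles), both in primary-clock cycles.

    nnu_cycles:    cycles the NNU program takes at the primary clock rate
    arch_cycles:   cycles the architectural program's RAM accesses take
    clock_divisor: secondary clock rate = primary clock rate / clock_divisor
    """
    nnu_wall = nnu_cycles * clock_divisor  # a slower clock stretches the run
    idle = max(0, arch_cycles - nnu_wall)  # time the NNU waits on the program
    return nnu_wall, idle

# Normal mode (Figure 36A): divisor 1 -> ~3000 idle primary-clock cycles.
assert nnu_idle_cycles(1000, 4000, 1) == (1000, 3000)
# Relaxed mode (Figure 36B): divisor 4 -> the two runs balance, no idle time.
assert nnu_idle_cycles(1000, 4000, 4) == (4000, 0)
```

The divisor of 4 is the "good value" case of Figure 36B: the NNU run is spread over the whole architectural phase, so heat is generated at a quarter of the normal-mode rate.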
Referring now to Figure 37, a flowchart is shown illustrating operation of the processor 100 of Figure 35. The operation illustrated by the flowchart is similar in many respects to the operation described above with respect to Figures 35, 36A and 36B. Flow begins at block 3702.
At block 3702, the processor 100 executes MTNN instructions 1400 to write the weight RAM 124 with weights and to write the data RAM 122 with data. Flow proceeds to block 3704.
At block 3704, the processor 100 executes an MTNN instruction 1400 that programs the relax indicator 3512 with a value that specifies a rate less than the primary clock rate, i.e., that puts the NNU 121 in relaxed mode. Flow proceeds to block 3706.
At block 3706, the processor 100 executes an MTNN instruction 1400 that instructs the NNU 121 to begin running an NNU program, in a manner similar to that shown in Figure 36B. Flow proceeds to block 3708.
At block 3708, the NNU 121 begins to run the NNU program. In parallel, the processor 100 executes MTNN instructions 1400 to write the weight RAM 124 with new weights (and possibly the data RAM 122 with new data), and/or executes MFNN instructions 1500 to read results from the data RAM 122 (and possibly from the weight RAM 124). Flow proceeds to block 3712.
At block 3712, the processor 100 executes an MFNN instruction 1500 (e.g., one that reads the status register 127) to detect that the NNU 121 has finished running its program. Assuming the architectural program selected a good value for the relax indicator 3512, the amount of time it takes the NNU 121 to run the NNU program is approximately the same as the amount of time it takes the portion of the architectural program that accesses the weight RAM 124 and/or data RAM 122 to execute, as shown in Figure 36B. Flow proceeds to block 3714.
At block 3714, the processor 100 executes an MTNN instruction 1400 that programs the relax indicator 3512 with a value that specifies the primary clock rate, i.e., that puts the NNU 121 in normal mode. Flow proceeds to block 3716.
At block 3716, the processor 100 executes an MTNN instruction 1400 that instructs the NNU 121 to begin running an NNU program, in a manner similar to that shown in Figure 36A. Flow proceeds to block 3718.
At block 3718, the NNU 121 begins to run the NNU program in normal mode. Flow ends at block 3718.
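The sequence of blocks 3702 through 3718 can be summarized as a driver-style sketch. The `mtnn`/`mfnn` helpers and their string operation codes are hypothetical stand-ins for the MTNN 1400 and MFNN 1500 architectural instructions, invented here purely for illustration:

```python
class FakeNNU:
    """Toy stand-in for the NNU 121 state touched by MTNN/MFNN (illustrative)."""
    def __init__(self):
        self.relax = 1        # clock divisor: 1 = normal mode (relax indicator 3512)
        self.running = False
        self.log = []

    def mtnn(self, op, value=None):      # MTNN instruction 1400 (hypothetical encoding)
        if op == "write_relax":
            self.relax = value
        elif op == "start_program":
            self.running = True
        self.log.append((op, value))

    def mfnn(self, op):                  # MFNN instruction 1500 (hypothetical encoding)
        if op == "read_status":          # pretend the NNU program just finished
            self.running = False
            return {"running": self.running}

def figure37_flow(nnu):
    nnu.mtnn("write_ram")                # block 3702: write weights and data
    nnu.mtnn("write_relax", 4)           # block 3704: relaxed mode, 1/4 rate
    nnu.mtnn("start_program")            # block 3706: start the NNU program
    nnu.mfnn("read_status")              # block 3712: detect completion
    nnu.mtnn("write_relax", 1)           # block 3714: back to normal mode
    nnu.mtnn("start_program")            # block 3716: start in normal mode
    return nnu

nnu = figure37_flow(FakeNNU())
assert nnu.relax == 1 and nnu.running
```

The point of the sketch is the ordering: the relax indicator is written before each program start, and the status register is polled before the mode is switched back.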
As described above, running the NNU program in relaxed mode spreads out the time over which the NNU runs the program, relative to running it in normal mode (i.e., at the primary clock rate of the processor), which may provide thermal benefits. More specifically, while the NNU runs the program in relaxed mode, the devices (e.g., transistors, capacitors, wires) will likely operate at lower temperatures because the NNU generates heat at a slower rate, and that heat is dissipated by the NNU (e.g., the semiconductor devices, metal layers and underlying substrate) and by the surrounding package and cooling solution (e.g., heat sink, fan). Generally, this also lowers the device temperatures in other parts of the processor die. The lower operating temperature of the devices, in particular their junction temperatures, may have the benefit of reduced leakage current. Furthermore, since less current flows per unit time, inductive noise and IR-drop noise may also be reduced. Additionally, the lower temperature has a positive effect on the negative-bias temperature instability (NBTI) and positive-bias temperature instability (PBTI) of the processor's MOSFETs, thereby increasing the reliability and/or lifetime of the devices and of the processor part. The lower temperature may also reduce Joule heating and electromigration in the processor's metal layers.
Communication mechanism between architectural program and non-architectural program regarding NNU shared resources
As described above, taking Figures 24 through 28 and 35 through 37 as examples, the data RAM 122 and the weight RAM 124 are shared resources. Both the NPUs 126 and the front end of the processor 100 share the data RAM 122 and the weight RAM 124. More specifically, both the NPUs 126 and the front end of the processor 100 (e.g., the media registers 118) write and read the data RAM 122 and the weight RAM 124. Stated alternatively, the architectural program running on the processor 100 shares the data RAM 122 and the weight RAM 124 with the NNU program running on the NNU 121, and in some situations, as described above, this requires flow control between the architectural program and the NNU program. This resource sharing also applies to some extent to the program memory 129, since the architectural program writes it and the sequencer 128 reads it. The embodiments described in this context provide a high-performance solution for controlling the flow of access to the shared resources between the architectural program and the NNU program.
In the embodiments described herein, the NNU programs are also referred to as non-architectural programs, the NNU instructions are also referred to as non-architectural instructions, and the NNU instruction set (also referred to above as the NPU instruction set) is also referred to as the non-architectural instruction set. The non-architectural instruction set is distinct from the architectural instruction set. In embodiments in which the processor 100 includes an instruction translator 104 that translates architectural instructions into microinstructions, the non-architectural instruction set is also distinct from the microinstruction set.
Referring now to Figure 38, a block diagram is shown illustrating the sequencer 128 of the NNU 121 in greater detail. As described above, the sequencer 128 provides the memory address 131 to the program memory 129 to select the non-architectural instruction that is provided to the sequencer 128. As shown in Figure 38, the memory address 131 is held in a program counter 3802 of the sequencer 128. The sequencer 128 generally increments sequentially through the addresses of the program memory 129 unless it encounters a non-architectural instruction such as a loop or branch instruction, in which case the sequencer 128 updates the program counter 3802 to the target address of the control instruction, i.e., to the address of the non-architectural instruction at the target of the control instruction. Therefore, the address 131 held in the program counter 3802 specifies the address in the program memory 129 of the non-architectural instruction of the non-architectural program currently being fetched for execution by the NPUs 126. Advantageously, the value of the program counter 3802 may be obtained by the architectural program via the NNU program counter field 3912 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to decide where to read/write data with respect to the data RAM 122 and/or weight RAM 124 based on the progress of the non-architectural program.
The sequencer 128 also includes a loop counter 3804 that is used in conjunction with a non-architectural loop instruction, such as the loop-to-1 instruction at address 10 of Figure 26A and the loop-to-1 instruction at address 11 of Figure 28. In the examples of Figures 26A and 28, the loop counter 3804 is loaded with the value specified in the non-architectural initialization instruction at address 0, e.g., the value 400. Each time the sequencer 128 encounters the loop instruction and jumps to the target instruction (e.g., the multiply-accumulate instruction at address 1 of Figure 26A or the maxwacc instruction at address 1 of Figure 28), the sequencer 128 decrements the loop counter 3804. Once the loop counter 3804 reaches zero, the sequencer 128 proceeds to the next sequential non-architectural instruction. In an alternate embodiment, the loop counter 3804 is loaded with the loop count value specified in the loop instruction the first time the loop instruction is encountered, which obviates the need to initialize the loop counter 3804 via a non-architectural initialization instruction. Thus, the value of the loop counter 3804 indicates the number of times the loop body of the non-architectural program remains to be executed. Advantageously, the value of the loop counter 3804 may be obtained by the architectural program via the loop count field 3914 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to decide where to read/write data with respect to the data RAM 122 and/or weight RAM 124 based on the progress of the non-architectural program. In one embodiment, the sequencer 128 includes three additional loop counters to accommodate nested loops in the non-architectural program, and the values of the other three loop counters are also readable via the status register 127. A bit in the loop instruction indicates which of the four loop counters is used by the instant loop instruction.
The sequencer 128 also includes an iteration counter 3806. The iteration counter 3806 is used in conjunction with non-architectural instructions such as the multiply-accumulate instruction at address 2 of Figures 4, 9, 20 and 26A, and the maxwacc instruction at address 2 of Figure 28, which will hereafter be referred to as "execute" instructions. In the above examples, the execute instructions specify iteration counts of 511, 511, 1023, 2 and 3, respectively. When the sequencer 128 encounters an execute instruction that specifies a non-zero iteration count, the sequencer 128 loads the iteration counter 3806 with the specified value. Additionally, the sequencer 128 generates the appropriate micro-operation 3418 to control the logic in the NPU 126 pipeline stages 3401 of Figure 34 for execution, and decrements the iteration counter 3806. If the iteration counter 3806 is greater than zero, the sequencer 128 again generates the appropriate micro-operation 3418 to control the logic in the NPUs 126 and decrements the iteration counter 3806. The sequencer 128 continues in this fashion until the iteration counter 3806 reaches zero. Thus, the value of the iteration counter 3806 indicates the number of times the operation specified within the non-architectural execute instruction (e.g., multiply-accumulate, maximum, or sum of the accumulator and a data/weight word) remains to be performed. Advantageously, the value of the iteration counter 3806 may be obtained by the architectural program via the iteration count field 3916 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to decide where to read/write data with respect to the data RAM 122 and/or weight RAM 124 based on the progress of the non-architectural program.
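The interplay of the loop counter 3804 and iteration counter 3806 can be sketched as a toy interpreter. The tuple-based instruction encoding is invented for illustration and is not the non-architectural instruction format:

```python
# Toy model of the sequencer 128 counters (Figure 38): a loop counter that
# decrements on each backward jump of a loop instruction, and an iteration
# counter that repeats the operation of an "execute" instruction.
def run_program(program, loop_count):
    loop_counter = loop_count      # loop counter 3804 (loaded by the init instruction)
    pc = 0                         # program counter 3802
    executed = 0                   # micro-operations 3418 issued
    while pc < len(program):
        op = program[pc]
        if op[0] == "execute":     # e.g., multiply-accumulate with an iteration count
            iteration_counter = op[1]          # iteration counter 3806 loaded
            while True:
                executed += 1                  # issue one micro-operation
                iteration_counter -= 1
                if iteration_counter <= 0:     # continue until the counter hits zero
                    break
            pc += 1
        elif op[0] == "loop":      # jump back to target while the counter is non-zero
            loop_counter -= 1
            pc = op[1] if loop_counter > 0 else pc + 1
        else:                      # e.g., the initialize instruction at address 0
            pc += 1
    return executed

# A loop body with a 3-iteration execute instruction, run twice by the loop:
program = [("init",), ("execute", 3), ("loop", 1)]
assert run_program(program, 2) == 6
```

Reading `loop_counter` mid-run corresponds to the loop count field 3914; reading `iteration_counter` corresponds to the iteration count field 3916.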
Referring now to Figure 39, a block diagram is shown illustrating certain fields of the control and status register 127 of the NNU 121. As described above with respect to Figure 26B, the fields include the address 2602 of the weight RAM row most recently written by the NPUs 126 executing the non-architectural program, the address 2604 of the weight RAM row most recently read by the NPUs 126 executing the non-architectural program, the address 2606 of the data RAM row most recently written by the NPUs 126 executing the non-architectural program, and the address 2608 of the data RAM row most recently read by the NPUs 126 executing the non-architectural program. Additionally, the fields include an NNU program counter 3912, a loop count 3914 and an iteration count 3916. As described above, the architectural program may read the status register 127 into the media registers 118 and/or general-purpose registers 116 (e.g., via an MFNN instruction 1500), including the NNU program counter 3912, loop count 3914 and iteration count 3916 field values. The program counter 3912 value reflects the value of the program counter 3802 of Figure 38. The loop count 3914 value reflects the value of the loop counter 3804. The iteration count 3916 value reflects the value of the iteration counter 3806. In one embodiment, the sequencer 128 updates the program counter 3912, loop count 3914 and iteration count 3916 field values each time it modifies the program counter 3802, loop counter 3804 or iteration counter 3806, so that the field values are current when the architectural program reads them. In another embodiment, when the NNU 121 executes an architectural instruction that reads the status register 127, the NNU 121 simply obtains the program counter 3802, loop counter 3804 and iteration counter 3806 values and provides them back to the architectural instruction (e.g., into a media register 118 or general-purpose register 116).
As may be observed from the above, the values of the fields of the status register 127 of Figure 39 may be characterized as information that indicates the progress made by the non-architectural program during its execution by the NNU. Certain specific aspects of the non-architectural program's progress have been described above, such as the program counter 3802 value, the loop counter 3804 value, the iteration counter 3806 value, the most recently written/read 2602/2604 weight RAM 124 address 125, and the most recently written/read 2606/2608 data RAM 122 address 123. The architectural program executing on the processor 100 may read the non-architectural program progress values of Figure 39 from the status register 127 and use the information to make decisions, e.g., via architectural instructions such as compare and branch instructions. For example, the architectural program decides which rows to write/read data/weights to/from with respect to the data RAM 122 and/or weight RAM 124 in order to control the flow of data into and out of the data RAM 122 or weight RAM 124, particularly for large data sets and/or for overlapped execution instances of different non-architectural programs. Examples of such decision-making by the architectural program are described herein.
For example, as described above with respect to Figure 26A, the architectural program configures the non-architectural program to write back the results of the convolutions to rows of the data RAM 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program reads the results from the data RAM 122 as the NNU 121 writes them, by using the address 2606 of the most recently written data RAM 122 row.
For another example, as described above with respect to Figure 26B, the architectural program uses the information from the status register 127 fields of Figure 38 to determine the progress of the non-architectural program in performing the convolution of the data array 2404 of Figure 24 in five chunks of 512 x 1600. The architectural program writes the first 512 x 1600 chunk of the 2560 x 1600 data array 2404 into the weight RAM 124 and starts the non-architectural program, which has a loop count of 1600 and an initialized weight RAM 124 output row of 0. As the NNU 121 executes the non-architectural program, the architectural program reads the status register 127 to determine the most recently written row 2602 of the weight RAM 124, so that it may read the valid convolution results written by the non-architectural program and overwrite them with the next 512 x 1600 chunk after it has read them, so that when the NNU 121 completes the non-architectural program on the first 512 x 1600 chunk, the processor 100 can immediately update the non-architectural program as needed and start it again to process the next 512 x 1600 chunk.
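The flow-control pattern of this example, polling the most-recently-written weight RAM row and consuming each result row exactly once, can be sketched as follows. The `read_status` and `read_row` callables are hypothetical stand-ins for MFNN-based accesses, and the field name is invented:

```python
# Sketch of the Figure 26B polling loop: the architectural program drains
# convolution result rows as the non-architectural program produces them,
# freeing those rows to be overwritten by the next 512 x 1600 chunk.
def drain_results(read_status, read_row, total_rows):
    """Read each result row exactly once, as soon as it becomes valid."""
    consumed = 0
    results = []
    while consumed < total_rows:
        # Field 2602: index of the most recently written weight RAM row.
        last_written = read_status()["last_written_wram_row"]
        while consumed <= last_written:
            results.append(read_row(consumed))  # this row may now be overwritten
            consumed += 1
    return results

# Simulated NNU that advances 400 rows per status poll, up to row 1599:
progress = {"row": -1}
def fake_status():
    progress["row"] = min(progress["row"] + 400, 1599)
    return {"last_written_wram_row": progress["row"]}

rows = drain_results(fake_status, lambda r: r, 1600)
assert rows == list(range(1600))
```

The invariant is that the consumer never reads past `last_written_wram_row`, so it only ever sees rows the non-architectural program has already finished writing.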
For another example, assume the architectural program has the NNU 121 perform a series of classic neural network multiply-accumulate-activation-function computations in which the weights are stored in the weight RAM 124 and the results are written back to the data RAM 122. In that case, once the non-architectural program has read a row of the weight RAM 124, it will not read it again. So, the architectural program may be configured to begin overwriting the weights in the weight RAM 124 with new weights for the next instance of execution of the non-architectural program (e.g., for the next neural network layer) once the current weights have been read/used by the non-architectural program. In that case, the architectural program reads the status register 127 to obtain the address 2604 of the most recently read weight RAM row, to decide where the new set of weights may be written into the weight RAM 124.
For another example, assume the architectural program knows that the non-architectural program includes an execute instruction with a large iteration count, such as the non-architectural multiply-accumulate instruction at address 2 of Figure 20. In that case, the architectural program may need to know the iteration count 3916 in order to know approximately how many more clock cycles it will take to complete the non-architectural instruction, so that the architectural program can decide which of two or more actions to take next. For example, if the time is long, the architectural program may relinquish control to another program, such as the operating system. Similarly, assume the architectural program knows that the non-architectural program includes a loop body with a relatively large loop count, such as the non-architectural program of Figure 28. In that case, the architectural program may need to know the loop count 3914 in order to know approximately how many more clock cycles it will take to complete the non-architectural program, so that the architectural program can decide which of two or more actions to take next.
For another example, assume the architectural program has the NNU 121 perform a pooling operation similar to that described with respect to Figures 27 and 28, in which the data to be pooled is stored in the weight RAM 124 and the results are written back to the weight RAM 124. However, unlike the example of Figures 27 and 28, assume the results are written back to the top 400 rows of the weight RAM 124, e.g., rows 1600 to 1999. In that case, once the non-architectural program has read the four rows of the weight RAM 124 that it pools, it will not read them again. So, the architectural program may be configured to begin overwriting the data in the weight RAM 124 with new data once the current four rows have all been read/used by the non-architectural program (e.g., with the weights for the next instance of execution of the non-architectural program, for example, to perform a classic multiply-accumulate-activation-function operation on the pooled data). In that case, the architectural program reads the status register 127 to obtain the address 2604 of the most recently read weight RAM row, to decide where the new set of weights may be written into the weight RAM 124.
The examples above may also be carried out by the embodiments described below with respect to Figures 41 through 46, in which the NNU 121 is coupled to processing cores by a ring bus, rather than being an execution unit of a processing core, and a system memory is also coupled to the ring bus. In such embodiments, because the transfer of data/weights between the NNU 121 (e.g., the data RAM 122 and weight RAM 124) and the cores and/or system memory may incur longer latencies than in embodiments in which the NNU 121 is an execution unit of a core, it may be particularly advantageous for the architectural program to decide where to read/write data with respect to the data RAM 122 and/or weight RAM 124 based on the progress of the non-architectural program. Furthermore, it may be beneficial to enable the NNU 121 to interrupt a core in a highly tunable fashion in order to manage the interrupt latencies associated with the core and its operating system; such embodiments are described below with respect to Figures 47 through 53.
Referring now to Figure 40, a block diagram is shown illustrating an embodiment of a portion of the NNU 121. The NNU 121 includes a move unit 5802, a move register 5804, data mux-regs 208, weight mux-regs 705, NPUs 126, a multiplexer 5806, output units 5808 and output registers 1104. The data mux-regs 208 and weight mux-regs 705 are similar to those described above, but are modified to additionally receive inputs from the move register 5804 and from additional neighboring NPUs 126. In one embodiment, in addition to the output 209 from NPU J+1 as described above, the data mux-reg 208 also receives on its input 211 the outputs 209 from NPUs J-1 and J-4; similarly, in addition to the output 203 from NPU J+1 as described above, the weight mux-reg 705 also receives on its input 711 the outputs 203 from NPUs J-1 and J-4. The output registers 1104 are similar to the row buffer 1104 and output row buffer 1104 described above. The output units 5808 are similar in many respects to the activation function units 212/1112 described above and may include activation functions (e.g., sigmoid, hyperbolic tangent, rectify, softplus); however, the output units 5808 preferably also include a re-quantization unit that re-quantizes the accumulator 202 value, embodiments of which are described below. The NPUs 126 are similar in many respects to those described above. As described above, different embodiments are contemplated in which the data word width and weight word width may be of various sizes (e.g., 8, 9, 12 or 16 bits), and multiple word sizes may be supported by a given embodiment (e.g., 8 and 16 bits). However, a representative embodiment is illustrated below in which the width of the data words and weight words held in the memories 122/124, the move register 5804, the mux-regs 208/705 and the output registers 1104 is 8 bits, i.e., a byte.
Figure 40 illustrates a slice of the NNU 121. For example, the NPU 126 shown is representative of the array of NPUs 126 (described above). The representative NPU 126 is referred to as NPU[J] of the N NPUs 126, where J is between 0 and N-1. As described above, N is a large number, and preferably a power of two. As described above, N may be 512, 1024 or 2048. In one embodiment, N is 4096. Because of the large number of NPUs 126 in the array, it is advantageous for each NPU 126 to be as small as possible in order to keep the size of the NNU 121 within a desired limit and/or to accommodate more NPUs 126 to increase the acceleration of the neural-network-related computations performed by the NNU 121.
Additionally, although the move unit 5802 and the move register 5804 are each N bytes wide, only a portion of the move register 5804 is shown. Specifically, the portion of the move register 5804 whose output 5824 provides a byte to the mux-regs 208/705 of NPU[J] 126 is shown, denoted move register [J] 5804. Additionally, although the output 5822 of the move unit 5802 provides N bytes (to the memories 122/124 and to the move register 5804), only byte J is shown being provided for loading into move register [J] 5804, which in turn provides byte J on its output 5824 to the data mux-reg 208 and the weight mux-reg 705.
Additionally, although the NNU 121 includes multiple output units 5808, only a single output unit 5808 is shown in Figure 40, namely the output unit 5808 that performs operations on the accumulator outputs 217 of NPU[J] 126 and of the other NPUs 126 in its NPU group (e.g., as described above with respect to Figure 11). The output unit 5808 is referred to as output unit [J/4] because, in the embodiment of Figure 40, each output unit 5808 is shared by a group of four NPUs 126. Similarly, although the NNU 121 includes multiple multiplexers 5806, only a single multiplexer 5806 is shown in Figure 40, namely the multiplexer 5806 that receives the accumulator outputs 217 of NPU[J] 126 and of the other NPUs 126 in its NPU group. Likewise, the multiplexer 5806 is referred to as mux [J/4] because it selects one of the four accumulator 202 outputs 217 to provide to output unit [J/4] 5808.
Finally, although the output register 1104 is N bytes wide, only a single 4-byte segment is shown in Figure 40, denoted output register [J/4] 1104, which receives the four quantized bytes produced by output unit [J/4] 5808 from the four NPUs 126 of the NPU group that includes NPU[J] 126. All N bytes of the output 133 of the output register 1104 are provided to the move unit 5802, although only the four bytes of the 4-byte segment of output register [J/4] 1104 are shown in Figure 40. Additionally, the four bytes of the 4-byte segment of output register [J/4] 1104 are provided as inputs to the mux-regs 208/705, as described in more detail above, e.g., with respect to Figures 49 and 52 of the earlier application.
Although the mux-regs 208/705 are shown in Figure 40 as distinct from the NPU 126, there is a corresponding pair of mux-regs 208/705 associated with each NPU 126, and the mux-regs 208/705 may be considered part of the NPU 126, as described above, e.g., with respect to Figures 2 and 7.
The output 5822 of the move unit 5802 is coupled to the move register 5804, the data RAM 122 and the weight RAM 124, each of which may be written by the output 5822. The output 5822 of the move unit 5802, the move register 5804, the data RAM 122 and the weight RAM 124 are all N bytes wide (e.g., N is 4096). The move unit 5802 receives N quantized bytes from each of five different sources and selects one of them as its input: the data RAM 122, the weight RAM 124, the move register 5804, the output register 1104, and an immediate value. Preferably, the move unit 5802 comprises multiple multiplexers interconnected so as to be able to perform operations on its input to generate its output 5822, which operations will now be described.
The operations the move unit 5802 performs on its input include: passing the input through to the output; rotating the input by a specified amount; and extracting and packing specified bytes of the input. The operation is specified in a MOVE instruction fetched from the program memory 129. In one embodiment, the specifiable rotation amounts are 8, 16, 32 and 64 bytes. In one embodiment, the rotation direction is to the left, although other embodiments are contemplated in which the rotation direction is to the right, or in either direction. In one embodiment, the extract-and-pack operation is performed within input blocks of a predetermined size. The block size is specified by the MOVE instruction. In one embodiment, the predetermined block sizes are 16, 32 and 64 bytes, and the blocks lie on aligned boundaries of the specified block size. Thus, for example, when the MOVE instruction specifies a block size of 32, the move unit 5802 extracts the specified bytes within each 32-byte block of the N bytes of the input (e.g., 128 blocks if N is 4096) and packs them within the corresponding 32-byte block (preferably at one end of the block). In one embodiment, the NNU 121 also includes an N-bit mask register (not shown) associated with the move register 5804. A MOVE instruction that specifies a load-mask-register operation may specify a row of the data RAM 122 or weight RAM 124 as its source. In response to a MOVE instruction that specifies the load-mask-register operation, the move unit 5802 extracts bit 0 from each of the N words of the RAM row and stores the N bits into the corresponding bits of the N-bit mask register. During execution of a subsequent MOVE instruction that writes the move register 5804, the bits of the bitmask serve as write enables/disables for the corresponding bytes of the move register 5804. In an alternative embodiment, a 64-bit mask is specified by an INITIALIZE instruction, which is executed to load the mask register before a MOVE instruction that specifies the extract-and-pack function; in response to the MOVE instruction, the move unit 5802 extracts, within each block (e.g., of the 128 blocks), the bytes specified by the 64-bit mask stored in the mask register. In an alternative embodiment, the MOVE instruction used to specify the extract-and-pack operation also specifies a stride and an offset; in response to the MOVE instruction, the move unit 5802 extracts every Nth byte within each block, where N is the stride, starting with the byte specified by the offset, and packs the extracted bytes together. For example, if the MOVE instruction specifies a stride of 3 and an offset of 2, the move unit 5802 extracts every third byte starting with byte 2 within each block.
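The stride/offset form of the extract-and-pack operation can be illustrated with a minimal software sketch. This is not the hardware implementation; it assumes the packed bytes land at the low end of each block and that the remaining byte positions are unspecified (shown here as zeros).

```python
def extract_and_pack(data: bytes, block_size: int, stride: int, offset: int) -> bytes:
    """Within each aligned block, extract every `stride`-th byte starting at
    `offset` and pack the extracted bytes at the front of the block."""
    assert len(data) % block_size == 0
    out = bytearray(len(data))  # unpacked positions left as zero (an assumption)
    for base in range(0, len(data), block_size):
        block = data[base:base + block_size]
        extracted = block[offset::stride]      # e.g. bytes 2, 5, 8, ... for stride 3, offset 2
        out[base:base + len(extracted)] = extracted
    return bytes(out)
```

With a 32-byte block, stride 3 and offset 2 (the example above), bytes 2, 5, 8, ..., 29 of each block are packed at the block's low end.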
Ring-Bus-Coupled Neural Network Unit
The foregoing describes embodiments in which the NNU 121 is an execution unit of the processor 100. Embodiments will now be described in which the NNU 121 resides on a ring bus along with multiple conventional processing cores of a multi-core processor and operates as a neural network accelerator that is shared by the cores, performing neural network computations on their behalf faster than the cores could perform them. In many respects, the NNU 121 operates like a peripheral device, in that programs running on the cores can direct the NNU 121 to perform neural network computations. Preferably, the multi-core processor and the NNU 121 are fabricated on a single integrated circuit. Because the NNU 121 may be quite large, particularly in embodiments with a large number of NPUs 126 and large memories 122/124 (e.g., 4096 NPUs 126 with a 4096-byte-wide data RAM 122 and weight RAM 124), such an embodiment offers the advantage of not growing the size of every core by the size of the NNU 121; instead, there are fewer NNUs 121 than cores and the cores share the NNU 121, which allows the integrated circuit to be smaller, albeit at potentially lower performance.
Referring now to Figure 41, a block diagram illustrating the processor 100 is shown. The processor 100 includes a plurality of ring stops 4004 connected to one another in a bidirectional fashion to form a ring bus 4024. The embodiment of Figure 41 includes seven ring stops, denoted 4004-0, 4004-1, 4004-2, 4004-3, 4004-M, 4004-D and 4004-U. The processor 100 includes four core complexes 4012, referred to as core complex 0 4012-0, core complex 1 4012-1, core complex 2 4012-2 and core complex 3 4012-3, which include the four ring stops 4004-0, 4004-1, 4004-2 and 4004-3, respectively, that couple the core complexes 4012 to the ring bus 4024. The processor 100 also includes an uncore portion 4016, which includes the ring stop 4004-U that couples the uncore 4016 to the ring bus 4024. The processor 100 also includes a dynamic random access memory (DRAM) controller 4018 coupled to the ring bus 4024 by the ring stop 4004-D. Finally, the processor 100 includes the NNU 121 coupled to the ring bus 4024 by the ring stop 4004-M. In one embodiment, described in U.S. non-provisional applications 15/366,027, 15/366,053 and 15/366,057 (hereinafter collectively referred to as the "Dual-Use NNU Memory Array Applications," each filed December 1, 2016 and incorporated herein by reference in its entirety), the NNU 121 includes a memory array that, as described therein, may be used either as memory for the array of NPUs 126 of the NNU 121 (e.g., the weight RAM 124 of Figure 1) or as cache memory shared by the core complexes 4012, e.g., as a victim cache or as a last-level cache (LLC) slice. Although the example of Figure 41 includes four core complexes 4012, other embodiments with different numbers of core complexes 4012 are contemplated. For example, in one embodiment the processor 100 includes eight core complexes 4012.
The uncore 4016 includes a bus controller 4014 that controls access by the processor 100 to a system bus 4022 to which peripheral devices may be coupled, such as video controllers, disk controllers, peripheral bus controllers (e.g., PCI-E), and so forth. In one embodiment, the system bus 4022 is the well-known V4 bus. The uncore 4016 may also include other functional units, such as a power management unit and private RAM (e.g., non-architectural memory used by microcode of the cores 4002). In an alternative embodiment, the DRAM controller 4018 is coupled to the system bus, and the NNU 121 accesses system memory via the ring bus 4024, the bus controller 4014 and the DRAM controller 4018.
The DRAM controller 4018 controls DRAM that serves as system memory (e.g., asynchronous DRAM or synchronous DRAM (SDRAM), double-data-rate synchronous DRAM, direct Rambus DRAM, or reduced-latency DRAM). The core complexes 4012, the uncore 4016 and the NNU 121 access system memory via the ring bus 4024. More specifically, via the ring bus 4024 the NNU 121 reads the weights and data of a neural network from system memory into the data RAM 122 and the weight RAM 124, and writes neural network results from the data RAM 122 and the weight RAM 124 to system memory. Additionally, when operating as a victim cache, the memory array (e.g., the data RAM 122 or the weight RAM 124) evicts cache lines to system memory under the control of cache control logic. Additionally, when operating as an LLC slice, the memory array and the cache control logic fill cache lines from system memory and write back and evict cache lines to system memory.
The four core complexes 4012 include respective LLC slices 4006-0, 4006-1, 4006-2 and 4006-3, each of which is coupled to a ring stop 4004, and which are referred to individually generically as LLC slice 4006 and collectively as LLC slices 4006. Each core 4002 includes a cache memory, such as a level-2 (L2) cache 4008, coupled to a ring stop 4004. Each core 4002 may also include a level-1 cache (not shown). In one embodiment, the cores 4002 are x86 instruction set architecture (ISA) cores, although other embodiments are contemplated in which the cores 4002 are cores of another ISA (e.g., ARM, SPARC, MIPS).
As shown in Figure 41, the LLC slices 4006-0, 4006-1, 4006-2 and 4006-3 collectively form an LLC 4005 of the processor 100 shared by the core complexes 4012. Each LLC slice 4006 includes a memory array and cache control logic. As described in the Dual-Use NNU Memory Array Applications incorporated by reference above, a mode indicator may be set such that the memory array of the NNU 121 serves as an additional (e.g., fifth or ninth) slice 4006-4 of the LLC 4005. In one embodiment, each LLC slice 4006 comprises a 2 MB memory array, although other embodiments with different sizes are contemplated. Furthermore, embodiments are contemplated in which the size of the NNU memory array and the size of the LLC slices 4006 are different. Preferably, the LLC 4005 is inclusive of the L2 caches 4008 and of any other caches in the cache hierarchy (e.g., the L1 caches).
The ring bus 4024, or ring 4024, is a scalable bidirectional interconnect that facilitates communication among coherent components, including the DRAM controller 4018, the uncore 4016 and the LLC slices 4006. The ring 4024 comprises two unidirectional rings, each of which further comprises five sub-rings: Request, which transmits most types of request packets, including loads; Snoop, which transmits snoop request packets; Acknowledge, which transmits response packets; Data, which transmits data packets and certain request items, including writes; and Credit, which emits and obtains credits in remote queues. Each node attached to the ring 4024 is connected via a ring stop 4004, which includes queues that send and receive packets on the ring 4024, such as the queues described in more detail with respect to Figures 42 through 44. A queue is either an egress queue that initiates requests on the ring 4024 on behalf of an attached component, to be received in a remote queue, or an ingress queue that receives requests from the ring 4024 to be forwarded to an attached component. Before an egress queue initiates a request on the ring, it first obtains a credit on the Credit ring from the remote destination ingress queue. This ensures that the remote ingress queue has resources available to process the request when it arrives. When an egress queue wishes to send a transaction packet on the ring 4024, it may only do so when it would not preempt an incoming packet destined for a downstream node. When an incoming packet arrives at a ring stop 4004 from either direction, the destination ID of the packet is interrogated to determine whether the ring stop 4004 is the final destination of the packet. If the destination ID is not equal to the node ID of the ring stop 4004, the packet proceeds to the next ring stop 4004 on the subsequent clock. Otherwise, the packet leaves the ring 4024 on the same clock and is consumed by whichever ingress queue is implicated by the transaction type of the packet.
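The credit mechanism described above amounts to reserving an entry in the remote ingress queue before a packet is ever placed on the ring. The following is a minimal sketch of that flow-control idea, not the hardware design; the class and method names are hypothetical.

```python
class IngressQueue:
    """Remote ingress queue: its free entries back the credits it grants."""
    def __init__(self, n_entries: int):
        self.credits_available = n_entries
        self.entries = []

    def grant_credit(self) -> bool:
        # Credit granted only if a free entry exists to receive the packet.
        if self.credits_available == 0:
            return False
        self.credits_available -= 1
        return True

    def receive(self, packet):
        # A credit was obtained in advance, so space is guaranteed here.
        self.entries.append(packet)

    def consume(self):
        # Packet handled by the attached component; credit returned on Credit ring.
        self.entries.pop(0)
        self.credits_available += 1


def send(packet, ingress: IngressQueue) -> bool:
    """Egress side: obtain a credit first, then put the packet on the ring."""
    if not ingress.grant_credit():
        return False  # stall until a credit comes back
    ingress.receive(packet)
    return True
```

Because the credit is obtained before the request is launched, the packet can never arrive at a full ingress queue, so no retry or backpressure path is needed on the Request ring itself.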
Generally, the LLC 4005 comprises N LLC slices 4006, each of which caches a different, approximately 1/N-sized portion of the physical address space of the processor 100, as determined by a hashing algorithm, or simply hash. The hash is a function that takes a physical address as input and selects the LLC slice responsible for caching that physical address. When a request must be made to the LLC 4005, whether from a core 4002 or from a snoop agent, the request must be sent to the appropriate LLC slice 4006 responsible for caching the physical address of the request. The appropriate LLC slice 4006 is determined by applying the hash to the physical address of the request.
A hashing algorithm is a surjective function whose domain is the set of physical addresses, or a subset thereof, and whose range is the number of currently included LLC slices 4006; more specifically, the range is the set of indices of the LLC slices 4006 (e.g., 0 through 7 in the case of eight LLC slices 4006). The function may be computed by examining an appropriate subset of the physical address bits. For example, in a system with eight LLC slices 4006, the output of the hashing algorithm may simply be PA[10:8], i.e., the three physical address bits from bit 8 through bit 10. In another embodiment in which the number of LLC slices 4006 is eight, the output of the hash is a logical function of other address bits, e.g., the three bits {PA[17], PA[14], PA[12]^PA[10]^PA[9]}.
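Both eight-slice hashes named above can be sketched directly in software. This is a bit-manipulation illustration only; the function names are hypothetical and the bit choices are exactly those stated in the text.

```python
def bit(pa: int, i: int) -> int:
    """Extract bit i of physical address pa."""
    return (pa >> i) & 1

def slice_simple(pa: int) -> int:
    # PA[10:8]: three consecutive physical-address bits select one of 8 slices.
    return (pa >> 8) & 0x7

def slice_xor(pa: int) -> int:
    # {PA[17], PA[14], PA[12]^PA[10]^PA[9]} packed into a 3-bit slice index.
    return (bit(pa, 17) << 2) | (bit(pa, 14) << 1) | (bit(pa, 12) ^ bit(pa, 10) ^ bit(pa, 9))
```

The XOR variant spreads addresses that differ only in low-order row bits across slices, which can reduce hot-spotting on strided access patterns.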
All requesters of the LLC 4005 must have the same hashing algorithm before any caching by the LLC 4005 is performed. Because the hash defines where addresses are cached and where snoops are sent during operation, the hash may only be changed through coordination among all cores 4002, LLC slices 4006 and snoop agents. As described in the Dual-Use NNU Memory Array Applications, updating the hashing algorithm essentially comprises: (1) synchronizing all cores 4002 to prevent new cacheable accesses; (2) performing a write-back-invalidate of all LLC slices 4006 currently included in the LLC 4005, which causes modified cache lines to be written back to system memory and all cache lines to be invalidated (as described below, the write-back-invalidate may be a selective write-back-invalidate in which only those cache lines whose addresses hash, under the new hashing algorithm, to a different slice than under the old hashing algorithm are evicted, i.e., invalidated, and, if modified, written back before invalidation); (3) broadcasting a hash-update message to each core 4002 and snoop source, which commands them to change to the new hash (as described below, from an inclusive hash to an exclusive hash, or vice versa); (4) updating the mode input to the selection logic that controls access to the memory array; and (5) resuming execution with the new hashing algorithm.
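The selective write-back-invalidate of step (2) can be sketched as follows. This is a behavioral illustration, not the cache control logic; the `Line` record and function names are hypothetical, and the hashes are passed in as plain functions.

```python
from collections import namedtuple

Line = namedtuple("Line", "addr modified")

def selective_writeback_invalidate(cache_lines, old_hash, new_hash, write_back):
    """Keep only lines whose address maps to the same slice under both hashes;
    evicted lines that are modified are written back before invalidation."""
    survivors = []
    for line in cache_lines:
        if old_hash(line.addr) == new_hash(line.addr):
            survivors.append(line)      # still cached in the correct slice
        else:
            if line.modified:
                write_back(line)        # flush dirty data to system memory
            # line invalidated by omission from survivors
    return survivors
```

Only the lines that would otherwise be found in the wrong slice under the new hash are evicted, so the cost of a hash change scales with the fraction of addresses that move rather than with the whole LLC.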
The hashing algorithms described above are useful when the number N of LLC slices 4006 is 8, i.e., a power of 2, and they may be modified to readily accommodate other powers of 2, e.g., PA[9:8] for 4 slices or PA[11:8] for 16 slices. However, depending upon whether the NNU LLC slice 4006-4 is included in the LLC 4005 (and upon the number of core complexes 4012), N may or may not be a power of 2. Therefore, as described in the Dual-Use NNU Memory Array Applications, at least two different hashes may be used when the NNU 121 memory array has a dual use.
In an alternative embodiment, the NNU 121 and the DRAM controller 4018 are both coupled to a single ring stop 4004. The single ring stop 4004 includes an interface by which the NNU 121 and the DRAM controller 4018 transmit requests and data between each other, rather than transmitting the requests and data via the ring bus 4024. This may be advantageous because it may reduce traffic on the ring bus 4024 and provide higher throughput between the NNU 121 and system memory.
Preferably, the processor 100 is fabricated on a single integrated circuit, or chip. Thus, data may be transferred between the system memory and/or the LLC 4005 and the NNU 121 at a very high sustained rate, which may be particularly advantageous for neural network applications, especially those with a relatively large amount of weights and/or data. That is, although the NNU 121 is not an execution unit of a core 4002 as in the embodiment of Figure 1, it is closely coupled to the cores 4002, which may provide a significant memory performance advantage over, for example, a neural network unit coupled to a peripheral bus, such as a PCIe bus.
Referring now to Figure 42, a block diagram illustrating the ring stop 4004-N of Figure 41 in more detail is shown. The ring stop 4004-N includes a slave interface 6301, a first master interface 6302-0 referred to as master interface 0, and a second master interface 6302-1 referred to as master interface 1. Master interface 0 6302-0 and master interface 1 6302-1 are referred to individually generically as master interface 6302 and collectively as master interfaces 6302. The ring stop 4004-N also includes three arbiters 6362, 6364 and 6366 that feed respective buffers 6352, 6354 and 6356, which provide outgoing requests (REQ), data (DATA) and acknowledgments (ACK), respectively, on the first unidirectional ring 4024-0 of the ring bus 4024; the three arbiters 6362, 6364 and 6366 receive incoming requests (REQ), data (DATA) and acknowledgments (ACK), respectively, on the first unidirectional ring 4024-0. The ring stop 4004-N also includes three additional arbiters 6342, 6344 and 6346 that feed respective additional buffers 6332, 6334 and 6336, which provide outgoing requests (REQ), data (DATA) and acknowledgments (ACK), respectively, on the second unidirectional ring 4024-1 of the ring bus 4024; the three arbiters 6342, 6344 and 6346 receive incoming requests (REQ), data (DATA) and acknowledgments (ACK), respectively, on the second unidirectional ring 4024-1. The Request, Data and Acknowledge sub-rings of each unidirectional ring of the ring bus 4024 are described above. The Snoop and Credit sub-rings are not shown, but the slave interface 6301 and the master interfaces 6302 are also coupled to the Snoop and Credit sub-rings.
The slave interface 6301 includes a load queue 6312 and a store queue 6314; master interface 0 6302-0 includes a load queue 6322 and a store queue 6324; and master interface 1 6302-1 includes a load queue 6332 and a store queue 6334. The load queue 6312 of the slave interface 6301 receives requests from both unidirectional rings 4024-0 and 4024-1 of the ring bus 4024 and queues them, and provides queued data to each of the corresponding arbiters 6364 and 6344 of the ring bus 4024. The store queue 6314 of the slave interface 6301 receives data from both directions of the ring bus 4024 and queues it, and provides acknowledgments to each of the corresponding arbiters 6366 and 6346 of the ring bus 4024. The load queue 6322 of master interface 0 6302-0 receives data from the second unidirectional ring 4024-1 and provides queued requests to the arbiter 6362 of the first unidirectional ring 4024-0. The store queue 6324 of master interface 0 6302-0 receives acknowledgments from the second unidirectional ring 4024-1 and provides queued data to the arbiter 6364 of the first unidirectional ring 4024-0. The load queue 6332 of master interface 1 6302-1 receives data from the first unidirectional ring 4024-0 and provides queued requests to the arbiter 6342 of the second unidirectional ring 4024-1. The store queue 6334 of master interface 1 6302-1 receives acknowledgments from the first unidirectional ring 4024-0 and provides queued data to the arbiter 6344 of the second unidirectional ring 4024-1. The load queue 6312 of the slave interface 6301 provides queued requests to the NNU 121 and receives data from the NNU 121. The store queue 6314 of the slave interface 6301 provides queued requests and data to the NNU 121 and receives acknowledgments from the NNU 121. The load queue 6322 of the first master interface 0 6302-0 receives requests from the NNU 121 and queues them, and provides data to the NNU 121. The store queue 6324 of the first master interface 0 6302-0 receives requests and data from the NNU 121 and queues them, and provides acknowledgments to the NNU 121. The load queue 6332 of the second master interface 1 6302-1 receives requests from the NNU 121 and queues them, and provides data to the NNU 121. The store queue 6334 of the second master interface 1 6302-1 receives requests and data from the NNU 121 and queues them, and provides acknowledgments to the NNU 121.
Generally, the slave interface 6301 receives requests made by the cores 4002 to load data from the NNU 121 (received by the load queue 6312) and requests made by the cores 4002 to store data to the NNU 121 (received by the store queue 6314), although the slave interface 6301 may also receive such requests from other ring bus 4024 agents. For example, via the slave interface 6301, a core 4002 may: write control data to and read status data from the control/status register 127; write instructions to the program memory 129; write/read data/weights to/from the data RAM 122 and the weight RAM 124; and write control words to the bus control memory 6636 to program the DMA controllers 6602 of the NNU 121 (see Figure 45). More specifically, in the embodiment in which the NNU 121 resides on the ring bus 4024 rather than being an execution unit of a core 4002, the cores 4002 may write the control/status register 127 to instruct the NNU 121 to perform operations similar to those described with respect to the MTNN instruction 1400 of Figure 14, and may read from the control/status register 127 to instruct the NNU 121 to perform operations similar to those described with respect to the MFNN instruction 1500 of Figure 15. The list of operations includes, but is not limited to: starting execution of the program in the program memory 129, pausing execution of the program in the program memory 129, requesting notification (e.g., an interrupt) of completion of execution of the program in the program memory 129, resetting the NNU 121, writing the DMA base registers, and writing a strobe address to write or read a row buffer with respect to the data/weight RAM 122/124. Additionally, the slave interface 6301 may generate an interrupt (e.g., a PCI interrupt) to each core 4002 at the request of the NNU 121. Preferably, the sequencer 128 instructs the slave interface 6301 to generate an interrupt, e.g., in response to decoding an instruction fetched from the program memory 129. Alternatively, a DMAC 6602 may instruct the slave interface 6301 to generate an interrupt, e.g., in response to completing a DMA operation (e.g., after writing data words that are the result of a neural network layer computation from the data RAM 122 to system memory). In one embodiment, the interrupt includes a vector, e.g., an 8-bit x86 interrupt vector. Preferably, a flag in the control word read from the bus control memory 6636 by the DMAC 6602 specifies whether the DMAC 6602 instructs the slave interface 6301 to generate an interrupt when the DMA operation completes.
Generally, via the master interfaces 6302 the NNU 121 generates requests to write data to system memory (received by the store queues 6324/6334) and requests to read data from system memory (e.g., via the DRAM controller 4018) (received by the load queues 6322/6332), although the master interfaces 6302 may also receive requests from other ring bus 4024 agents to read/write data with respect to the NNU 121. For example, via the master interfaces 6302, the NNU 121 may transfer data/weights from system memory to the data RAM 122 and the weight RAM 124, and may transfer data from the data RAM 122 and the weight RAM 124 to system memory.
Preferably, the various entities of the NNU 121 that are accessible via the ring bus 4024 (such as the data RAM 122, the weight RAM 124, the program memory 129, the bus control memory 6636 and the control/status register 127) are memory-mapped into system memory space. In one embodiment, the accessible NNU 121 entities are memory-mapped via Peripheral Component Interconnect (PCI) configuration registers of the well-known PCI configuration protocol.
An advantage of having two master interfaces 6302 for the ring stop 4004-N is that it enables the NNU 121 to concurrently transmit to and/or receive from both system memory (via the DRAM controller 4018) and the various LLC slices 4006, or alternatively to transmit and/or receive with respect to system memory concurrently at twice the bandwidth of an embodiment having a single master interface.
In one embodiment: the data RAM 122 is 64 KB arranged as 16 rows of 4 KB each, thus requiring 4 bits to specify its row address; the weight RAM 124 is 8 MB arranged as 2K rows of 4 KB each, thus requiring 11 bits to specify its row address; the program memory 129 is 8 KB arranged as 1K rows of 64 bits each, thus requiring 10 bits to specify its row address; the bus control memory 6636 is 1 KB arranged as 128 rows of 64 bits each, thus requiring 7 bits to specify its row address; and each of the queues 6312/6314/6322/6324/6332/6334 includes 16 entries, thus requiring 4 bits to specify the index of an entry. Additionally, the width of the Data sub-ring of a unidirectional ring 4024 of the ring bus 4024 is 64 bytes. Hence, 64-byte quantities are referred to herein as blocks, data blocks, or blocks of data ("data" is typically used to refer to both data and weights). Consequently, a row of the data RAM 122 or weight RAM 124, although not addressable at the block level, is subdivided into 64 blocks; additionally, the data/weight write buffers 6612/6622 (of Figure 45) and the data/weight read buffers 6614/6624 (of Figure 45) are each also subdivided into 64 blocks of 64 bytes each and are addressable at the block level; hence, 6 bits are needed to specify the address of a block within a row/buffer. These sizes are assumed in the description below for ease of illustration; however, other embodiments with various sizes are contemplated.
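The address-bit counts enumerated above all follow from the row/entry counts being powers of two. A one-line check of that arithmetic, under that power-of-two assumption:

```python
def addr_bits(n: int) -> int:
    """Bits needed to index n equal items (n assumed a power of 2)."""
    assert n & (n - 1) == 0 and n > 0
    return n.bit_length() - 1

assert addr_bits(16) == 4      # 16 rows of data RAM
assert addr_bits(2048) == 11   # 2K rows of weight RAM
assert addr_bits(1024) == 10   # 1K rows of program memory
assert addr_bits(128) == 7     # 128 rows of bus control memory
assert addr_bits(16) == 4      # 16 queue entries
assert addr_bits(64) == 6      # 64 blocks per row/buffer
```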
Referring now to Figure 43, a block diagram illustrating the slave interface 6301 of Figure 42 in more detail is shown. The slave interface 6301 includes the load queue 6312 and the store queue 6314, which are coupled to the ring bus 4024 of Figure 42, i.e., to the arbiters 6342, 6344, 6346, 6362, 6364 and 6366 and the buffers 6332, 6334, 6336, 6352, 6354 and 6356. Figure 43 also shows other requesters 6472 (e.g., master interface 0 6302-0) that generate requests to the arbiter 6362 and other requesters 6474 (e.g., master interface 1 6302-1) that generate requests to the arbiter 6342.
The slave load queue 6312 includes a queue of entries 6412 coupled to a request arbiter 6416 and a data arbiter 6414. In the illustrated embodiment, the queue includes 16 entries 6412. Each entry 6412 includes storage for an address, a source identifier, a direction, a transaction identifier, and a data block associated with the request. The address specifies the location within the NNU 121 of the data that the load request is to return to the requesting ring bus 4024 agent (e.g., a core 4002). The address may specify the control/status register 127 or a block location within the data RAM 122 or weight RAM 124. When the address specifies a block location within the data RAM 122/weight RAM 124, the upper bits specify a row of the data RAM 122/weight RAM 124 and the lower bits (e.g., 6 bits) specify a block within the specified row. Preferably, the lower bits are used to control the data/weight read buffer multiplexers 6615/6625 (see Figure 45) to select the appropriate block of the data/weight read buffers 6614/6624 (see Figure 45). The source identifier specifies the requesting ring bus 4024 agent. The direction specifies on which of the two unidirectional rings 4024-0 or 4024-1 the data is to be sent back to the requesting agent. The transaction identifier is specified by the requesting agent and is returned by the ring stop 4004-N to the requesting agent along with the requested data.
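The row/block split of a RAM block address described above can be sketched with simple bit arithmetic. This uses the weight RAM's 11 row bits and the 6 block bits stated earlier (the data RAM would use 4 row bits); the function name is hypothetical.

```python
ROW_BITS = 11    # weight RAM: 2K rows (data RAM would use 4)
BLOCK_BITS = 6   # 64 blocks of 64 bytes per 4 KB row

def decode(addr: int):
    """Split a RAM block address into (row, block): the upper bits select the
    row, and the low 6 bits select the 64-byte block within the row."""
    block = addr & ((1 << BLOCK_BITS) - 1)
    row = (addr >> BLOCK_BITS) & ((1 << ROW_BITS) - 1)
    return row, block
```

In hardware, the low 6 bits would steer the read-buffer multiplexers 6615/6625 rather than be computed as an integer.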
Each entry 6412 also has an associated state, which a finite state machine (FSM) updates. In one embodiment, the FSM operates as follows. When the load queue 6312 detects a load request on the ring bus 4024 destined for it, the load queue 6312 allocates an available entry 6412 and populates the allocated entry 6412, and the FSM updates the state of the allocated entry 6412 to NNU-request. The request arbiter 6416 arbitrates among the NNU-request entries 6412. When the allocated entry 6412 wins arbitration and the request is sent to the NNU 121, the FSM marks the entry 6412 as NNU-data-pending. When the NNU 121 responds with the requested data, the load queue 6312 loads the data into the entry 6412 and marks the entry 6412 as data-ring-request. The data arbiter 6414 arbitrates among the data-ring-request entries 6412. When the entry 6412 wins arbitration and the data is sent on the ring bus 4024 to the ring bus 4024 agent that requested the data, the FSM marks the entry 6412 as available and issues a credit on its Credit ring.
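The per-entry state machine just described is a four-state cycle. The following sketch captures only the transition structure, under assumed state and event names; the actual hardware FSM may have additional error or stall states not described in the text.

```python
from enum import Enum, auto

class LQState(Enum):
    AVAILABLE = auto()          # entry free for allocation
    NNU_REQUEST = auto()        # allocated; waiting to win request arbitration
    NNU_DATA_PENDING = auto()   # request sent to the NNU; waiting for data
    DATA_RING_REQUEST = auto()  # data buffered; waiting to win data-ring arbitration

_TRANSITIONS = {
    (LQState.AVAILABLE, "load_request_seen"): LQState.NNU_REQUEST,
    (LQState.NNU_REQUEST, "won_request_arb"): LQState.NNU_DATA_PENDING,
    (LQState.NNU_DATA_PENDING, "nnu_data"): LQState.DATA_RING_REQUEST,
    # entry freed and a credit issued on the Credit ring:
    (LQState.DATA_RING_REQUEST, "won_data_arb"): LQState.AVAILABLE,
}

def next_state(state: LQState, event: str) -> LQState:
    return _TRANSITIONS[(state, event)]
```

Each entry cycles independently through these four states, which is what lets sixteen in-flight loads be tracked with one small FSM per entry.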
The slave store queue 6314 includes a queue of entries 6422 coupled to a request arbiter 6426 and an acknowledge arbiter 6424. In the illustrated embodiment, the queue includes 16 entries 6422. Each entry 6422 includes storage for an address, a source identifier, and data associated with the request. The address specifies the location within the NNU 121 to which the data provided by the requesting ring bus 4024 agent (e.g., a core 4002) is to be stored. The address may specify the control/status register 127, a block location within the data RAM 122 or weight RAM 124, a location within the program memory 129, or a location within the bus control memory 6636. When the address specifies a block location within the data RAM 122/weight RAM 124, the upper bits specify a row of the data RAM 122/weight RAM 124 and the lower bits (e.g., 6 bits) specify a block within the specified row. Preferably, the lower bits are used to control the data/weight demultiplexers 6611/6621 to select the appropriate block of the data/weight write buffers 6612/6622 for writing (see Figure 45). The source identifier specifies the requesting ring bus 4024 agent.
Each entry 6422 also has an associated state, which a finite state machine (FSM) updates. In one embodiment, the FSM operates as follows. When the store queue 6314 detects a store request on the ring bus 4024 destined for it, the store queue 6314 allocates an available entry 6422 and populates the allocated entry 6422, and the FSM updates the state of the allocated entry 6422 to NNU-request. The request arbiter 6426 arbitrates among the NNU-request entries 6422. When an entry 6422 wins arbitration and is sent to the NNU 121 along with its data, the FSM marks the entry 6422 as NNU-acknowledge-pending. When the NNU 121 responds with an acknowledgment, the store FSM marks the entry 6422 as acknowledge-ring-request. The acknowledge arbiter 6424 arbitrates among the acknowledge-ring-request entries 6422. When the entry 6422 wins arbitration and the acknowledgment is sent on the Acknowledge ring to the ring bus 4024 agent that requested the store of the data, the FSM marks the entry 6422 as available and issues a credit on its Credit ring. The store queue 6314 also receives a wr_busy signal from the NNU 121 that instructs the store queue 6314 not to make requests to the NNU 121 until the wr_busy signal is no longer asserted.
Referring now to Figure 44, a block diagram illustrating master interface 0 6302-0 of Figure 42 in more detail is shown. Although Figure 44 shows master interface 0 6302-0, it is also representative of the details of master interface 1 6302-1 of Figure 42, and is therefore referred to generally as master interface 6302. The master interface 6302 includes the load queue 6322 and the store queue 6324, which are coupled to the ring bus 4024 of Figure 42, i.e., to the arbiters 6362, 6364 and 6366 and the buffers 6352, 6354 and 6356. Figure 44 also shows other acknowledge requesters 6576 (e.g., the slave interface 6301) that generate acknowledge requests to the arbiter 6366.
The master interface 6302 also includes an arbiter 6534 (not shown in Figure 42) that receives requests from the load queue 6322 and from other requesters 6572 (e.g., the DRAM controller 4018 in an embodiment in which the NNU 121 and the DRAM controller 4018 share the ring stop 4004-N) and presents the winning request to the arbiter 6362 of Figure 42. The master interface 6302 also includes a buffer 6544 that receives from the ring bus 4024 data associated with the entries 6512 of the load queue 6322 and provides it to the NNU 121. The master interface 6302 also includes an arbiter 6554 (not shown in Figure 42) that receives data from the store queue 6324 and from other requesters 6574 (e.g., the DRAM controller 4018 in the embodiment in which the NNU 121 and the DRAM controller 4018 share the ring stop 4004-N) and presents the winning data to the arbiter 6364 of Figure 42. The master interface 6302 also includes a buffer 6564 that receives from the ring bus 4024 acknowledgments associated with the entries 6522 of the store queue 6324 and provides them to the NNU 121.
The load queue 6322 includes a queue of entries 6512 coupled to an arbiter 6514. In the illustrated embodiment, the queue includes 16 entries 6512. Each entry 6512 includes storage for an address and a destination identifier. The address (46 bits in one embodiment) specifies a location in the ring bus 4024 address space (e.g., a system memory location). The destination identifier specifies the ring bus 4024 agent (e.g., system memory) from which the data is to be loaded.
The load queue 6322 receives master load requests from the NNU 121 (e.g., from a DMAC 6602) to load data from a ring bus 4024 agent (e.g., system memory) into the data RAM 122, weight RAM 124, program memory 129 or bus control memory 6636. A master load request specifies the destination identifier, the ring bus address, and the index of the load queue 6322 entry 6512 to be used. When the load queue 6322 receives a master load request from the NNU 121, the load queue 6322 fills the indexed entry 6512, and the FSM updates the entry 6512 state to requestor-credit. When the load queue 6322 obtains a credit from the credit ring to send a request for the data to the destination ring bus 4024 agent (e.g., system memory), the FSM updates the state to requestor-request-ring. The arbiter 6514 arbitrates among the requestor-request-ring entries 6512 (and the arbiter 6534 arbitrates between the load queue 6322 and the other requestors 6572). When an entry 6512 is granted the request ring, the request is sent on the request ring to the destination ring bus 4024 agent (e.g., system memory), and the FSM updates the state to awaiting-data-ring. When the ring bus 4024 responds with the data (e.g., from system memory), the data is received into the buffer 6544 and provided to the NNU 121 (e.g., to the data RAM 122, weight RAM 124, program memory 129 or bus control memory 6636), and the FSM updates the entry 6512 state to available. Preferably, the index of the entry 6512 is included in the data packet so that the load queue 6322 can determine the entry 6512 with which the data packet is associated. Preferably, the load queue 6322 provides the entry 6512 index along with the data to the NNU 121, which enables the NNU 121 to determine the entry 6512 with which the data is associated and enables the NNU 121 to reuse the entry 6512.
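The load queue entry states named above form the same kind of per-entry machine as on the store side. The following illustrative Python sketch (paraphrased state names, not patent terminology) models one entry 6512:

```python
from enum import Enum, auto

class LoadState(Enum):
    AVAILABLE = auto()
    REQUESTOR_CREDIT = auto()        # filled; waiting for a credit from the credit ring
    REQUESTOR_REQUEST_RING = auto()  # credit held; waiting to win the request ring
    AWAITING_DATA_RING = auto()      # request sent; waiting for the data response

class LoadEntryFSM:
    """Models the described life cycle of one main-interface load queue entry 6512."""
    def __init__(self):
        self.state = LoadState.AVAILABLE

    def on_master_load_request(self):
        assert self.state == LoadState.AVAILABLE
        self.state = LoadState.REQUESTOR_CREDIT

    def on_credit(self):
        assert self.state == LoadState.REQUESTOR_CREDIT
        self.state = LoadState.REQUESTOR_REQUEST_RING

    def on_request_sent(self):
        # The entry won arbitration and the request went out on the request ring.
        assert self.state == LoadState.REQUESTOR_REQUEST_RING
        self.state = LoadState.AWAITING_DATA_RING

    def on_data(self):
        # The data arrived, was buffered, and was handed to the NNU.
        assert self.state == LoadState.AWAITING_DATA_RING
        self.state = LoadState.AVAILABLE
```

Returning the entry to AVAILABLE only after the data is delivered is what lets the NNU reuse the entry index, as the text notes.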
The master store queue 6324 includes a queue of entries 6522 coupled to an arbiter 6524. In the illustrated embodiment, the queue includes 16 entries 6522. Each entry 6522 includes storage for an address, a destination identifier, a data field that holds the data to be stored, and a coherency flag. The address specifies a location in the ring bus 4024 address space (e.g., a system memory location). The destination identifier specifies the ring bus 4024 agent (e.g., system memory) to which the data is to be stored. The coherency flag is sent to the destination agent along with the data. If the coherency flag is set, it instructs the DRAM controller 4018 to snoop the LLC 4005 and invalidate the copy in the LLC 4005, if present. Otherwise, the DRAM controller 4018 writes the data to system memory without snooping the LLC 4005.
The store queue 6324 receives master store requests from the NNU 121 (e.g., from a DMAC 6602) to store data from the data RAM 122 or weight RAM 124 to a ring bus 4024 agent (e.g., system memory). A master store request specifies the destination identifier, the ring bus address, the index of the store queue 6324 entry 6522 to be used, and the data to be stored. When the store queue 6324 receives a master store request from the NNU 121, the store queue 6324 fills the indexed entry 6522, and the FSM updates the entry 6522 state to requestor-credit. When the store queue 6324 obtains a credit from the credit ring to send the data to the destination ring bus 4024 agent (e.g., system memory), the FSM updates the state to requestor-data-ring. The arbiter 6524 arbitrates among the requestor-data-ring entries 6522 (and the arbiter 6554 arbitrates between the store queue 6324 and the other requestors 6574). When an entry 6522 is granted the data ring, the data is sent on the data ring to the destination ring bus 4024 agent (e.g., system memory), and the FSM updates the state to awaiting-acknowledge-ring. When the ring bus 4024 responds with an acknowledgment of the data (e.g., from system memory), the acknowledgment is received into the buffer 6564. The store queue 6324 then provides the acknowledgment to the NNU 121 to notify the NNU 121 that the store has been performed, and the FSM updates the entry 6522 state to available. Preferably, the store queue 6324 need not arbitrate to provide the acknowledgment to the NNU 121 (e.g., as in the embodiment of Figure 45, in which there is a DMAC 6602 per store queue 6324). However, in an embodiment in which the store queue 6324 must arbitrate to provide the acknowledgment, when the ring bus 4024 responds with the acknowledgment, the FSM updates the entry 6522 state to requestor-NNU-complete, and once the entry 6522 wins arbitration and the acknowledgment is provided to the NNU 121, the FSM updates the entry 6522 state to available. Preferably, the index of the entry 6522 is included in the acknowledgment packet received from the ring bus 4024, which enables the store queue 6324 to determine the entry 6522 with which the acknowledgment packet is associated. The store queue 6324 provides the entry 6522 index along with the acknowledgment to the NNU 121, which enables the NNU 121 to determine the entry 6522 with which the acknowledgment is associated and enables the NNU 121 to reuse the entry 6522.
Referring now to Figure 45, a block diagram is shown illustrating a portion of a ring bus coupling embodiment of the ring stop 4004-N of Figure 42 and the NNU 121. The slave interface 6301, main interface 0 6302-0 and main interface 1 6302-1 of the ring stop 4004-N are shown. The ring bus coupling embodiment of the NNU 121 of Figure 45 includes embodiments of the data RAM 122, weight RAM 124, program memory 129, sequencer 128 and control/status register 127 described in detail above. The ring bus coupling embodiment of the NNU 121 is similar in many respects to the execution unit embodiments described above, and for brevity those aspects are not re-described. The ring bus coupling embodiment of the NNU 121 also includes the elements described in Figure 40, e.g., the move units 5802, move register file 5804, mux-regs 208/705, NPUs 126, multiplexer 5806, output units 5808 and output register 1104. The NNU 121 also includes a first direct memory access controller (DMAC0) 6602-0, a second direct memory access controller (DMAC1) 6602-1, a bus control memory 6636, a data demultiplexer 6611, a data write buffer 6612, a data RAM multiplexer 6613, a data read buffer 6614, a data read buffer multiplexer 6615, a weight demultiplexer 6621, a weight write buffer 6622, a weight RAM multiplexer 6623, a weight read buffer 6624, a weight read buffer multiplexer 6625, a slave multiplexer 6691, a main 0 multiplexer 6693 and a main 1 multiplexer 6692. In one embodiment, there are three each of the data demultiplexer 6611, data write buffer 6612, data read buffer 6614, data read buffer multiplexer 6615, weight demultiplexer 6621, weight write buffer 6622, weight read buffer 6624 and weight read buffer multiplexer 6625, associated respectively with the slave interface 6301, main interface 0 6302-0 and main interface 1 6302-1 of the ring bus 4024. In one embodiment, there is a pair of each of the three data demultiplexers 6611, data write buffers 6612, data read buffers 6614, data read buffer multiplexers 6615, weight demultiplexers 6621, weight write buffers 6622, weight read buffers 6624 and weight read buffer multiplexers 6625, associated respectively with the slave interface 6301, main interface 0 6302-0 and main interface 1 6302-1 of the ring bus 4024, to support double-buffered data transfers.
The data demultiplexers 6611 are coupled to receive data blocks from the slave interface 6301, main interface 0 6302-0 and main interface 1 6302-1, respectively. The data demultiplexers 6611 are also respectively coupled to the data write buffers 6612, which are coupled to the data RAM multiplexer 6613, which is coupled to the data RAM 122, which is coupled to the data read buffers 6614, which are respectively coupled to the data read buffer multiplexers 6615, which are coupled to the slave multiplexer 6691, main 0 multiplexer 6693 and main 1 multiplexer 6692. The slave multiplexer 6691 is coupled to the slave interface 6301, the main 0 multiplexer 6693 is coupled to main interface 0 6302-0, and the main 1 multiplexer 6692 is coupled to main interface 1 6302-1. The weight demultiplexers 6621 are respectively coupled to receive data blocks from the slave interface 6301, main interface 0 6302-0 and main interface 1 6302-1. The weight demultiplexers 6621 are also respectively coupled to the weight write buffers 6622, which are coupled to the weight RAM multiplexer 6623, which is coupled to the weight RAM 124, which is coupled to the weight read buffers 6624, which are respectively coupled to the weight read buffer multiplexers 6625, which are coupled to the slave multiplexer 6691, main 0 multiplexer 6693 and main 1 multiplexer 6692. The data RAM multiplexer 6613 and weight RAM multiplexer 6623 are also coupled to the output register 1104 and the move register 5804. The data RAM 122 and weight RAM 124 are also respectively coupled to the move units 5802 and to the data mux-regs 208 and weight mux-regs 705 of the NPUs 126. The control/status register 127 is coupled to the slave interface 6301. The bus control memory 6636 is coupled to the slave interface 6301, sequencer 128, DMAC0 6602-0 and DMAC1 6602-1. The program memory 129 is coupled to the slave interface 6301 and sequencer 128. The sequencer 128 is coupled to the program memory 129, bus control memory 6636, NPUs 126, move units 5802 and output units 5808. DMAC0 6602-0 is also coupled to main interface 0 6302-0, and DMAC1 6602-1 is also coupled to main interface 1 6302-1.
The data write buffer 6612, data read buffer 6614, weight write buffer 6622 and weight read buffer 6624 are the width of the data RAM 122 and weight RAM 124, i.e., the width of the NPU 126 array, generically referred to herein as N. Thus, for example, in one embodiment there are 4096 NPUs 126, and the data write buffer 6612, data read buffer 6614, weight write buffer 6622 and weight read buffer 6624 are 4096 bytes wide, although other embodiments are contemplated in which N is a value other than 4096. The data RAM 122 and weight RAM 124 are written an entire N-word row at a time. The output register 1104, move register 5804 and data write buffer 6612 write the data RAM 122 via the data RAM multiplexer 6613, which selects one of them to write a row into the data RAM 122. The output register 1104, move register 5804 and weight write buffer 6622 write the weight RAM 124 via the weight RAM multiplexer 6623, which selects one of them to write a row into the weight RAM 124. Control logic (not shown) controls the data RAM multiplexer 6613 to arbitrate among the data write buffer 6612, move register 5804 and output register 1104 for access to the data RAM 122, and controls the weight RAM multiplexer 6623 to arbitrate among the weight write buffer 6622, move register 5804 and output register 1104 for access to the weight RAM 124. The data RAM 122 and weight RAM 124 are also read an entire N-word row at a time. The NPUs 126, move units 5802 and data read buffer 6614 read rows from the data RAM 122. The NPUs 126, move units 5802 and weight read buffer 6624 read rows from the weight RAM 124. The control logic also controls which, if any, of the NPUs 126 (data mux-regs 208 and weight mux-regs 705), move units 5802 and data read buffer 6614 reads a row of words output by the data RAM 122. In one embodiment, the micro-operation 3418 described with respect to Figure 34 may include at least some of the control logic signals that control the data RAM multiplexer 6613, weight RAM multiplexer 6623, NPUs 126, move units 5802, move register 5804, output register 1104, data read buffer 6614 and weight read buffer 6624.
The data write buffer 6612, data read buffer 6614, weight write buffer 6622 and weight read buffer 6624 are addressable in block-size-aligned blocks. Preferably, the block size of the data write buffer 6612, data read buffer 6614, weight write buffer 6622 and weight read buffer 6624 matches the width of the data sub-ring of the ring bus 4024. This suits the ring bus 4024 to perform reads/writes to the data/weight RAM 122/124 as follows. Typically, the ring bus 4024 performs a block-size write to each block of the data write buffer 6612, and once all the blocks of the data write buffer 6612 are filled, the data write buffer 6612 writes its entire N-word contents to a full row of the data RAM 122. Likewise, the ring bus 4024 performs a block-size write to each block of the weight write buffer 6622, and once all the blocks of the weight write buffer 6622 are filled, the weight write buffer 6622 writes its entire N-word contents to a full row of the weight RAM 124. In one embodiment, the NNU 121 includes a row address register (not shown) associated with each data/weight write buffer 6612/6622. When the ring stop 4004-N writes a block into the data/weight write buffer 6612/6622, the row address register is updated. However, before the row address register is updated, its current value is compared with the new value, and if the two values differ (i.e., a new row of the data RAM 122/weight RAM 124 is being written), this triggers the write of the data/weight write buffer 6612/6622 contents to the data RAM 122/weight RAM 124. In one embodiment, a write to the program memory 129 also triggers the write of the data/weight write buffer 6612/6622 contents to the data RAM 122/weight RAM 124. Conversely, an N-word row is read from the data RAM 122 into the data read buffer 6614; the ring bus 4024 then performs a block-size read from each block of the data read buffer 6614. Likewise, an N-word row is read from the weight RAM 124 into the weight read buffer 6624; the ring bus 4024 then performs a block-size read from each block of the weight read buffer 6624. Although the data RAM 122 and weight RAM 124 appear as dual-ported memories in Figure 45, preferably they are single-ported memories, such that the single data RAM 122 port is shared by the data RAM multiplexer 6613 and the data read buffer 6614, and the single weight RAM 124 port is shared by the weight RAM multiplexer 6623 and the weight read buffer 6624. An advantage of the full-row read/write arrangement is that, by being single-ported, the data RAM 122 and weight RAM 124 may be made smaller (in one embodiment, the weight RAM 124 is 8MB and the data RAM 122 is 64KB), and the ring bus 4024 consumes less of the data RAM 122 and weight RAM 124 write and read bandwidth than it would by writing individual blocks, thereby freeing up more bandwidth for the NPUs 126, output register 1104, move register 5804 and move units 5802 to make N-word-wide row accesses.
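The write-buffer aggregation behavior described above — accumulate ring-bus-width blocks, then commit a full row on fill or on a row-address change — can be sketched as follows. This is an illustrative Python model, not the patent's implementation; the 64-byte block size and 64-block row are the example values from the text (N = 4096 bytes).

```python
BLOCK_SIZE = 64                       # bytes per ring-bus data block
NUM_BLOCKS = 64                       # blocks per buffer (one RAM row)
ROW_BYTES = BLOCK_SIZE * NUM_BLOCKS   # 4096 bytes = N

class WriteBuffer:
    """Aggregates block-size ring-bus writes into full-row RAM writes."""
    def __init__(self, ram):
        self.ram = ram                        # dict: row address -> row bytes
        self.blocks = [None] * NUM_BLOCKS
        self.row_addr = None                  # models the row address register

    def write_block(self, row, block_idx, data):
        # A change of row address flushes the buffered row first, as the
        # row-address-register comparison in the text describes.
        if self.row_addr is not None and row != self.row_addr:
            self.flush()
        self.row_addr = row
        self.blocks[block_idx] = data
        if all(b is not None for b in self.blocks):
            self.flush()                      # all blocks filled: one full-row write

    def flush(self):
        if self.row_addr is not None:
            self.ram[self.row_addr] = b"".join(
                b if b is not None else bytes(BLOCK_SIZE) for b in self.blocks)
            self.blocks = [None] * NUM_BLOCKS
            self.row_addr = None
```

The point of the model is the bandwidth argument in the text: the RAM port sees one N-byte write per row rather than 64 separate block writes.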
The control/status register 127 is provided to the slave interface 6301. The slave multiplexer 6691 receives the output of the data read buffer multiplexer 6615 associated with the slave interface 6301 and the output of the weight read buffer multiplexer 6625 associated with the slave interface 6301 and selects one of them to provide to the slave interface 6301. In this manner, the slave load queue 6312 receives the data that responds to load requests made by the slave interface 6301 to the control/status register 127, data RAM 122 or weight RAM 124. The main 0 multiplexer 6693 receives the output of the data read buffer multiplexer 6615 associated with main interface 0 6302-0 and the output of the weight read buffer multiplexer 6625 associated with main interface 0 6302-0 and selects one of them to provide to main interface 0 6302-0. In this manner, main interface 0 6302-0 receives the data that responds to store requests made by the main interface 0 6302-0 store queue 6324. The main 1 multiplexer 6692 receives the output of the data read buffer multiplexer 6615 associated with main interface 1 6302-1 and the output of the weight read buffer multiplexer 6625 associated with main interface 1 6302-1 and selects one of them to provide to main interface 1 6302-1. In this manner, main interface 1 6302-1 receives the data that responds to store requests made by the main interface 1 6302-1 store queue 6324. If the slave interface 6301 load queue 6312 requests a read from the data RAM 122, the slave multiplexer 6691 selects the output of the data read buffer multiplexer 6615 associated with the slave interface 6301; and if the slave interface 6301 load queue 6312 requests a read from the weight RAM 124, the slave multiplexer 6691 selects the output of the weight read buffer multiplexer 6625 associated with the slave interface 6301. Likewise, if the main interface 0 6302-0 store queue requests a read of data from the data RAM 122, the main 0 multiplexer 6693 selects the output of the data read buffer multiplexer 6615 associated with main interface 0 6302-0; and if the main interface 0 6302-0 store queue requests a read of data from the weight RAM 124, the main 0 multiplexer 6693 selects the output of the weight read buffer multiplexer 6625 associated with main interface 0 6302-0. Finally, if the main interface 1 6302-1 store queue requests a read of data from the data RAM 122, the main 1 multiplexer 6692 selects the output of the data read buffer multiplexer 6615 associated with main interface 1 6302-1; and if the main interface 1 6302-1 store queue requests a read of data from the weight RAM 124, the main 1 multiplexer 6692 selects the output of the weight read buffer multiplexer 6625 associated with main interface 1 6302-1. Thus, a ring bus 4024 agent (e.g., a core 4002) may read from the control/status register 127, data RAM 122 or weight RAM 124 via the slave interface 6301 load queue 6312. Additionally, a ring bus 4024 agent (e.g., a core 4002) may write to the control/status register 127, data RAM 122, weight RAM 124, program memory 129 or bus control memory 6636 via the slave interface 6301 store queue 6314. More specifically, a core 4002 may write a program (e.g., a program that performs fully-connected, convolution, pooling, LSTM or other recurrent neural network layer computations) to the program memory 129 and then write the control/status register 127 to start the program. Additionally, a core 4002 may write control words to the bus control memory 6636 to cause a DMAC 6602 to perform DMA operations between the data RAM 122 or weight RAM 124 and a ring bus 4024 agent (e.g., system memory or the LLC 4005). The sequencer 128 may also write control words to the bus control memory 6636 to cause a DMAC 6602 to perform DMA operations between the data RAM 122 or weight RAM 124 and a ring bus 4024 agent. Finally, as described in more detail below, a DMAC 6602 may perform DMA operations to perform transfers between a ring bus 4024 agent (e.g., system memory or the LLC 4005) and the data/weight RAM 122/124.
The slave interface 6301, main interface 0 6302-0 and main interface 1 6302-1 are each coupled to provide data blocks to their respective data demultiplexer 6611 and weight demultiplexer 6621. Arbitration logic (not shown) arbitrates for access to the data RAM 122 among the output register 1104, move register 5804, and the data write buffers 6612 of the slave interface 6301, main interface 0 6302-0 and main interface 1 6302-1, and arbitrates for access to the weight RAM 124 among the output register 1104, move register 5804, and the weight write buffers 6622 of the slave interface 6301, main interface 0 6302-0 and main interface 1 6302-1. In one embodiment, the write buffers 6612/6622 have priority over the output register 1104 and move register 5804, and the slave interface 6301 has priority over the main interfaces 6302. In one embodiment, each data demultiplexer 6611 has 64 outputs (each preferably 64 bytes wide) coupled to the 64 blocks of its respective data write buffer 6612. The data demultiplexer 6611 provides a received block on the output coupled to the appropriate block of the data write buffer 6612. Likewise, each weight demultiplexer 6621 has 64 outputs (each preferably 64 bytes wide) coupled to the 64 blocks of its respective weight write buffer 6622. The weight demultiplexer 6621 provides a received block on the output coupled to the appropriate block of the weight write buffer 6622.
When the slave store queue 6314 provides a data block to its data/weight demultiplexer 6611/6621, the slave store queue 6314 also provides to the data/weight demultiplexer 6611/6621, as a control input, the address of the appropriate block of the data/weight write buffer 6612/6622 to be written. The block address is held in the lower six bits of the address in the entry 6422, which was specified by the ring bus 4024 agent (e.g., a core 4002) that generated the slave store transaction. Conversely, when the slave load queue 6312 requests a data block from its data/weight read buffer multiplexer 6615/6625, the slave load queue 6312 also provides to the data/weight read buffer multiplexer 6615/6625, as a control input, the address of the appropriate block of the data/weight read buffer 6614/6624 to be read. The block address is held in the lower six bits of the address in the entry 6412, which was specified by the ring bus 4024 agent (e.g., a core 4002) that generated the slave load transaction. Preferably, a core 4002 may perform slave store transactions via the slave interface 6301 (e.g., to a predetermined ring bus 4024 address) to cause the NNU 121 to write the contents of the data/weight write buffer 6612/6622 to the data/weight RAM 122/124; conversely, a core 4002 may perform slave store transactions via the slave interface 6301 (e.g., to a predetermined ring bus 4024 address) to cause the NNU 121 to read a row of the data/weight RAM 122/124 into the data/weight read buffer 6614/6624.
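For slave transactions, the block-select control input is simply the low six bits of the entry's address, which index one of the 64 block positions in a buffer. A minimal sketch, assuming the six bits are taken directly from the address field as stated (the exact address granularity is not spelled out here):

```python
NUM_BLOCKS = 64  # block positions per write/read buffer

def slave_block_index(entry_addr: int) -> int:
    """Extract the buffer block index from a slave transaction address:
    the lower six bits select one of the 64 blocks (assumed encoding)."""
    return entry_addr & (NUM_BLOCKS - 1)   # mask 0x3F keeps bits [5:0]
```

Six bits covering exactly 64 blocks is what makes the buffer fully addressable by an agent issuing block-granular slave loads and stores.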
When a main interface 6302 load queue 6322/6332 provides a data block to its data/weight demultiplexer 6611/6621, the main interface 6302 load queue 6322/6332 also provides the index of the entry 6512 to the corresponding DMAC 6602 that issued the load request. To transfer an entire 4KB of data from system memory to a row of the data/weight RAM 122/124, the DMAC 6602 must generate 64 master load requests to the load queue 6322/6332. The DMAC 6602 logically divides the 64 master load requests into four groups of 16 requests each. The DMAC 6602 makes the 16 requests of a group to the corresponding 16 entries 6512 of the load queue 6322/6332. The DMAC 6602 maintains state associated with each entry 6512 index. The state indicates which of the four groups is currently using the entry to load its data block. Thus, as described in more detail below, when the DMAC 6602 receives an entry 6512 index from the load queue 6322/6332, the logic of the DMAC 6602 constructs the block address by concatenating the group number with the index, and provides the constructed block address as a control input to the data/weight demultiplexer 6611/6621.
Conversely, when a main interface 6302 store queue 6324/6334 requests a data block from its data/weight read buffer multiplexer 6615/6625, the main interface 6302 store queue 6324/6334 also provides the index of the entry 6522 to the corresponding DMAC 6602 that issued the store request. To transfer an entire 4KB of data from a row of the data/weight RAM 122/124 to system memory, the DMAC 6602 must generate 64 master store requests to the store queue 6324/6334. The DMAC 6602 logically divides the 64 store requests into four groups of 16 requests each. The DMAC 6602 makes the 16 requests of a group to the corresponding 16 entries 6522 of the store queue 6324/6334. The DMAC 6602 maintains state associated with each entry 6522 index. The state indicates which of the four groups is currently using the entry to store its data block. Thus, as described in more detail below, when the DMAC 6602 receives an entry 6522 index from the store queue 6324/6334, the logic of the DMAC 6602 constructs the block address by concatenating the group number with the index, and provides the constructed block address as a control input to the data/weight read buffer multiplexer 6615/6625.
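The block-address construction used for both master loads and master stores can be made concrete: 64 requests are split into 4 groups of 16, so a 2-bit group number concatenated with a 4-bit entry index yields a unique 6-bit block address. The sketch below assumes the group number occupies the upper bits, which is one natural reading of "concatenating the group number with the index."

```python
ENTRIES = 16   # queue entries per main interface load/store queue
GROUPS = 4     # 64 requests / 16 entries

def block_address(group: int, entry_index: int) -> int:
    """Construct the 6-bit block address from a 2-bit group number and a
    4-bit entry index (assumed layout: group in bits [5:4], index in [3:0])."""
    assert 0 <= group < GROUPS and 0 <= entry_index < ENTRIES
    return (group << 4) | entry_index
```

Because each of the 64 (group, index) pairs maps to a distinct address, every 64-byte block of the 4KB row lands in (or is read from) a distinct buffer block even though only 16 queue entries exist.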
Referring now to Figure 46, a block diagram is shown illustrating a ring bus coupling embodiment of the NNU 121. Figure 46 is identical to Figure 34 in some respects, and like-numbered elements are alike. As in Figure 34, Figure 46 illustrates the ability of the NNU 121 to receive micro-operations from multiple sources for provision to its pipeline. However, in the embodiment of Figure 46, the NNU 121 is coupled to the cores 4002 via the ring bus 4024, as in Figure 41, and the differences will now be described.
In the embodiment of Figure 46, the multiplexer 3402 receives micro-operations from five different sources. The multiplexer 3402 provides the selected micro-operation 3418 to the NPU 126 pipeline stages 3401, the data RAM 122 and weight RAM 124, the move units 5802 and the output units 5808 to control them, as described above. The first source, as described with respect to Figure 34, is the sequencer 128, which generates micro-operation 3416. The second source is a modified version of the decoder 3404 of Figure 34, which receives data blocks of store requests from the store queue 6314 of the slave interface 6301 stored to by a core 4002. As described with respect to Figure 34, the data blocks may include information similar to that of a micro-instruction translated from an MTNN instruction 1400 or MFNN instruction 1500. The decoder 3404 decodes the data block and in response generates a micro-operation 3412. Another example is a micro-operation 3412 generated in response to a request received from the slave interface 6301 store queue 6314 to write data to the data/weight RAM 122/124, or in response to a request received from the slave interface 6301 load queue 6312 to read data from the data/weight RAM 122/124. The third source is direct data blocks of store requests from the store queue 6314 of the slave interface 6301 stored to by a core 4002, where the data blocks include a micro-operation 3414 that the NNU 121 executes directly, as described with respect to Figure 34. Preferably, the core 4002 stores to different memory-mapped addresses in the ring bus 4024 address space to enable the decoder 3404 to distinguish the second micro-operation source from the third micro-operation source. The fourth source is micro-operations 7217 generated by the DMACs 6602. The fifth source is a no-op micro-operation 7219, in response to which the NNU 121 maintains its state.
In one embodiment, the five sources have a priority scheme enforced by the decoder 3404, in which the direct micro-operation 3414 has highest priority; the micro-operation 3412 generated by the decoder 3404 in response to a slave interface 6301 slave store operation has second-highest priority; the micro-operation 7217 generated by a DMAC 6602 has next-highest priority; the micro-operation 3416 generated by the sequencer 128 has next-highest priority; and the no-op micro-operation is the default (i.e., lowest priority), which the multiplexer 3402 selects when no other source is requesting. According to one embodiment, when a DMAC 6602 or the slave interface 6301 needs to access the data RAM 122 or weight RAM 124, it takes priority over the program running on the sequencer 128, and the decoder 3404 pauses the sequencer 128 until the DMAC 6602 and slave interface 6301 have completed their accesses.
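The fixed-priority selection among the five micro-operation sources can be sketched as a simple priority encoder. This is an illustrative Python model of the described scheme; the source names are shorthand labels, not patent terminology.

```python
# Highest priority first, per the described scheme for mux 3402.
PRIORITY = [
    "direct",     # micro-op 3414 stored directly by a core
    "decoder",    # micro-op 3412 generated from decoded slave stores/loads
    "dmac",       # micro-op 7217 generated by a DMAC 6602
    "sequencer",  # micro-op 3416 from the program running on sequencer 128
]

def select_micro_op(requesting: dict) -> str:
    """Return the highest-priority source currently requesting;
    default to the no-op micro-operation when none is requesting."""
    for source in PRIORITY:
        if requesting.get(source):
            return source
    return "no-op"
```

The default no-op branch models the fifth source: when nothing requests the pipeline, the NNU simply maintains its state.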
Conditional interrupts to the core
As described above, the NNU 121 may generate an interrupt request to a core 4002. The interrupt request may notify the core 4002 that an event has occurred so that the core 4002 can act accordingly. The following are examples of events associated with an interrupt request from the NNU 121 to a core 4002.
First, the NNU 121 may have generated results (e.g., the output of a neural network layer) for the core 4002 to access. The results may be available in the data RAM 122 or weight RAM 124 for the core 4002 to read; alternatively, the NNU 121 may have transferred the results to system memory (e.g., using one of the DMACs 6602). Second, the currently running program may need the core 4002 to provide more input data to be processed on the NNU 121, e.g., for a new time step (e.g., of a recurrent neural network (RNN) such as an LSTM layer), for a new network layer, or for a different set of nodes within the current network layer (e.g., where the number of neurons in the layer exceeds the number of NPUs 126 in the NNU 121). For example, the core 4002 may provide the data by writing it to the data RAM 122/weight RAM 124; alternatively, the core 4002 may notify the NNU 121 of the location of the data in system memory so that the NNU 121 can transfer the data from system memory to the data RAM 122/weight RAM 124. Third, the core 4002 may need to provide more weights for the neural network computations, e.g., for a new network layer or for a different set of nodes within the current network layer (e.g., where the number of neurons in the layer exceeds the number of NPUs 126 in the NNU 121). The core 4002 may likewise provide the weights by writing them to the data RAM 122/weight RAM 124; alternatively, the core 4002 may notify the NNU 121 of the location of the weights in system memory so that the NNU 121 can transfer the weights from system memory to the data RAM 122/weight RAM 124. Fourth, the core 4002 may need to provide a new program for the NNU 121 to execute, i.e., a new program to be loaded into the program memory 129. The core 4002 may likewise provide the program by writing it to the program memory 129; alternatively, the core 4002 may notify the NNU 121 of the location of the program in system memory so that the NNU 121 can transfer the program from system memory to the program memory 129. Fifth, the NNU 121 may simply need to let the core 4002 know that the program has completed.
A problem is that, in many systems, depending upon the instruction set architecture of the core 4002 and/or the operating system running on the core 4002, there may be a relatively large interrupt latency, where the interrupt latency is the time required from when a system device such as the NNU 121 generates an interrupt request for the core 4002 until the core 4002 reads the state of the device to determine the occurrence of the event associated with the interrupt request. The interrupt latency may cause the device to be idle for a relatively large amount of time. Such relatively low utilization of the device may be particularly detrimental to the performance of the system, especially where the device is an accelerator, such as a neural network computation accelerator like the NNU 121.
Embodiments are described below that reduce this latency by enabling the NNU 121 to execute, preferably when a program starts, a set-interrupt-condition instruction that establishes an interrupt condition which, when satisfied, causes the NNU 121 to interrupt the core 4002 while the program continues to run. Preferably, the interrupt condition can be dynamically programmed using a combination of values of the operating state of the NNU 121. In the various cases described above, the NNU 121 is made to interrupt the core 4002 a certain number of clock cycles before the interrupt-request-associated event, approximately the number of clocks from the time the NNU 121 asserts the interrupt request signal until execution of the first instruction of the interrupt service routine that accesses the NNU 121 to respond to the interrupt-request-associated event (for example, reading a status register in the NNU 121 to determine whether the event has actually completed, or beginning to write/read data/weights to/from the data RAM 122/weight RAM 124, or sending the NNU 121 a pointer to the address of the data/weights to be read/written with respect to the data RAM 122/weight RAM 124, or beginning to write a program into the program memory 129, and so on).
Referring now to Figure 47, a block diagram is shown that illustrates an embodiment of the NNU 121. The NNU 121 is similar in many respects to the embodiments of the NNU 121 described above, and like-numbered elements are similar; differences are described herein. In particular, the NNU 121 includes the data RAM 122, weight RAM 124, program memory 129, sequencer 128 and array of NPUs 126 described above. In addition, the NNU 121 includes an interrupt condition register 4706, a status register 4704 and control logic 4702. The interrupt condition register 4706 and the status register 4704 are coupled to the sequencer 128 and the control logic 4702. The status register 4704 holds the state of the NNU 121 during its operation. The state may include various fields, an embodiment of which is described in more detail below with respect to Figure 49. A set-interrupt-condition instruction 4722 stored in the program memory 129 and fetched by the sequencer 128 writes an interrupt condition into the interrupt condition register 4706. An embodiment of the interrupt condition is described in more detail below with respect to Figure 48; the interrupt condition may include a combination of various fields corresponding to the state fields, and the combination may be selected by each set-interrupt-condition instruction 4722. As described in more detail below, the control logic 4702 has an output on which it generates an interrupt request 4712 to the processing core 4002 when the state satisfies the interrupt condition. Although Figure 47 shows only the sequencer 128 updating the status register 4704, the fields of the status register 4704 may also be updated by the operation of other elements of the NNU 121. Preferably, the NNU 121 includes multiple interrupt condition registers 4706 so that the calling program can set up multiple interrupt conditions.
Referring now to Figure 48, a block diagram is shown that illustrates the interrupt condition register 4706 of Figure 47 in more detail. The interrupt condition register 4706 includes the following fields: weight RAM write address 4802, weight RAM read address 4804, data RAM write address 4806, data RAM read address 4808, program counter 4812, loop count 4814 and iteration count 4816 (also referred to as repeat count 4816). Each of these fields has a corresponding valid bit, denoted V in Figure 48. As described in more detail below, when determining whether the state held in the status register 4704 satisfies the interrupt condition, the control logic 4702 considers only those fields of the status register 4704 that correspond to fields whose valid bit is set in the interrupt condition register 4706.
As described in more detail below, the weight RAM write address 4802, weight RAM read address 4804, data RAM write address 4806, data RAM read address 4808, program counter 4812, loop count 4814 and iteration count 4816 fields correspond to the similarly-named fields of the status register 4704. Preferably, for example with respect to block 5012 of Figure 50, the state held in the status register 4704 satisfies the interrupt condition specified in the interrupt condition register 4706 if, for each field whose valid bit is set, the value of the interrupt condition field in the interrupt condition register 4706 matches the value of the corresponding state field in the status register 4704. In this way, the interrupt condition may be viewed as the combined value of the valid fields of the interrupt condition register 4706.
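The valid-bit-masked comparison described above can be sketched as a minimal software model (this is an illustration only, not the patent's hardware; the field names follow Figures 48 and 49):

```python
# Minimal software model of the interrupt-condition match described above.
# Only fields whose valid bit (V) is set in the interrupt condition register
# participate in the comparison; all other fields are "don't care".

FIELDS = (
    "weight_ram_write_addr", "weight_ram_read_addr",
    "data_ram_write_addr", "data_ram_read_addr",
    "program_counter", "loop_count", "iteration_count",
)

def condition_satisfied(status: dict, condition: dict, valid: dict) -> bool:
    """True if every valid condition field equals the corresponding status field."""
    return all(
        status[f] == condition[f]
        for f in FIELDS
        if valid.get(f, False)
    )

# Example modeled on Figure 51: interrupt when PC == 3, loop count == 1,
# repeat (iteration) count == 86; the RAM address fields are don't-cares.
valid = {"program_counter": True, "loop_count": True, "iteration_count": True}
condition = {"program_counter": 3, "loop_count": 1, "iteration_count": 86}

status = {f: 0 for f in FIELDS}
status.update(program_counter=3, loop_count=1, iteration_count=86)
print(condition_satisfied(status, condition, valid))  # True
```

Note that because the RAM address fields have their valid bits clear, they are excluded from the comparison entirely, which is what allows a program to trigger on, say, a program counter value alone.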
Referring now to Figure 49, a block diagram is shown that illustrates the status register 4704 of Figure 47 in more detail. The status register 4704 includes the following fields: weight RAM write address 4902, weight RAM read address 4904, data RAM write address 4906, data RAM read address 4908, program counter 4912 (also referred to as program memory address 4912), loop count 4914 and iteration count 4916 (also referred to as repeat count 4916). The weight RAM write address 4902, weight RAM read address 4904, data RAM write address 4906, data RAM read address 4908, program counter 4912, loop count 4914 and iteration count 4916 maintain values that indicate the operating state of the NNU 121; that is, these fields are updated during the operation of the NNU 121. Preferably, the field values are the same as those of the status register 127 described above with respect to Figure 39. The weight RAM write address 4902 specifies the address (e.g., address 125 of Figure 47) of the most recently written row of the weight RAM 124. The weight RAM read address 4904 specifies the address (e.g., address 125 of Figure 47) of the most recently read row of the weight RAM 124. The data RAM write address 4906 specifies the address (e.g., address 123 of Figure 47) of the most recently written row of the data RAM 122. The data RAM read address 4908 specifies the address (e.g., address 123 of Figure 47) of the most recently read row of the data RAM 122. The data RAM 122 and weight RAM 124 may be read and written by the NPUs 126 or by the ring stop 4004-N, for example via the output registers 1104, the move registers 5804, the move units 5802, the write buffers 6612/6622 and/or the read buffers 6614/6624. The program counter 4912 specifies the address (e.g., address 131 of Figure 47, e.g., the value of the program counter 3802 of Figure 38) from which the sequencer 128 most recently fetched an instruction from the program memory 129. The loop count 4914 indicates the number of times a loop of the program remains to be executed (e.g., the value of the loop counter 3804 of Figure 38). The iteration count 4916 indicates the number of times the operation specified in the current program instruction remains to be performed (e.g., the value of the iteration counter 3806 of Figure 38).
Referring now to Figure 50, a flowchart is shown that illustrates the operation of the NNU 121 of Figure 47 to generate an interrupt request to the core 4002 based on a condition. Flow begins at block 5002.
At block 5002, the sequencer 128 fetches the set-interrupt-condition instruction 4722 from the program memory 129 at the address 131 held in the program counter 3802 and decodes the instruction 4722. The set-interrupt-condition instruction 4722 specifies an interrupt condition. Flow proceeds to block 5004.
At block 5004, in response to the decoding of the set-interrupt-condition instruction 4722, the sequencer 128 generates a micro-operation 3416 that writes the interrupt condition specified by the set-interrupt-condition instruction 4722 into the interrupt condition register 4706. Flow proceeds to block 5006.
At block 5006, the sequencer 128 continues to fetch instructions from the program memory 129, decode them, and generate the micro-operations 3416 executed by the NPUs 126. This causes the state of the NNU 121 to change, including updates to the status register 4704 (e.g., updates to the program counter 4912, loop count 4914, iteration count 4916 and data/weight RAM read/write addresses 4908/4906/4904/4902). Flow proceeds to block 5008.
At block 5008, the control logic 4702 monitors the status register 4704 to check whether the state of the status register 4704 satisfies the interrupt condition specified in the interrupt condition register 4706. Preferably, the state held in the status register 4704 satisfies the interrupt condition if, for each field of the interrupt condition register 4706 whose valid bit is set, the value of the interrupt condition field matches the value of the corresponding state field in the status register 4704. Flow proceeds to decision block 5012.
At decision block 5012, if the state satisfies the interrupt condition, flow proceeds to block 5014; otherwise, flow returns to block 5008.
At block 5014, the control logic 4702 generates the interrupt request 4712 to the core 4002. Flow ends at block 5014.
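The monitoring portion of the flow (blocks 5006 through 5014) can be sketched as a simple scan over successive status-register snapshots, one per clock (a software illustration under the same assumptions as above, not the hardware itself):

```python
def run_until_interrupt(condition, states):
    """Model of blocks 5006-5014: scan successive status-register snapshots
    (one per clock) and return the clock index at which the interrupt request
    4712 would be generated, or None if the condition is never satisfied."""
    for clock, status in enumerate(states):
        # Blocks 5008/5012: compare the snapshot against the latched condition.
        if all(status.get(field) == value for field, value in condition.items()):
            return clock  # block 5014: assert the interrupt request
    return None

# Hypothetical trace: the loop count decrements 3, 2, 1 over three clocks.
trace = [{"loop_count": n} for n in (3, 2, 1)]
print(run_until_interrupt({"loop_count": 1}, trace))  # 2
```

In hardware the comparison runs continuously in parallel with execution, so the program itself never stalls; the scan here merely models when the compare first succeeds.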
Referring now to Figure 51, a table is shown that illustrates a program stored in the program memory 129 of the NNU 121 of Figure 47 and executed by the NNU 121. The exemplary program performs computations associated with a layer of an artificial neural network as described above. For example, the program may perform multiply-accumulate computations associated with a fully-connected neural network layer, in which 30 different instances of the input data of the layer are stored in rows 0 through 29 of the data RAM 122, the associated weights are stored in rows 0 through 511 of the weight RAM 124, and the NNU 121 writes the 30 outputs of the layer, corresponding to the 30 input data instances, into rows 30 through 59 of the data RAM 122. To accomplish this, the program includes a loop body that is executed 30 times, as specified by the loop count of 30 in the initialize instruction at address 0. Similarly to the description above with respect to Figure 4 (except that in the example of Figure 4 different data RAM 122 addresses are used and there is no loop), each execution instance of the loop body performs 512 multiply-accumulate operations and outputs the results to a different row of the data RAM 122. In this example, it is assumed that the interrupt latency of the core 4002 is approximately 600 clock cycles, and it is desired that the core 4002 begin reading the 30 outputs from the data RAM 122 immediately after the 30th output is written into the data RAM 122 (referred to herein as the interrupt-request-associated event). That is, it is desired that the NNU 121 generate the interrupt request 4712 to the core 4002 approximately 600 clocks before the interrupt-request-associated event. Advantageously, the program includes a set-interrupt-condition instruction 4722 (at address 1) that specifies an interrupt condition such that the NNU 121 generates the interrupt request 4712 to the core 4002 approximately 600 clock cycles before the interrupt-request-associated event (i.e., the writing of the 30th output into row 59 of the data RAM 122).
At address 0, the initialize instruction specifies a loop count value of 30, which is the number of times each NPU 126 executes the loop body comprising the instructions at addresses 2 through 4. The loop instruction at the end of the loop (address 5) decrements the loop count 4914 value and, if the result is non-zero, causes control to return to the top of the loop body (i.e., to the instruction at address 2). Preferably, the initialize instruction also clears the accumulator 202. Preferably, the loop instruction at address 5 also clears the accumulator 202. Alternatively, the multiply-accumulate instruction at address 2 may specify that the accumulator 202 be cleared. The initialize instruction also initializes the data RAM 122 row to zero and the data RAM 122 output row to 30, so that the 30 execution instances of the loop body read rows 0 through 29 and write rows 30 through 59, respectively.
At address 1, the set-interrupt-condition instruction 4722 loads the interrupt condition register 4706 with the condition specified by the set-interrupt-condition instruction 4722. In the example of Figure 51, the condition is the combination of the following: the program counter 4912 equals the value of LABEL 1 (which is 3, i.e., the address of the multiply-accumulate instruction at LABEL 1), the loop count 4914 equals 1, and the repeat count 4916 equals 86. As will be described, these values cause the NNU 121 to generate the interrupt request 4712 to the core 4002 approximately 600 clocks before the occurrence of the interrupt-request-associated event, in which the NNU 121 writes the result of the last (30th) execution instance of the loop body into row 59 of the data RAM 122.
At addresses 2 and 3, as explained in detail above, for example in a manner similar to that described above with respect to Figure 4, the multiply-accumulate instructions perform a total of 512 multiply-accumulate operations on a single row of data read from the data RAM 122 and rotated among the NPUs 126, together with 512 rows of weights read from 512 different rows of the weight RAM 124, to generate results accumulated into the accumulators 202 of the NPUs 126. More specifically, the instruction at address 3 specifies a repeat count of 511.
At address 4, the output instruction writes the multiply-accumulate accumulation results to the current output row of the data RAM 122 (which is 30 on the first execution instance of the loop and 59 on the last, i.e., 30th, execution instance). In one embodiment, the output instruction performs an activation function on the accumulator 202 value before writing the result to the data RAM 122.
As may be appreciated from the program of Figure 51 and the description above, preferably each execution instance of the loop body takes approximately 514 clock cycles (the multiply-accumulate at address 2 takes 1 clock, the multiply-accumulate at address 3 takes 511 clocks, the output instruction at address 4 takes 1 clock, and the loop instruction at address 5 takes 1 clock). In this example, it is assumed that the interrupt latency of the core 4002 is approximately 600 clock cycles. Consequently, and advantageously, the control logic 4702 generates the interrupt request 4712 to the core 4002 approximately 600 clock cycles before the interrupt-request-associated event (i.e., the writing of the 30th output into row 59 of the data RAM 122). This is because, when the program counter 4912 equals 3, the loop count 4914 equals 1, and the repeat count 4916 equals 86 (i.e., when the control logic 4702 generates the interrupt request 4712), the NNU 121 will typically spend 86 more clocks executing the last 86 iterations of the multiply-accumulate at address 3; then 2 clocks executing the instructions at addresses 4 and 5; then, in the last instance of the loop, 1 clock executing the instruction at address 2; then 511 clocks on the instruction at address 3; and then 1 clock on the instruction at address 4 (which writes the 30th output into row 59 of the data RAM 122), for a total of approximately 600 clocks. In this example, it is assumed that the clock periods of the NNU 121 and the core 4002 are the same. However, other embodiments are contemplated in which the two clock periods differ; in such embodiments, the interrupt condition values are chosen to take into account the difference between the clock periods of the NNU 121 and the core 4002.
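The approximately-600-clock lead time in this example is simple arithmetic over the per-instruction clock counts quoted above, which can be checked directly (a sketch using the counts stated in the text):

```python
# Remaining clocks from the trigger point (program counter == 3, loop
# count == 1, repeat count == 86) until the 30th output is written
# (Figure 51 example).
remaining = (
    86     # last 86 iterations of the multiply-accumulate at address 3
    + 1    # output instruction at address 4
    + 1    # loop instruction at address 5
    + 1    # multiply-accumulate at address 2, final loop instance
    + 511  # multiply-accumulate at address 3, final loop instance
    + 1    # output instruction at address 4 writes the 30th output
)
print(remaining)  # 601, i.e. approximately the assumed 600-clock latency
```

A program designer targeting a different interrupt latency would adjust the repeat count in the condition accordingly; only the 86-iteration term changes when the trigger point moves within the multiply-accumulate at address 3.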
Referring now to Figure 52, a set-interrupt-condition instruction 4722 is shown that is stored in the program memory 129 of the NNU 121 of Figure 47 and executed by the NNU 121 according to an alternative embodiment. The set-interrupt-condition instruction 4722 of Figure 52 may be substituted at address 1 of the program of Figure 51 to accomplish a result similar to that of Figure 51 by using a different interrupt condition. The interrupt condition specified in Figure 52 is that the data RAM write address 4906 equals 57 and the repeat count 4916 equals 86. During the 28th execution instance of the loop, the output instruction at address 4 writes the 28th output into row 57 of the data RAM 122. Then, during the 29th execution instance of the loop, during the execution of the instruction at address 3, the repeat count 4916 will be decremented to the value 86, which will cause the control logic 4702 to generate the interrupt request 4712 to the core 4002. This will be approximately 600 clock cycles before the interrupt-request-associated event (i.e., the writing of the 30th output into row 59 of the data RAM 122). This is because, when the data RAM write address 4906 equals 57 and the repeat count 4916 equals 86 (i.e., when the control logic 4702 generates the interrupt request 4712), the NNU 121 will typically spend 86 more clocks executing the last 86 iterations of the multiply-accumulate at address 3; then 2 clocks executing the instructions at addresses 4 and 5; then, in the last instance of the loop, 1 clock executing the instruction at address 2; then 511 clocks on the instruction at address 3; and then 1 clock on the instruction at address 4 (which writes the 30th output into row 59 of the data RAM 122), for a total of approximately 600 clocks.
Referring now to Figure 53, a table is shown that illustrates a program stored in the program memory 129 of the NNU 121 of Figure 47 and executed by the NNU 121. The exemplary program performs computations associated with a layer of an artificial neural network as described above. For example, as described with respect to the program of Figure 26, the program may perform convolutions of a data matrix with a convolution kernel (e.g., each of the convolution kernels of Figure 24) and write the results back to the weight RAM 124. To accomplish this, the program includes a loop body that is executed 400 times, as specified by the loop count of 400 in the initialize instruction at address 0. Each execution instance of the loop body performs 9 multiply-accumulate operations (the instructions at addresses 2 through 7), outputs the result to a different row of the weight RAM 124 (the instruction at address 8), decrements the weight RAM 124 row register (the instruction at address 9), and loops back to the top of the loop body (the instruction at address 10). Hence, each execution instance of the loop body takes approximately 12 clock cycles.
Again, in this example it is assumed that the interrupt latency of the core 4002 is approximately 600 clock cycles, and it is desired that the core 4002 begin reading the 400 outputs from the weight RAM 124 immediately after the 400th output is written into the weight RAM 124 (referred to herein as the interrupt-request-associated event). That is, it is desired that the NNU 121 generate the interrupt request 4712 to the core 4002 approximately 600 clocks before the interrupt-request-associated event. Advantageously, the program includes a set-interrupt-condition instruction 4722 (at address 1) that specifies an interrupt condition such that the NNU 121 generates the interrupt request 4712 to the core 4002 approximately 600 clock cycles before the interrupt-request-associated event.
In the example of Figure 53, the set-interrupt-condition instruction 4722 loads the interrupt condition register 4706 with the following interrupt condition: the loop count 4914 equals 50. This interrupt condition value causes the NNU 121 to generate the interrupt request 4712 to the core 4002 approximately 600 clocks before the occurrence of the interrupt-request-associated event, in which the NNU 121 writes the result of the last (400th) execution instance of the loop body into the weight RAM 124. This is because, when the loop count 4914 equals 50 (i.e., when the control logic 4702 generates the interrupt request 4712), the NNU 121 will typically spend 12 clocks on each of the remaining loop execution instances, for a total of approximately 600 clocks.
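Conversely, given a desired lead time and a known per-iteration cost, the loop-count trigger value used in this example can be derived as follows (a sketch; the 12-clock loop body and 600-clock latency are the assumptions stated above, and the helper name is illustrative):

```python
import math

def loop_count_trigger(lead_clocks: int, clocks_per_iteration: int) -> int:
    """Loop-count value at which to raise the interrupt so that roughly
    `lead_clocks` clocks remain before the final output is written."""
    return math.ceil(lead_clocks / clocks_per_iteration)

print(loop_count_trigger(600, 12))  # 50, as in the Figure 53 example
```

Rounding up errs on the side of interrupting slightly early, which, as discussed below, the core can tolerate by polling the status register before proceeding.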
An advantage of the described embodiments is that they give the program designer the ability to know as accurately as possible the number of clocks before the occurrence of an NNU 121 interrupt-request-associated event at which the NNU 121 generates the interrupt request. This is particularly advantageous in view of the facts that: (1) a single NNU 121 instruction may take thousands of clocks to complete (e.g., one having a large repeat count), and (2) an NNU 121 program may have a relatively large loop with a large loop count, yet need to generate an interrupt request while the program is still looping; that is, the interrupt request needs to be generated on a particular iteration of the loop (i.e., before all the loop iterations have completed), and in some cases at a particular point within that particular iteration of the loop.
It should be appreciated that in some cases the interrupt condition might be set such that the NNU 121 interrupts the core 4002 "too early." That is, at the end of the interrupt latency, the event associated with the interrupt request may not yet have occurred. (However, the operating system (e.g., a device driver) can still operate correctly by reading the status register 127 of the NNU 121 to determine that the event has occurred before proceeding. As described above, the NNU 121 includes the status register 127, which the core 4002 can read to determine the state of the NNU 121, e.g., to determine whether the event associated with the interrupt request has occurred.) This may result in some degree of inefficient utilization of the core 4002. Advantageously, therefore, the program designer can customize the interrupt condition to tailor the amount of time before the interrupt-request-associated event at which the NNU 121 generates the interrupt request 4712, according to whether reducing wasted utilization of the NNU 121 or reducing wasted utilization of the core 4002 is more critical. That is, if the core 4002 is considered the more critical resource, the program designer can cause the NNU 121 to generate the interrupt request 4712 fewer clock cycles before the interrupt-request-associated event than the interrupt latency of the core 4002/operating system; conversely, if the NNU 121 is considered the more critical resource, the program designer can cause the NNU 121 to generate the interrupt request 4712 more clock cycles before the interrupt-request-associated event than the interrupt latency. Another advantage is that this can improve the productivity of the program designer developing programs to run on the NNU 121.
Although embodiments have been described in which the device that generates an interrupt request based on a condition is a neural network unit, in other embodiments the device may be another programmable device capable of executing a program. For example, embodiments are contemplated in which the device is an encryption/decryption unit, a compression/decompression unit, a multimedia encoder/decoder unit, a database indexing unit, or a graphics processing unit.
While various embodiments of the present invention have been described herein, they have been presented by way of example and not limitation. It will be apparent to persons skilled in the relevant computer arts that various changes in form and detail can be made therein without departing from the scope of the invention. For example, software can enable the function, fabrication, modeling, simulation, description and/or testing of the apparatus and methods described herein. This can be accomplished through the use of general programming languages (e.g., C, C++), hardware description languages (HDL) including Verilog HDL, VHDL, and so on, or other available programs. Such software can be disposed in any known computer usable medium such as magnetic tape, semiconductor, magnetic disk, or optical disc (e.g., CD-ROM, DVD-ROM, etc.), a network, wire line, or other communications medium. Embodiments of the apparatus and method described herein may be included in a semiconductor intellectual property core, such as a processor core (e.g., embodied or specified in HDL), and transformed into hardware in the production of integrated circuits. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software. Thus, the present invention should not be limited by any of the exemplary embodiments described herein, but should be defined only in accordance with the following claims and their equivalents. Specifically, the present invention may be implemented within a processor device that may be used in a general-purpose computer. Finally, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the scope of the invention as defined by the appended claims.
Cross reference to related applications
This application is related to the following U.S. non-provisional applications, each of which is hereby incorporated by reference in its entirety.
Each of the above non-provisional applications claims priority based on the following U.S. provisional applications, each of which is hereby incorporated by reference in its entirety.
This application is also related to the following U.S. non-provisional applications, each of which is hereby incorporated by reference in its entirety.
This application is also related to the following U.S. non-provisional applications, each of which is hereby incorporated by reference in its entirety.

Claims (18)

1. A programmable device, comprising:
an output for generating an interrupt request to a processing core coupled to the device;
a program memory for holding instructions of a program fetched and executed by the device;
a data memory for holding data processed by the instructions;
a status register for holding a state of the device updated during operation of the device, the state having fields that include:
a program memory address at which a most recent instruction was fetched from the program memory;
a data memory access address at which the device most recently accessed data in the data memory; and
a repeat count that indicates a number of times an operation specified in a current program instruction remains to be performed;
a condition register having condition fields that correspond to the state fields held in the status register, wherein the condition register is writable, via an instruction of the program, with a condition comprising one or more of the condition fields; and
control logic for generating, on the output, the interrupt request to the processing core in response to detecting that the state held in the status register satisfies the condition specified in the condition register.
2. The device according to claim 1, wherein:
the condition fields each have a corresponding valid bit; and
the state held in the status register satisfies the condition specified in the condition register if, for each condition field in the condition register whose valid bit is set, the value of the condition field matches the value of the corresponding state field in the status register.
3. The device according to claim 1, wherein:
the state further has fields that include:
a loop count that indicates a number of times a loop of the program remains to be executed.
4. The device according to claim 1, further comprising:
a weight memory for holding weights associated with neural network computations;
wherein the state further has fields that include:
a weight memory access address at which the device most recently accessed a weight in the weight memory.
5. The device according to claim 4, wherein:
the weight memory access address comprises the following two addresses:
a weight memory read address at which the device most recently read a weight from the weight memory; and
a weight memory write address at which the device most recently wrote a weight to the weight memory.
6. The device according to claim 1, wherein:
the device comprises a neural network unit for accelerating computations associated with neural networks.
7. The device according to claim 1, further comprising:
a ring stop for coupling the device to a ring bus to which the processing core is also coupled.
8. The device according to claim 1, wherein:
the data memory access address comprises the following two addresses:
a data memory read address at which the device most recently read data from the data memory; and
a data memory write address at which the device most recently wrote data to the data memory.
9. A method for operating a device, the device comprising: a program memory for holding instructions of a program fetched and executed by the device; a data memory for holding data processed by the instructions; and a status register for holding a state of the device updated during operation of the device, wherein the state has fields that include: a program memory address at which a most recent instruction was fetched from the program memory; a data memory access address at which the device most recently accessed data in the data memory; and a repeat count that indicates a number of times an operation specified in a current program instruction remains to be performed; the device further comprising a condition register having condition fields that correspond to the state fields held in the status register, the method comprising:
writing, via an instruction of the program, a condition comprising one or more of the condition fields to the condition register; and
generating an interrupt request to a processing core in response to detecting that the state held in the status register satisfies the condition specified in the condition register.
10. The method according to claim 9, wherein:
the condition fields each have a corresponding valid bit; and
the state held in the status register satisfies the condition specified in the condition register if, for each condition field in the condition register whose valid bit is set, the value of the condition field matches the value of the corresponding state field in the status register.
11. The method according to claim 9, wherein:
the state further has fields that include:
a loop count that indicates a number of times a loop of the program remains to be executed.
12. The method according to claim 9, wherein:
the device further comprises a weight memory for holding weights associated with neural network computations; and
the state further has fields that include:
a weight memory access address at which the device most recently accessed a weight in the weight memory.
13. The method according to claim 12, wherein:
the weight memory access address comprises the following two addresses:
a weight memory read address at which the device most recently read a weight from the weight memory; and
a weight memory write address at which the device most recently wrote a weight to the weight memory.
14. The method according to claim 9, wherein the device comprises a neural network unit for accelerating computations associated with a neural network.
15. The method according to claim 9, wherein the device further comprises:
a ring stop that couples the device to a ring bus to which the processing core is also coupled.
16. The method according to claim 9, wherein the data memory access address comprises the following two addresses:
a data memory read address at which the device most recently read data from the data memory; and
a data memory write address at which the device most recently wrote data to the data memory.
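Claims 13 and 16 follow the same pattern: the device state splits a last-access address into a separate last-read and last-write address. A minimal sketch of such bookkeeping, with hypothetical names (this models the state fields only, not the patent's hardware):

```python
class TrackedMemory:
    """A memory (data or weight) that records, as device state, the
    address most recently read and the address most recently written,
    as the read/write address fields of claims 13 and 16 describe."""

    def __init__(self, size: int):
        self.cells = [0] * size
        self.last_read_addr = None   # most recent read address
        self.last_write_addr = None  # most recent write address

    def read(self, addr: int) -> int:
        self.last_read_addr = addr   # update state field on every read
        return self.cells[addr]

    def write(self, addr: int, value: int) -> None:
        self.last_write_addr = addr  # update state field on every write
        self.cells[addr] = value
```

Because reads and writes update distinct fields, a status-register consumer (such as the condition-matching of claim 9) can test them independently.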
17. A non-transitory computer-usable medium comprising a computer-usable program that causes a computer to function as each component of the processor according to any one of claims 1 to 8.
18. The non-transitory computer-usable medium according to claim 17, wherein the non-transitory computer-usable medium is selected from the set of disk, tape, or other magnetic, optical, and electronic storage media.
CN201810620150.0A 2017-06-16 2018-06-15 Programmable device, method of operation thereof, and computer usable medium Active CN108804139B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762521250P 2017-06-16 2017-06-16
US62/521,250 2017-06-16

Publications (2)

Publication Number Publication Date
CN108804139A true CN108804139A (en) 2018-11-13
CN108804139B CN108804139B (en) 2020-10-20

Family

ID=64086445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810620150.0A Active CN108804139B (en) 2017-06-16 2018-06-15 Programmable device, method of operation thereof, and computer usable medium

Country Status (1)

Country Link
CN (1) CN108804139B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960533A (en) * 2019-03-29 2019-07-02 苏州中晟宏芯信息科技有限公司 Floating-point operation method, apparatus, equipment and storage medium
CN110134441A (en) * 2019-05-23 2019-08-16 苏州浪潮智能科技有限公司 RISC-V branch prediction method, device, electronic equipment and storage medium
CN110516789A (en) * 2019-08-09 2019-11-29 苏州浪潮智能科技有限公司 The processing method of instruction set, device and relevant device in convolutional network accelerator
CN110851259A (en) * 2019-11-12 2020-02-28 上海燧原智能科技有限公司 Interrupt control method, interrupt controller, computer device and storage medium
CN110928816A (en) * 2019-10-28 2020-03-27 北京时代民芯科技有限公司 On-chip configurable interrupt control system circuit
CN111314270A (en) * 2018-12-12 2020-06-19 上海领甲数据科技有限公司 Data encryption and decryption method based on validity period uniform distribution symmetric algorithm
CN111638910A (en) * 2020-05-22 2020-09-08 中国人民解放军国防科技大学 Shift type and pointer type mixed register queue data storage method and system
CN111738432B (en) * 2020-08-10 2020-12-29 电子科技大学 Neural network processing circuit supporting self-adaptive parallel computation
WO2021073053A1 (en) * 2019-10-15 2021-04-22 百度在线网络技术(北京)有限公司 Device and method for convolution operation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6282628B1 (en) * 1999-02-24 2001-08-28 International Business Machines Corporation Method and system for a result code for a single-instruction multiple-data predicate compare operation
CN1490722A (en) * 2003-09-19 2004-04-21 清华大学 Graded task switching method based on PowerPC processor structure
CN1508690A (en) * 2002-12-19 2004-06-30 �Ҵ���˾ Method and system for tracing repeated instruction
CN1726469A (en) * 2002-12-05 2006-01-25 国际商业机器公司 Processor virtualization mechanism via an enhanced restoration of hard architected states
CN102214085A (en) * 2010-04-12 2011-10-12 瑞萨电子株式会社 Microcomputer and interrupt control method
CN106355246A (en) * 2015-10-08 2017-01-25 上海兆芯集成电路有限公司 Tri-configuration neural network element

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6282628B1 (en) * 1999-02-24 2001-08-28 International Business Machines Corporation Method and system for a result code for a single-instruction multiple-data predicate compare operation
CN1726469A (en) * 2002-12-05 2006-01-25 国际商业机器公司 Processor virtualization mechanism via an enhanced restoration of hard architected states
CN1508690A (en) * 2002-12-19 2004-06-30 �Ҵ���˾ Method and system for tracing repeated instruction
CN1490722A (en) * 2003-09-19 2004-04-21 清华大学 Graded task switching method based on PowerPC processor structure
CN102214085A (en) * 2010-04-12 2011-10-12 瑞萨电子株式会社 Microcomputer and interrupt control method
CN106355246A (en) * 2015-10-08 2017-01-25 上海兆芯集成电路有限公司 Tri-configuration neural network element

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Yanjun, "Research on design methods for application-specific instruction-set processors", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111314270A (en) * 2018-12-12 2020-06-19 上海领甲数据科技有限公司 Data encryption and decryption method based on validity period uniform distribution symmetric algorithm
CN111314270B (en) * 2018-12-12 2022-09-30 上海领甲数据科技有限公司 Data encryption and decryption method based on validity period uniform distribution symmetric algorithm
CN109960533B (en) * 2019-03-29 2023-04-25 合芯科技(苏州)有限公司 Floating point operation method, device, equipment and storage medium
CN109960533A (en) * 2019-03-29 2019-07-02 苏州中晟宏芯信息科技有限公司 Floating-point operation method, apparatus, equipment and storage medium
CN110134441A (en) * 2019-05-23 2019-08-16 苏州浪潮智能科技有限公司 RISC-V branch prediction method, device, electronic equipment and storage medium
CN110134441B (en) * 2019-05-23 2020-11-10 苏州浪潮智能科技有限公司 RISC-V branch prediction method, apparatus, electronic device and storage medium
CN110516789A (en) * 2019-08-09 2019-11-29 苏州浪潮智能科技有限公司 The processing method of instruction set, device and relevant device in convolutional network accelerator
CN110516789B (en) * 2019-08-09 2022-02-18 苏州浪潮智能科技有限公司 Method and device for processing instruction set in convolutional network accelerator and related equipment
WO2021073053A1 (en) * 2019-10-15 2021-04-22 百度在线网络技术(北京)有限公司 Device and method for convolution operation
US11556614B2 (en) 2019-10-15 2023-01-17 Apollo Intelligent Driving Technology (Beijing) Co., Ltd. Apparatus and method for convolution operation
CN110928816A (en) * 2019-10-28 2020-03-27 北京时代民芯科技有限公司 On-chip configurable interrupt control system circuit
CN110928816B (en) * 2019-10-28 2021-06-08 北京时代民芯科技有限公司 On-chip configurable interrupt control system circuit
CN110851259B (en) * 2019-11-12 2021-03-05 上海燧原智能科技有限公司 Interrupt control method, interrupt controller, computer device and storage medium
CN110851259A (en) * 2019-11-12 2020-02-28 上海燧原智能科技有限公司 Interrupt control method, interrupt controller, computer device and storage medium
CN111638910B (en) * 2020-05-22 2022-07-05 中国人民解放军国防科技大学 Shift type and pointer type mixed register queue data storage method and system
CN111638910A (en) * 2020-05-22 2020-09-08 中国人民解放军国防科技大学 Shift type and pointer type mixed register queue data storage method and system
CN111738432B (en) * 2020-08-10 2020-12-29 电子科技大学 Neural network processing circuit supporting self-adaptive parallel computation

Also Published As

Publication number Publication date
CN108804139B (en) 2020-10-20

Similar Documents

Publication Publication Date Title
CN108564169A (en) Hardware processing element, neural network unit and computer usable medium
TWI667612B (en) An apparatus and a method for operating an apparatus
TWI662485B (en) An appratus, a method for operating an appratus and a computer program product
CN108804139A (en) Programmable device and its operating method and computer usable medium
KR102064642B1 (en) Neural network unit with neural memory and array of neural processing units that collectively perform multi-word distance rotates of row of data received from neural memory
TWI616825B (en) Neural network unit with output buffer feedback and masking capability
CN106445468B (en) The direct execution of the execution unit of micro- operation of load framework register file is instructed using processor architecture
KR101979069B1 (en) Neural Network Unit That Performs Efficient 3-Dimensional Convolutions
KR101902658B1 (en) Processor with memory array operable as either cache memory or neural network unit memory
KR101997325B1 (en) Processor with memory array operable as either last level cache slice or neural network unit memory
CN108805276A (en) Processor, method and computer usable medium for operation processing device
CN108133268A (en) With the processor that can be used as victim cache or the memory array of neural network cell memory operation
CN108133262A (en) With for perform it is efficient 3 dimension convolution memory layouts neural network unit
CN108805275A (en) Programmable device and its operating method and computer usable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 301, 2537 Jinke Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201203

Patentee after: Shanghai Zhaoxin Semiconductor Co.,Ltd.

Address before: Room 301, 2537 Jinke Road, Zhangjiang hi tech park, Pudong New Area, Shanghai 201203

Patentee before: VIA ALLIANCE SEMICONDUCTOR Co.,Ltd.
