CN106503796A - Multi-operation neural network unit - Google Patents

Multi-operation neural network unit

Info

Publication number
CN106503796A
CN106503796A (application CN201610864609.2A)
Authority
CN
China
Prior art keywords
unit
processing unit
neural processing
data
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610864609.2A
Other languages
Chinese (zh)
Other versions
CN106503796B (en)
Inventor
G. Glenn Henry
Terry Parks
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhaoxin Semiconductor Co Ltd
Original Assignee
Shanghai Zhaoxin Integrated Circuit Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/090,807 (US10380481B2)
Priority claimed from US15/090,796 (US10228911B2)
Priority claimed from US15/090,823 (US10409767B2)
Priority claimed from US15/090,727 (US10776690B2)
Application filed by Shanghai Zhaoxin Integrated Circuit Co Ltd
Publication of CN106503796A
Application granted
Publication of CN106503796B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G06F7/575Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3893Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/468Specific access rights for resources, e.g. using capability register
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit

Abstract

The neural network unit includes an array of neural processing units. The first and second multiplexed registers of the N neural processing units are selectively operable collectively as first and second N-word rotators, respectively. First and second memories hold rows of N weight words and N data words, respectively, and each provides a row of N weight/data words to the corresponding neural processing units. The neural processing units selectively: perform multiply-accumulate operations on multiple rows of weight words received from the first memory and a row of data words received from the second memory using the second N-word rotator; perform convolution operations, using the first N-word rotator, on multiple rows of weight words received from the first memory and multiple rows of data words from the second memory; and perform pooling operations on multiple rows of weight words received from the first memory using the first N-word rotator.
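The rotator-based multiply-accumulate scheme the abstract describes can be sketched in plain software. The sketch below is illustrative only (the function name and the Python model are not from the patent): each of the N neural processing units holds one data word in its multiplexed register and a running sum in its accumulator; on each step every unit multiplies its current data word by the next weight word in its column and adds the product to its accumulator, after which the data words rotate one position through the array.

```python
def rotator_multiply_accumulate(weight_rows, data_row):
    """Toy model of N NPUs sharing one row of data words via an N-word rotator.

    weight_rows: list of rows, each a list of N weight words.
    data_row: one row of N data words.
    Returns the N accumulator values, where accumulator j ends up holding
    the sum over i of weight_rows[i][j] * data_row[(j + i) mod N].
    """
    n = len(data_row)
    regs = list(data_row)            # the N multiplexed registers
    acc = [0] * n                    # one accumulator per NPU
    for row in weight_rows:
        for j in range(n):           # in hardware, all N NPUs run in parallel
            acc[j] += row[j] * regs[j]
        regs = regs[1:] + regs[:1]   # rotate the data words one position
    return acc
```

With the weight rows laid out to match the rotation, each accumulator ends up holding one neuron's dot product; with a constant data row, the result is simply the column sums of the weights.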

Description

Multi-operation neural network unit
Technical field
The present invention relates to processors, and more particularly to a processor that improves the computational performance and efficiency of artificial neural networks.
This application claims priority to the following U.S. provisional applications, which are incorporated herein by reference in their entirety.
This application is related to the following concurrently filed U.S. applications, which are incorporated herein by reference in their entirety.
Background
In recent years, artificial neural networks (ANN) have attracted renewed attention. This research is commonly referred to as deep learning, computer learning, and similar terms. The increase in computational capability of general-purpose processors has, after several decades, revived interest in artificial neural networks. Recent applications of artificial neural networks include speech and image recognition, among others. The demand for improved computational performance and efficiency for artificial neural networks appears to be increasing.
Summary of the invention
In view of the above, the present invention provides a neural network unit. The neural network unit includes an array of N neural processing units (NPUs), a first memory, and a second memory. Each neural processing unit includes an arithmetic unit, an accumulator, and first and second multiplexed registers. The first and second multiplexed registers have first and second outputs, respectively, which are received by the arithmetic unit and by the corresponding first and second multiplexed registers of an adjacent neural processing unit. The first multiplexed registers of the N neural processing units are selectively operable collectively as a first N-word rotator, and the second multiplexed registers are selectively operable collectively as a second N-word rotator. The first memory holds rows of N weight words and provides a row of N weight words to the corresponding neural processing units of the N neural processing units. The second memory holds rows of N data words and provides a row of N data words to the corresponding neural processing units. The neural network unit is programmable to cause the array of neural processing units to selectively: perform multiply-accumulate operations on multiple rows of N weight words received from the first memory and a row of N data words received from the second memory using the second N-word rotator; perform convolution operations, using the first N-word rotator, on multiple rows of N weight words received from the first memory and multiple rows of N data words received from the second memory, wherein the rows of weight words hold a data matrix and the rows of data words hold the elements of a convolution kernel; and perform pooling operations, using the first N-word rotator, on multiple rows of N weight words received from the first memory.
The present invention also provides a method for operating a neural network unit. The neural network unit includes an array of N neural processing units, each of which includes an arithmetic unit, an accumulator, and first and second multiplexed registers. The first and second multiplexed registers have first and second outputs, respectively, which are received by the arithmetic unit and by the corresponding first and second multiplexed registers of an adjacent neural processing unit. The method includes: using a first memory, providing a row of N weight words to the corresponding neural processing units of the N neural processing units; using a second memory, providing a row of N data words to the corresponding neural processing units; and programming the neural network unit to cause the array of neural processing units to selectively: in a first instance, perform multiply-accumulate operations on multiple rows of N weight words received from the first memory and a row of N data words received from the second memory using the second N-word rotator; in a second instance, perform convolution operations, using the first N-word rotator, on multiple rows of N weight words received from the first memory and multiple rows of N data words received from the second memory, wherein the rows of weight words hold a data matrix and the rows of data words hold the elements of a convolution kernel; and in a third instance, perform pooling operations, using the first N-word rotator, on multiple rows of N weight words received from the first memory.
The present invention also provides a computer program product encoded on at least one non-transitory computer-usable medium for use with a computing device. The computer program product includes computer-usable program code embodied in the medium for specifying a neural network unit. The program code includes first, second, and third program code. The first program code specifies an array of N neural processing units (NPUs). Each neural processing unit includes an arithmetic unit, an accumulator, and first and second multiplexed registers. The first and second multiplexed registers have first and second outputs, respectively, which are received by the arithmetic unit and by the corresponding first and second multiplexed registers of an adjacent neural processing unit. The first multiplexed registers of the N neural processing units are selectively operable collectively as a first N-word rotator, and the second multiplexed registers are selectively operable collectively as a second N-word rotator. The second program code specifies a first memory that holds rows of N weight words and provides a row of N weight words to the corresponding neural processing units of the N neural processing units. The third program code specifies a second memory that holds rows of N data words and provides a row of N data words to the corresponding neural processing units. The neural network unit is programmable to cause the array of neural processing units to selectively: perform multiply-accumulate operations on multiple rows of N weight words received from the first memory and a row of N data words received from the second memory using the second N-word rotator; perform convolution operations, using the first N-word rotator, on multiple rows of N weight words received from the first memory and multiple rows of N data words received from the second memory, wherein the rows of weight words hold a data matrix and the rows of data words hold the elements of a convolution kernel; and perform pooling operations, using the first N-word rotator, on multiple rows of N weight words received from the first memory.
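The convolution operation described in the summary can be modeled in plain software. The following sketch is illustrative only and abstracts away the rotators and the RAM layout (the function name is not from the patent): the rows of weight words supply a data matrix, the rows of data words supply a small convolution kernel, and each output element is the sum of elementwise products of the kernel with one window of the data matrix. As is common in neural-network usage, the kernel is applied without flipping.

```python
def convolve(data_matrix, kernel):
    """Valid (no-padding) 2-D convolution of a data matrix with a kernel."""
    rows, cols = len(data_matrix), len(data_matrix[0])
    krows, kcols = len(kernel), len(kernel[0])
    out = []
    for r in range(rows - krows + 1):
        out_row = []
        for c in range(cols - kcols + 1):
            # Sum of elementwise products over one kernel-sized window.
            out_row.append(sum(kernel[i][j] * data_matrix[r + i][c + j]
                               for i in range(krows)
                               for j in range(kcols)))
        out.append(out_row)
    return out
```

In the hardware, the same windowed sums are produced by rotating a row of matrix words past the N processing units while each unit accumulates its own output element; the software loop above only states what is computed, not how.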
Specific embodiments of the present invention are further described below with reference to the following embodiments and drawings.
Description of the drawings
Fig. 1 is a block diagram illustrating a processor that includes a neural network unit (NNU).
Fig. 2 is a block diagram illustrating a neural processing unit (NPU) of Fig. 1.
Fig. 3 is a block diagram illustrating the use of the N multiplexed registers of the N NPUs of the NNU of Fig. 1 to operate as an N-word rotator, or circular shifter, on a row of data words received from the data random access memory of Fig. 1.
Fig. 4 is a table illustrating a program stored in the program memory of the NNU of Fig. 1 and executed by the NNU.
Fig. 5 is a timing diagram illustrating the execution of the program of Fig. 4 by the NNU.
Fig. 6A is a block diagram illustrating the NNU of Fig. 1 executing the program of Fig. 4.
Fig. 6B is a flowchart illustrating the operation of the processor of Fig. 1 to perform an architectural program that uses the NNU to perform multiply-accumulate-activation-function computations classically associated with neurons of a hidden layer of an artificial neural network, such as those performed by the program of Fig. 4.
Fig. 7 is a block diagram illustrating another embodiment of the NPU of Fig. 1.
Fig. 8 is a block diagram illustrating yet another embodiment of the NPU of Fig. 1.
Fig. 9 is a table illustrating a program stored in the program memory of the NNU of Fig. 1 and executed by the NNU.
Fig. 10 is a timing diagram illustrating the execution of the program of Fig. 9 by the NNU.
Fig. 11 is a block diagram illustrating an embodiment of the NNU of Fig. 1. In the embodiment of Fig. 11, a neuron is split into two portions, the activation function unit portion and the ALU portion (which also includes the shift register portion), and each activation function unit portion is shared by multiple ALU portions.
Fig. 12 is a timing diagram illustrating the NNU of Fig. 11 executing the program of Fig. 4.
Fig. 13 is a timing diagram illustrating the NNU of Fig. 11 executing the program of Fig. 4.
Fig. 14 is a block diagram illustrating a move-to-neural-network (MTNN) architectural instruction and its operation with respect to portions of the NNU of Fig. 1.
Fig. 15 is a block diagram illustrating a move-from-neural-network (MFNN) architectural instruction and its operation with respect to portions of the NNU of Fig. 1.
Fig. 16 is a block diagram illustrating an embodiment of the data random access memory of Fig. 1.
Fig. 17 is a block diagram illustrating an embodiment of the weight random access memory and buffer of Fig. 1.
Fig. 18 is a block diagram illustrating a dynamically configurable NPU of Fig. 1.
Fig. 19 is a block diagram illustrating, according to the embodiment of Fig. 18, the use of the 2N multiplexed registers of the N NPUs of the NNU of Fig. 1 to operate as a rotator on a row of data words received from the data random access memory of Fig. 1.
Fig. 20 is a table illustrating a program stored in the program memory of the NNU of Fig. 1 and executed by the NNU, the NNU having NPUs according to the embodiment of Fig. 18.
Fig. 21 is a timing diagram illustrating the NNU executing the program of Fig. 20, the NNU having NPUs of Fig. 18 operating in a narrow configuration.
Fig. 22 is a block diagram illustrating the NNU of Fig. 1 having the NPUs of Fig. 18 to execute the program of Fig. 20.
Fig. 23 is a block diagram illustrating another embodiment of a dynamically configurable NPU of Fig. 1.
Fig. 24 is a block diagram illustrating an example of a data structure used by the NNU of Fig. 1 to perform a convolution operation.
Fig. 25 is a flowchart illustrating the operation of the processor of Fig. 1 to perform an architectural program that uses the NNU to perform a convolution of a convolution kernel with the data array of Fig. 24.
Fig. 26A is a program listing of an NNU program that performs a convolution of a data matrix with the convolution kernel of Fig. 24 and writes the result back to the weight random access memory.
Fig. 26B is a block diagram illustrating an embodiment of certain fields of the control register of the NNU of Fig. 1.
Fig. 27 is a block diagram illustrating an example of the weight random access memory of Fig. 1 populated with input data upon which the NNU of Fig. 1 performs a pooling operation.
Fig. 28 is a program listing of an NNU program that performs a pooling operation on the input data matrix of Fig. 27 and writes the result back to the weight random access memory.
Fig. 29A is a block diagram illustrating an embodiment of the control register of Fig. 1.
Fig. 29B is a block diagram illustrating another embodiment of the control register of Fig. 1.
Fig. 29C is a block diagram illustrating a further embodiment.
Fig. 30 is a block diagram illustrating an embodiment of the activation function unit (AFU) of Fig. 2.
Fig. 31 illustrates an example of the operation of the AFU of Fig. 30.
Fig. 32 illustrates a second example of the operation of the AFU of Fig. 30.
Fig. 33 illustrates a third example of the operation of the AFU of Fig. 30.
Fig. 34 is a block diagram illustrating, in further detail, portions of the processor of Fig. 1 and of the NNU.
Fig. 35 is a block diagram illustrating a processor with a variable-rate NNU.
Fig. 36A is a timing diagram illustrating an example of the processor with the NNU operating in normal mode, i.e., at the primary clock rate.
Fig. 36B is a timing diagram illustrating an example of the processor with the NNU operating in relaxed mode, i.e., at a clock rate lower than the primary clock rate.
Fig. 37 is a flowchart illustrating the operation of the processor of Fig. 35.
Fig. 38 is a block diagram illustrating the sequencer of the NNU in further detail.
Fig. 39 is a block diagram illustrating certain fields of the control and status register of the NNU.
Fig. 40 is a block diagram illustrating an example of an Elman recurrent neural network (RNN).
Fig. 41 is a block diagram illustrating an example of the layout of data within the data random access memory and weight random access memory of the NNU as it performs calculations associated with the Elman RNN of Fig. 40.
Fig. 42 is a table illustrating a program stored in the program memory of the NNU and executed by the NNU, which uses data and weights according to the arrangement of Fig. 41 to accomplish the Elman RNN.
Fig. 43 is a block diagram illustrating an example of a Jordan RNN.
Fig. 44 is a block diagram illustrating an example of the layout of data within the data random access memory and weight random access memory of the NNU as it performs calculations associated with the Jordan RNN of Fig. 43.
Fig. 45 is a table illustrating a program stored in the program memory of the NNU and executed by the NNU, which uses data and weights according to the arrangement of Fig. 44 to accomplish the Jordan RNN.
Fig. 46 is a block diagram illustrating an embodiment of a long short-term memory (LSTM) cell.
Fig. 47 is a block diagram illustrating an example of the layout of data within the data random access memory and weight random access memory of the NNU as it performs calculations associated with a layer of the LSTM cells of Fig. 46.
Fig. 48 is a table illustrating a program stored in the program memory of the NNU and executed by the NNU, which uses data and weights according to the arrangement of Fig. 47 to accomplish the calculations associated with a layer of LSTM cells.
Fig. 49 is a block diagram illustrating an embodiment of the NNU in which the NPU groups have output buffer masking and feedback capability.
Fig. 50 is a block diagram illustrating an example of the layout of data within the data random access memory, weight random access memory, and output buffer of the NNU of Fig. 49 as it performs calculations associated with a layer of the LSTM cells of Fig. 46.
Fig. 51 is a table illustrating a program stored in the program memory of the NNU and executed by the NNU of Fig. 49, which uses data and weights according to the arrangement of Fig. 50 to accomplish the calculations associated with a layer of LSTM cells.
Fig. 52 is a block diagram illustrating an embodiment of the NNU in which the NPU groups have output buffer masking and feedback capability and share activation function units.
Fig. 53 is a block diagram illustrating another embodiment of the layout of data within the data random access memory, weight random access memory, and output buffer of the NNU of Fig. 49 as it performs calculations associated with a layer of the LSTM cells of Fig. 46.
Fig. 54 is a table illustrating a program stored in the program memory of the NNU and executed by the NNU of Fig. 49, which uses data and weights according to the arrangement of Fig. 53 to accomplish the calculations associated with a layer of LSTM cells.
Fig. 55 is a block diagram illustrating portions of an NPU according to another embodiment of the present invention.
Fig. 56 is a block diagram illustrating an example of the layout of data within the data random access memory and weight random access memory of the NNU as it performs calculations associated with the Jordan RNN of Fig. 43 while employing the embodiment of Fig. 55.
Fig. 57 is a table illustrating a program stored in the program memory of the NNU and executed by the NNU, which uses data and weights according to the arrangement of Fig. 56 to accomplish the Jordan RNN.
Detailed description
Processor with architectural neural network unit
Fig. 1 is a block diagram illustrating a processor 100 that includes a neural network unit (NNU) 121. As shown, the processor 100 includes an instruction fetch unit 101, an instruction cache 102, an instruction translator 104, a rename unit 106, reservation stations 108, media registers 118, general-purpose registers 116, execution units 112 other than the NNU 121, and a memory subsystem 114.
The processor 100 is an electronic device that serves as a central processing unit (CPU) on an integrated circuit. The processor 100 receives digital data as input, processes the data according to instructions fetched from a memory, and generates results of the operations prescribed by the instructions as output. The processor 100 may be employed in a desktop, mobile, or tablet computer, and for uses such as computation, text editing, multimedia display, and Internet browsing. The processor 100 may also be disposed in an embedded system to control a wide variety of devices including appliances, mobile telephones, smart phones, automobiles, and industrial controllers. A CPU is the electronic circuitry (i.e., hardware) that executes the instructions of a computer program (also known as a computer application or application) by performing operations on data that include arithmetic, logical, and input/output operations. An integrated circuit is a set of electronic circuits fabricated on a small piece of semiconductor material, typically silicon. An integrated circuit is also referred to as a chip, a microchip, or a die.
The instruction fetch unit 101 controls the fetching of architectural instructions 103 from system memory (not shown) into the instruction cache 102. The instruction fetch unit 101 provides a fetch address to the instruction cache 102 that specifies the memory address of a cache line of architectural instruction bytes fetched into the instruction cache 102. The fetch address is based on the current value of the instruction pointer (not shown), or program counter, of the processor 100. Generally, the program counter is incremented sequentially by the size of an instruction unless a control instruction, such as a branch, call, or return, is encountered in the instruction stream, or an exception condition occurs, such as an interrupt, trap, exception, or fault, in which case the program counter is updated with a non-sequential address, such as a branch target address, return address, or exception vector. Generally speaking, the program counter is updated in response to the execution of instructions by the execution units 112/121. The program counter may also be updated in response to the detection of an exception condition, for example, when the instruction translator 104 encounters an instruction 103 that is not defined by the instruction set architecture of the processor 100.
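The program-counter update policy just described (sequential increment by instruction size unless a control transfer or exception condition intervenes) can be summarized as a small decision function. This is a hedged illustration only; the names and the priority ordering are assumptions for the sketch, not the processor 100's implementation.

```python
def next_pc(pc, insn_size, branch_target=None, exception_vector=None):
    """Compute the next fetch address.

    An exception condition (interrupt, trap, or fault) takes priority
    over a control transfer; a taken branch, call, or return takes
    priority over the default sequential fetch.
    """
    if exception_vector is not None:
        return exception_vector       # non-sequential: exception vector
    if branch_target is not None:
        return branch_target          # non-sequential: branch target / return address
    return pc + insn_size             # default: increment by instruction size
```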
The instruction cache 102 caches the architectural instructions 103 fetched from a system memory coupled to the processor 100. The architectural instructions 103 include a move-to-neural-network (MTNN) instruction and a move-from-neural-network (MFNN) instruction, which are described in more detail below. In one embodiment, the architectural instructions 103 are instructions of the x86 instruction set architecture, with the addition of the MTNN and MFNN instructions. In the context of the present disclosure, an x86 instruction set architecture processor is a processor that, when executing the same machine language instructions, produces identical results at the instruction set architecture level. However, other instruction set architectures, such as the ARM, SPARC, or PowerPC architectures, may also be employed in other embodiments of the invention. The instruction cache 102 provides the architectural instructions 103 to the instruction translator 104, which translates the architectural instructions 103 into microinstructions 105.
The microinstructions 105 are provided to the rename unit 106 and are eventually executed by the execution units 112/121. These microinstructions 105 implement the architectural instructions. In a preferred embodiment, the instruction translator 104 includes a first portion that translates frequently executed and/or relatively less complex architectural instructions 103 into microinstructions 105. The instruction translator 104 also includes a second portion that includes a microcode unit (not shown). The microcode unit includes a microcode memory that holds microcode instructions for implementing complex and/or infrequently used instructions of the architectural instruction set. The microcode unit also includes a microsequencer that provides a non-architectural micro-program counter (micro-PC) to the microcode memory. In a preferred embodiment, the microcode instructions are translated into microinstructions 105 by a micro-translator (not shown). A selector selects the microinstructions 105 from either the first portion or the second portion for provision to the rename unit 106, depending upon whether or not the microcode unit currently has control.
The rename unit 106 renames the architectural registers specified by the architectural instructions 103 to physical registers of the processor 100. In a preferred embodiment, the processor 100 includes a reorder buffer (not shown). The rename unit 106 allocates, in program order, an entry in the reorder buffer for each microinstruction 105. This enables the processor 100 to retire the microinstructions 105, and their corresponding architectural instructions 103, in program order. In one embodiment, the media registers 118 are 256 bits wide and the general-purpose registers 116 are 64 bits wide. In one embodiment, the media registers 118 are x86 media registers, such as Advanced Vector Extensions (AVX) registers.
In one embodiment, each entry of the reorder buffer includes storage for the result of the microinstruction 105. Additionally, the processor 100 includes an architectural register file that includes a physical register for each of the architectural registers, e.g., the media registers 118, the general-purpose registers 116, and the other architectural registers. (In a preferred embodiment, there are separate register files for the media registers 118 and the general-purpose registers 116, for example, since they are of different sizes.) For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with the reorder buffer index of the newest older microinstruction 105 that writes to the architectural register. When an execution unit 112/121 completes execution of a microinstruction 105, it writes the result to the reorder buffer entry of that microinstruction 105. When a microinstruction 105 retires, a retire unit (not shown) writes the result from the reorder buffer field of the microinstruction to the register of the physical register file associated with the architectural destination register specified by the retiring microinstruction 105.
In another embodiment, the processor 100 includes a physical register file that includes more physical registers than the number of architectural registers, but the processor 100 does not include an architectural register file, and the reorder buffer entries do not include result storage. (In a preferred embodiment, there are separate register files for the media registers 118 and the general-purpose registers 116, for example, since they are of different sizes.) The processor 100 also includes a pointer table with an associated pointer for each architectural register. For the destination operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the destination operand field of the microinstruction 105 with a pointer to a free register in the physical register file. If no registers are free in the physical register file, the rename unit 106 stalls the pipeline. For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with a pointer to the register in the physical register file assigned to the newest older microinstruction 105 that writes to the architectural register. When an execution unit 112/121 completes execution of a microinstruction 105, it writes the result to the register of the physical register file pointed to by the destination operand field of the microinstruction 105. When a microinstruction 105 retires, the retire unit copies the destination operand field value of the microinstruction 105 to the pointer of the pointer table associated with the architectural destination register specified by the retiring microinstruction 105.
The reservation stations 108 hold microinstructions 105 until they are ready to be issued to an execution unit 112/121 for execution. A microinstruction 105 is ready to be issued when all of its source operands are available and an execution unit 112/121 is available to execute it. The execution units 112/121 receive register source operands from the reorder buffer or the architectural register file described in the first embodiment above, or from the physical register file described in the second embodiment above. Additionally, the execution units 112/121 may receive register source operands directly via result forwarding buses (not shown). Further, the execution units 112/121 may receive from the reservation stations 108 immediate operands specified by the microinstructions 105. The MTNN and MFNN architectural instructions 103 include an immediate operand that specifies a function to be performed by the neural network unit 121, and this function is provided by one or more of the microinstructions 105 into which the MTNN and MFNN architectural instructions 103 are translated, as described in more detail below.
The execution units 112 include one or more load/store units (not shown) that load data from the memory subsystem 114 and store data to the memory subsystem 114. In a preferred embodiment, the memory subsystem 114 includes a memory management unit (not shown), which may include, for example, multiple translation lookaside buffers, a tablewalk unit, a level-1 data cache (in addition to the instruction cache 102), a level-2 unified cache, and a bus interface unit serving as the interface between the processor 100 and the system memory. In one embodiment, the processor 100 of Fig. 1 is representative of one of multiple processing cores of a multi-core processor, and the cores share a last-level cache. The execution units 112 may also include multiple integer units, multiple media units, multiple floating-point units, and a branch unit.
The neural network unit 121 includes a weight random access memory (RAM) 124, a data RAM 122, N neural processing units (NPUs) 126, a program memory 129, a sequencer 128, and control and status registers 127. The NPUs 126 function conceptually as the neurons of a neural network. The weight RAM 124, the data RAM 122, and the program memory 129 are each writable and readable via the MTNN and MFNN architectural instructions 103, respectively. The weight RAM 124 is arranged as W rows of N weight words each, and the data RAM 122 is arranged as D rows of N data words each. Each data word and each weight word is a plurality of bits, preferably 8 bits, 9 bits, 12 bits or 16 bits. Each data word functions as the output value of a neuron of the previous layer in the network (sometimes referred to as an activation), and each weight word functions as the weight associated with a connection coming into a neuron of the current layer of the network. Although in many uses of the neural network unit 121 the words, or operands, held in the weight RAM 124 are in fact weights associated with connections into a neuron, it should be understood that in some uses of the neural network unit 121 the words held in the weight RAM 124 are not weights, but are nevertheless referred to as "weight words" because they are stored in the weight RAM 124. For example, in some uses of the neural network unit 121, such as the convolution example of Figs. 24 through 26A or the pooling example of Figs. 27 through 28, the weight RAM 124 may hold objects other than weights, such as elements of a data matrix (e.g., image pixel data). Similarly, although in many uses of the neural network unit 121 the words, or operands, held in the data RAM 122 are in fact the output values, or activations, of neurons, it should be understood that in some uses of the neural network unit 121 they are not, but are nevertheless referred to as "data words" because they are stored in the data RAM 122. For example, in some uses of the neural network unit 121, such as the convolution example of Figs. 24 through 26A, the data RAM 122 may hold non-neuron outputs, such as elements of a convolution kernel.
In one embodiment, the NPUs 126 and the sequencer 128 comprise combinational logic, sequential logic, state machines, or a combination thereof. An architectural instruction (e.g., the MFNN instruction 1500) may load the contents of the status register 127 into one of the general-purpose registers 116 in order to determine the status of the neural network unit 121, e.g., that the neural network unit 121 has completed a command or a program from the program memory 129, or that the neural network unit 121 is free to receive a new command or to start a new neural network unit program.
The number of NPUs 126 may be increased as needed, and the width and depth of the weight RAM 124 and the data RAM 122 may be scaled accordingly. In a preferred embodiment, the weight RAM 124 is larger than the data RAM 122, since a typical neural network layer includes many connections, and therefore many weights, associated with each neuron, requiring greater storage. Various embodiments of the sizes of the data and weight words, of the sizes of the weight RAM 124 and the data RAM 122, and of the number of NPUs 126 are disclosed herein. In one embodiment, the neural network unit 121 has a 64 KB (8192 bits × 64 rows) data RAM 122, a 2 MB (8192 bits × 2048 rows) weight RAM 124, and 512 NPUs 126. This neural network unit 121 is manufactured in a Taiwan Semiconductor Manufacturing Company (TSMC) 16 nm process and occupies approximately 3.3 mm².
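A quick arithmetic check of the example sizing described above, assuming 512 NPUs each consuming a 16-bit word per row (so 8192 bits per RAM row):

```python
# Sanity check of the example configuration: 512 NPUs x 16-bit words
# gives 8192 bits per RAM row; multiply by the row counts for capacity.
NPUS = 512
WORD_BITS = 16
row_bits = NPUS * WORD_BITS                         # 8192 bits per row

data_ram_bytes = row_bits * 64 // 8                 # 64 rows  -> 65536 B
weight_ram_bytes = row_bits * 2048 // 8             # 2048 rows -> 2097152 B

print(data_ram_bytes // 1024, "KB")                 # 64 KB
print(weight_ram_bytes // (1024 * 1024), "MB")      # 2 MB
```

This confirms that the stated 64 KB and 2 MB figures follow directly from the row width and row counts.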
The sequencer 128 fetches instructions from the program memory 129 and executes them, which includes generating address and control signals for provision to the data RAM 122, the weight RAM 124, and the NPUs 126. The sequencer 128 generates a memory address 123 and a read command for provision to the data RAM 122 in order to select one of the D rows of N data words for provision to the N NPUs 126. The sequencer 128 also generates a memory address 125 and a read command for provision to the weight RAM 124 in order to select one of the W rows of N weight words for provision to the N NPUs 126. The sequence of the addresses 123 and 125 that the sequencer 128 generates and provides to the NPUs 126 determines the "connections" between the neurons. The sequencer 128 also generates a memory address 123 and a write command for provision to the data RAM 122 in order to select one of the D rows of N data words to be written by the N NPUs 126, and a memory address 125 and a write command for provision to the weight RAM 124 in order to select one of the W rows of N weight words to be written by the N NPUs 126. The sequencer 128 also generates a memory address 131 to the program memory 129 in order to select a neural network unit instruction for provision to the sequencer 128, as described in subsequent sections.
The memory address 131 corresponds to a program counter (not shown) that the sequencer 128 normally increments through sequential locations of the program memory 129, unless the sequencer 128 encounters a control instruction, such as a loop instruction (see, e.g., Fig. 26A), in which case the sequencer 128 updates the program counter to the target address of the control instruction. The sequencer 128 also generates control signals to the NPUs 126 that instruct them to perform various operations or functions, such as initialization, arithmetic/logical operations, rotate/shift operations, activation functions, and write-back operations, examples of which are described in more detail in subsequent sections (see, e.g., the micro-operations 3418 of Fig. 34).
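The program-counter behavior just described (sequential increment, with a loop instruction redirecting the counter to a target address until its iteration count is exhausted) can be sketched behaviorally; the instruction tuples below are an invented encoding for illustration only, not the patent's instruction format:

```python
def run_sequencer(program):
    """Trace the addresses a sequencer-style program counter visits.
    program: list of tuples; ("loop", target, n) jumps back to `target`
    until it has executed n times, any other tuple just advances the PC."""
    pc, trace, counters = 0, [], {}
    while pc < len(program):
        instr = program[pc]
        trace.append(pc)
        if instr[0] == "loop":
            _, target, n = instr
            counters.setdefault(pc, n)
            counters[pc] -= 1
            pc = target if counters[pc] > 0 else pc + 1
        else:
            pc += 1                      # normal sequential increment
    return trace

# Two multiply-accumulate steps repeated three times, then a write-back:
trace = run_sequencer([("mulacc",), ("mulacc",), ("loop", 0, 3), ("write",)])
print(trace)   # [0, 1, 2, 0, 1, 2, 0, 1, 2, 3]
```

The trace shows why a single loop instruction can account for many effective clock cycles, as noted later for the instruction at address 2 of Fig. 4.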
The N NPUs 126 generate N result words 133, which may be written back to a row of the weight RAM 124 or of the data RAM 122. In a preferred embodiment, the weight RAM 124 and the data RAM 122 are directly coupled to the N NPUs 126. More specifically, the weight RAM 124 and the data RAM 122 are dedicated to the NPUs 126 and are not shared by the other execution units 112 of the processor 100, so that the NPUs 126 are able to consume a row from one or both of the weight RAM 124 and the data RAM 122 each clock cycle in a sustained manner, preferably in a pipelined fashion. In one embodiment, each of the data RAM 122 and the weight RAM 124 is able to provide 8192 bits to the NPUs 126 each clock cycle. The 8192 bits may be consumed as 512 16-bit words or as 1024 8-bit words, as described in more detail below.
The size of the data set that may be processed by the neural network unit 121 is not limited by the sizes of the weight RAM 124 and the data RAM 122, but only by the size of system memory, since data and weights may be moved between system memory and the weight RAM 124 and data RAM 122 using the MTNN and MFNN instructions (e.g., through the media registers 118). In one embodiment, the data RAM 122 is dual-ported, enabling data words to be written to the data RAM 122 concurrently with data words being read from or written to the data RAM 122. Furthermore, the large memory hierarchy of the memory subsystem 114, including the cache memories, provides very high data bandwidth for transfers between the system memory and the neural network unit 121. Still further, in a preferred embodiment, the memory subsystem 114 includes hardware data prefetchers that track memory access patterns, such as loads of neural data and weights from system memory, and perform data prefetches into the cache hierarchy in order to facilitate high-bandwidth, low-latency transfers to the weight RAM 124 and the data RAM 122.
Although embodiments are described herein in which one of the operands provided to each NPU 126 is supplied from a weight memory and is denoted a weight, which is a term commonly used in neural networks, it should be understood that the operands may be other types of data associated with calculations whose speed may be improved by these apparatuses.
Fig. 2 is a block diagram illustrating an NPU 126 of Fig. 1. As shown, the NPU 126 operates to perform many functions, or operations. In particular, the NPU 126 is configured to operate as a neuron, or node, in an artificial neural network to perform a classic multiply-accumulate function, or operation. That is, generally speaking, the NPU 126 (neuron) is configured to: (1) receive an input value from each neuron with which it has a connection, typically but not necessarily from the immediately previous layer of the artificial neural network; (2) multiply each input value by the corresponding weight value associated with the connection to generate a product; (3) add all the products to generate a sum; and (4) perform an activation function on the sum to generate the output of the neuron. However, rather than performing all the multiplies associated with all the connection inputs and then summing their products, as is done conventionally, each neuron of the present invention is configured to perform, in a given clock cycle, the weight multiply operation associated with one of the connection inputs and then add its product to the accumulated value of the products associated with the connection inputs processed in the clock cycles up to that point. Assuming there are M connections into the neuron, after all M products have been accumulated (which takes approximately M clock cycles), the neuron performs the activation function on the accumulated value to generate the output, or result. This has the advantage that the neuron requires fewer multipliers and only a smaller, simpler and faster adder circuit (e.g., a 2-input adder) than the adder that would be required to sum the products of all, or even a subset of, the connection inputs. This, in turn, has the advantage of facilitating a very large number (N) of neurons (NPUs 126) in the neural network unit 121, so that after approximately M clock cycles the neural network unit 121 has generated the outputs of all of the large number (N) of neurons. Finally, the neural network unit 121 constructed of such neurons has the advantage of performing efficiently as an artificial neural network layer for large numbers of different connection inputs. That is, as M varies across different layers, the number of clock cycles required to generate the neuron outputs varies correspondingly, and the resources (e.g., the multipliers and accumulators) remain fully utilized; whereas in a conventional design, for smaller values of M, some of the multipliers and portions of the adder would go unused. Thus, the embodiments described herein have the benefits of both flexibility and efficiency with respect to the number of connection inputs of the neural network unit, and provide extremely high performance.
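The time-multiplexed multiply-accumulate scheme described above can be modeled in a few lines of Python. This is a behavioral sketch, not the hardware: the inner loop over the N neurons happens in parallel in the actual NPUs, and all names here are illustrative:

```python
def nnu_layer(inputs, weights, activation):
    """Model N neurons, each accumulating one product per 'clock cycle'.
    inputs:  list of M input values shared among the neurons
    weights: N x M matrix; weights[j][m] is neuron j's weight for input m
    """
    N, M = len(weights), len(inputs)
    acc = [0.0] * N                    # one accumulator per NPU
    for m in range(M):                 # ~M clock cycles in total
        for j in range(N):             # all N NPUs operate in parallel
            acc[j] += inputs[m] * weights[j][m]   # one MAC per NPU per cycle
    return [activation(a) for a in acc]

relu = lambda x: max(0.0, x)
out = nnu_layer([1.0, 2.0], [[0.5, 0.25], [1.0, -1.0]], relu)
print(out)   # [1.0, 0.0]: 1*0.5 + 2*0.25 = 1.0; 1*1 + 2*(-1) = -1 -> 0 after ReLU
```

Note how the cycle count tracks M while the hardware cost per neuron is a single multiplier and 2-input adder, which is exactly the flexibility/efficiency trade-off argued above.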
The NPU 126 includes a register 205, a 2-input multiplexed register (mux-reg) 208, an arithmetic logic unit (ALU) 204, an accumulator 202, and an activation function unit (AFU) 212. The register 205 receives a weight word 206 from the weight RAM 124 and provides its output 203 on a subsequent clock cycle. The mux-reg 208 selects one of its two inputs 207 and 211 to store in its register and then provide on its output 209 on a subsequent clock cycle. Input 207 receives a data word from the data RAM 122. The other input 211 receives the output 209 of the adjacent NPU 126. The NPU 126 shown in Fig. 2 is denoted NPU J of the N NPUs of Fig. 1. That is, NPU J is a representative instance of the N NPUs 126. In a preferred embodiment, the input 211 of the mux-reg 208 of NPU 126 instance J receives the output 209 of the mux-reg 208 of NPU 126 instance J−1, and the output 209 of the mux-reg 208 of NPU J is provided to the input 211 of the mux-reg 208 of NPU 126 instance J+1. In this manner, the mux-regs 208 of the N NPUs 126 collectively operate as an N-word rotator, or circular shifter, as described in more detail with respect to Fig. 3 below. A control input 213 controls which of the two inputs the mux-reg 208 selects to store in its register for subsequent provision on the output 209.
The ALU 204 has three inputs. One input receives the weight word 203 from the register 205. Another input receives the output 209 of the mux-reg 208. The third input receives the output 217 of the accumulator 202. The ALU 204 performs arithmetic and/or logical operations on its inputs to generate a result provided on its output. In a preferred embodiment, the arithmetic and/or logical operations performed by the ALU 204 are specified by instructions stored in the program memory 129. For example, the multiply-accumulate instruction of Fig. 4 specifies a multiply-accumulate operation, i.e., the result 215 is the sum of the accumulator 202 value 217 and the product of the weight word 203 and the data word on the output 209 of the mux-reg 208. Other operations that may be specified include, but are not limited to: the result 215 is the passed-through value of the mux-reg output 209; the result 215 is the passed-through value of the weight word 203; the result 215 is zero; the result 215 is the sum of the accumulator 202 value 217 and the weight 203; the result 215 is the sum of the accumulator 202 value 217 and the mux-reg output 209; the result 215 is the maximum of the accumulator 202 value 217 and the weight 203; and the result 215 is the maximum of the accumulator 202 value 217 and the mux-reg output 209.
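The list of operations above amounts to a small opcode table over the three ALU inputs. A behavioral sketch follows; the operation names are illustrative and are not the patent's instruction encoding:

```python
def alu_204(op, acc, weight, data):
    """One ALU 204 step: combine the accumulator value 217 (acc), the
    weight word 203 (weight), and the mux-reg data word 209 (data)
    into a result 215 according to the selected operation."""
    ops = {
        "mul_acc":     lambda: acc + weight * data,  # multiply-accumulate
        "pass_data":   lambda: data,
        "pass_weight": lambda: weight,
        "zero":        lambda: 0,
        "add_weight":  lambda: acc + weight,
        "add_data":    lambda: acc + data,
        "max_weight":  lambda: max(acc, weight),
        "max_data":    lambda: max(acc, data),
    }
    return ops[op]()

print(alu_204("mul_acc", acc=10, weight=3, data=4))   # 22
print(alu_204("max_data", acc=10, weight=3, data=40)) # 40
```

The max operations are what later enable pooling, and the pass-through operations enable initialization and data movement without arithmetic.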
The ALU 204 provides its output 215 to the accumulator 202 for storage. The ALU 204 includes a multiplier 242 that multiplies the weight word 203 by the data word on the output 209 of the mux-reg 208 to generate a product 246. In one embodiment, the multiplier 242 multiplies two 16-bit operands to generate a 32-bit result. The ALU 204 also includes an adder 244 that adds the product 246 to the output 217 of the accumulator 202 to generate a sum, which is the result 215 of the accumulating operation stored in the accumulator 202. In one embodiment, the adder 244 adds the 32-bit result of the multiplier 242 to a 41-bit value 217 of the accumulator 202 to generate a 41-bit result. In this manner, using the rotator characteristic of the mux-regs 208 over the course of multiple clock cycles, the NPU 126 accomplishes the sum of products required by the neurons of a neural network. The ALU 204 may also include other circuit elements to perform other arithmetic/logical operations, such as those described above. In one embodiment, a second adder subtracts the weight word 203 from the data word on the output 209 of the mux-reg 208 to generate a difference, and the adder 244 then adds the difference to the output 217 of the accumulator 202 to generate a result 215, which is the accumulated result in the accumulator 202. In this manner, over the course of multiple clock cycles, the NPU 126 accomplishes a sum of differences. Preferably, although the weight word 203 and the data word 209 are the same size (in bits), they may have different binary point locations, as described in more detail below. In a preferred embodiment, the multiplier 242 and the adder 244 are integer multipliers and adders, which makes the ALU 204 lower complexity, smaller, faster and lower power than an ALU that employs floating-point arithmetic. However, in other embodiments of the invention, the ALU 204 may also perform floating-point operations.
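The datapath widths quoted above (16×16-bit products accumulated into a 41-bit accumulator) can be checked for overflow safety directly. The sketch below models only the bit widths, assuming two's-complement signed operands and the 512-product worst case mentioned later in this disclosure:

```python
# Width check: can a 41-bit signed accumulator hold 512 signed
# 16x16-bit products without overflow?
P_MAX = (-2**15) * (-2**15)            # largest product magnitude: 2^30
worst_sum = 512 * P_MAX                # 2^39, the worst-case positive sum

acc_min, acc_max = -2**40, 2**40 - 1   # 41-bit two's-complement range
assert acc_min <= worst_sum <= acc_max
assert acc_min <= -512 * P_MAX         # most negative case also fits

print(worst_sum.bit_length())          # 40 -> fits in 41 bits with sign
```

This is why the accumulator can be narrower than the naive 32 + 9 = 41 bound would even suggest is tight: 512 products add at most 9 bits of growth to a 32-bit product, and the sign bit is shared.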
Although Fig. 2 shows only a multiplier 242 and an adder 244 within the ALU 204, in a preferred embodiment the ALU 204 includes other elements to perform the other operations described above. For example, the ALU 204 may include a comparator (not shown) that compares the accumulator 202 with the data/weight word, and a multiplexer (not shown) that selects the larger (maximum) of the two values indicated by the comparator for storage in the accumulator 202. In another example, the ALU 204 includes selection logic (not shown) that bypasses the multiplier 242 with the data/weight word, so that the adder 244 adds the data/weight word to the accumulator 202 value 217 to generate a sum for storage in the accumulator 202. These additional operations are described in more detail below, e.g., with respect to Figs. 18 through 29A, and are useful for performing, for example, convolution and pooling operations.
The AFU 212 receives the output 217 of the accumulator 202. The AFU 212 performs an activation function on the output 217 of the accumulator 202 to generate the result 133 of Fig. 1. Generally speaking, the activation function in a neuron of an intermediate layer of an artificial neural network serves to normalize the accumulated sum of products, advantageously in a non-linear fashion. To "normalize" the accumulated sum, the activation function of the current neuron produces a resulting value within a range of values that the neurons connected to the current neuron expect to receive as input. (The normalized result is sometimes referred to as an "activation"; as used herein, an activation is the output of a current node, which a receiving node multiplies by the weight associated with the connection between the output node and the receiving node to generate a product that is accumulated with the products associated with the other input connections to the receiving node.) For example, where the receiving/connected neurons expect to receive as input a value between 0 and 1, the outputting neuron may need to non-linearly squash and/or adjust (e.g., shift upward to transform negative values to positive values) an accumulated sum outside the 0-to-1 range so that it falls within the expected range. Thus, the operation the AFU 212 performs on the accumulator 202 value 217 brings the result 133 within a known range. The results 133 of all N NPUs 126 may be written back concurrently to either the data RAM 122 or the weight RAM 124. In a preferred embodiment, the AFU 212 is configured to perform multiple activation functions, and an input, e.g., from the control register 127, selects one of the activation functions to perform on the accumulator 202 output 217. The activation functions may include, but are not limited to, the step function, the rectify function, the sigmoid function, the hyperbolic tangent (tanh) function, and the softplus function (also referred to as smooth rectify). The analytical form of the softplus function is f(x) = ln(1 + e^x), that is, the natural logarithm of the sum of 1 and e^x, where "e" is Euler's number and x is the input 217 to the function. In a preferred embodiment, the activation functions also include a pass-through function that passes through the accumulator 202 value 217, or a portion thereof, as described in more detail below. In one embodiment, the circuitry of the AFU 212 performs the activation function in a single clock cycle. In one embodiment, the AFU 212 includes tables that receive the accumulated value and output, for certain activation functions, e.g., sigmoid, hyperbolic tangent and softplus, a value that closely approximates the value the true activation function would provide.
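The activation functions named above can be written out directly, and the table-based approximation the AFU uses for sigmoid, tanh and softplus can be mimicked by precomputing a function over a coarse grid. This is a behavioral sketch; the grid range and resolution here are arbitrary illustrative choices, not the hardware's:

```python
import math

def softplus(x):                 # f(x) = ln(1 + e^x), "smooth rectify"
    return math.log1p(math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Coarse lookup table for sigmoid, as an AFU table might hold internally.
LO, HI, STEPS = -8.0, 8.0, 256
TABLE = [sigmoid(LO + i * (HI - LO) / STEPS) for i in range(STEPS + 1)]

def sigmoid_lut(x):
    """Clamp to the table range, then round to the nearest table entry."""
    i = round((min(max(x, LO), HI) - LO) * STEPS / (HI - LO))
    return TABLE[i]

print(sigmoid_lut(0.0))                         # 0.5
print(abs(sigmoid_lut(1.3) - sigmoid(1.3)))     # small: table is coarse but close
```

Clamping is reasonable here because sigmoid saturates: beyond |x| = 8 the true function differs from its limit by less than about 3.4e-4, well under the table's own quantization error.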
In a preferred embodiment, the width (in bits) of the accumulator 202 is greater than the width of the AFU 212 output 133. For example, in one embodiment the accumulator is 41 bits wide, to avoid loss of precision in the accumulation of up to 512 32-bit products (as described in more detail below with respect to Fig. 30), and the result 133 is 16 bits wide. In one embodiment, on subsequent clock cycles the AFU 212 passes through other, raw portions of the accumulator 202 output 217, which may be written back to the data RAM 122 or the weight RAM 124, as described in more detail below with respect to Fig. 8. This enables the raw accumulator 202 values to be loaded back into the media registers 118 via MFNN instructions, so that instructions executing on the other execution units 112 of the processor 100 may perform complex activation functions that the AFU 212 is unable to perform, such as the well-known softmax function, also referred to as the normalized exponential function. In one embodiment, the instruction set architecture of the processor 100 includes an instruction that performs the exponential function, commonly denoted e^x or exp(x), which may be used by the other execution units 112 of the processor 100 to speed up the performance of the softmax activation function.
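Softmax, the function the paragraph above offloads to the other execution units, normalizes a vector of raw accumulator values into a probability distribution. A minimal, numerically stable sketch:

```python
import math

def softmax(xs):
    """Normalized exponential: exp(x_i) / sum_j exp(x_j).
    Subtracting max(xs) first avoids overflow in exp() and leaves
    the result mathematically unchanged."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)          # ~[0.659, 0.242, 0.099], summing to 1.0
```

Because every output depends on the exponentials of all inputs, softmax is a poor fit for a per-neuron AFU, which is precisely why raw accumulator values are exported for it.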
In one embodiment, the NPU 126 is pipelined. For example, the NPU 126 may include registers of the ALU 204, such as a register between the multiplier and the adder and/or between other circuits of the ALU 204, and may also include a register that holds the output of the AFU 212. Other embodiments of the NPU 126 are described below.
Fig. 3 is a block diagram illustrating the operation of the N multiplexed registers (mux-regs) 208 of the N neural processing units 126 of the neural network unit 121 of Fig. 1 as an N-word rotator, or circular shifter, on a row of data words 207 received from the data RAM 122 of Fig. 1. In the embodiment of Fig. 3, N is 512; thus, the neural network unit 121 has 512 mux-regs 208, denoted 0 through 511, corresponding to the 512 neural processing units 126. Each mux-reg 208 receives its corresponding data word 207 from one of the D rows of the data RAM 122. That is, mux-reg 0 receives data word 0 of the data RAM 122 row, mux-reg 1 receives data word 1, mux-reg 2 receives data word 2, and so on, through mux-reg 511, which receives data word 511 of the data RAM 122 row. Additionally, mux-reg 1 receives the output 209 of mux-reg 0 as its other input 211, mux-reg 2 receives the output 209 of mux-reg 1 as its other input 211, mux-reg 3 receives the output 209 of mux-reg 2 as its other input 211, and so on, through mux-reg 511, which receives the output 209 of mux-reg 510 as its other input 211, while mux-reg 0 receives the output 209 of mux-reg 511 as its other input 211. Each mux-reg 208 receives a control input 213 that controls whether it selects the data word 207 or the rotated input 211. In this mode of operation, during a first clock cycle, the control input 213 controls each mux-reg 208 to select the data word 207 for storage in the register and subsequent provision to the ALU 204; during subsequent clock cycles (e.g., the M-1 clock cycles described above), the control input 213 controls each mux-reg 208 to select the rotated input 211 for storage in the register and subsequent provision to the ALU 204.
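The mux-reg arrangement of Fig. 3 can be modeled in software. The following Python sketch is an illustration of the described behavior, not the hardware itself: N mux-regs either load a row of data words in parallel (the control input 213 selecting the data word 207) or rotate their contents by one position per clock (the control input 213 selecting the rotated input 211). N is reduced to 8 for brevity.

```python
# Illustrative software model of the Fig. 3 mux-reg rotator (N = 8 for
# brevity; the described embodiment uses N = 512). Each "clock", every
# mux-reg either loads its data word from the data RAM row (control
# input selects input 207) or takes the value held by its neighbor
# (control input selects the rotated input 211).

def step(regs, row=None):
    """One clock: load a RAM row if given, else rotate NPU J -> NPU J+1."""
    if row is not None:                     # control 213 selects data word 207
        return list(row)
    n = len(regs)                           # control 213 selects rotated input 211
    return [regs[(j - 1) % n] for j in range(n)]

N = 8
ram_row = [f"w{j}" for j in range(N)]       # one row of data words
regs = step([None] * N, ram_row)            # first clock: parallel load
assert regs[0] == "w0" and regs[7] == "w7"

regs = step(regs)                           # second clock: rotate by one
assert regs[0] == "w7" and regs[1] == "w0"  # mux-reg 0 now holds mux-reg N-1's word

for _ in range(N - 1):                      # after N rotations total, the
    regs = step(regs)                       # original row comes back around
assert regs == ram_row
```

After N rotation clocks, every value has visited every position once and the registers hold the original row again, which is what allows each of the N neurons to see all N data words.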
Although in the embodiment of Fig. 3 (and of Figs. 7 and 19 below) the neural processing units 126 rotate the mux-reg 208/705 values to the right, i.e., from neural processing unit J to neural processing unit J+1, the invention is not limited in this respect. In other embodiments (e.g., the embodiment of Figs. 24 through 26), the neural processing units 126 rotate the mux-reg 208/705 values to the left, i.e., from neural processing unit J to neural processing unit J-1. Furthermore, in still other embodiments, the neural processing units 126 may rotate the mux-reg 208/705 values selectively to the left or to the right, as specified, for example, by the neural network unit instructions.
Fig. 4 is a table illustrating a program stored in the program memory 129 of the neural network unit 121 of Fig. 1 and executed by the neural network unit 121. As described above, this example program performs the computations associated with one layer of an artificial neural network. The table of Fig. 4 includes a row for each instruction and three columns. Each row corresponds to an address in the program memory 129, shown in the first column. The second column specifies the instruction, and the third column indicates the number of clock cycles associated with the instruction. For a preferred embodiment, the clock cycle count indicates the effective number of clocks per instruction in a pipelined execution, rather than the instruction latency. As shown, because the neural network unit 121 is pipelined, each instruction has an associated clock cycle; the instruction at address 2 is an exception, since it effectively repeats itself 511 times and thus consumes 511 clock cycles, as described below.
All of the neural processing units 126 process each instruction of the program in parallel. That is, all N neural processing units 126 execute the instruction in the first row in the same clock cycle(s), all N neural processing units 126 execute the instruction in the second row in the same clock cycle(s), and so on. However, the invention is not limited in this respect; in other embodiments described below, some instructions are executed in a partly parallel, partly sequential fashion. For example, in an embodiment in which multiple neural processing units 126 share an activation function unit, such as the embodiment of Fig. 11, the activation function and output instructions at addresses 3 and 4 are executed in this fashion. The example of Fig. 4 assumes a layer of 512 neurons (neural processing units 126), each with 512 connection inputs from the 512 neurons of the previous layer, for a total of 256K connections. Each neuron receives a 16-bit data value from each connection input and multiplies it by an appropriate 16-bit weight value.
The first row, at address 0 (although it may be located at other addresses), specifies an initialize NPU instruction. The initialize instruction clears the accumulator 202 to zero. In one embodiment, the initialize instruction can also specify that the accumulator 202 be loaded with the corresponding word of a row of the data RAM 122 or the weight RAM 124 specified by the instruction. The initialize instruction can also load configuration values into the control register 127, as described in more detail below with respect to Figs. 29A and 29B. For example, the widths of the data word 207 and the weight word 206 may be loaded, which may be used by the ALU 204 to determine the sizes of the operations performed by its circuits and which may affect the result 215 stored to the accumulator 202. In one embodiment, the neural processing unit 126 includes a circuit that saturates the ALU 204 output 215 before it is stored to the accumulator 202, and the initialize instruction loads a configuration value into this circuit that affects the saturating operation. In one embodiment, the accumulator 202 may also be cleared to zero by so specifying in an ALU function instruction (e.g., the multiply-accumulate instruction at address 1) or in an output instruction (e.g., the write AFU output instruction at address 4).
The second row, at address 1, specifies a multiply-accumulate instruction that instructs the 512 neural processing units 126 to load a corresponding data word from a row of the data RAM 122 and a corresponding weight word from a row of the weight RAM 124, and to perform a first multiply-accumulate operation on the data word input 207 and the weight word input 206, i.e., an accumulation with the initialized accumulator 202 zero value. More specifically, the instruction instructs the sequencer 128 to generate a value on the control input 213 that selects the data word input 207. In the example of Fig. 4, the specified data RAM 122 row is row 17 and the specified weight RAM 124 row is row 0, so the sequencer is instructed to output the value 17 as the data RAM address 123 and the value 0 as the weight RAM address 125. Consequently, the 512 data words of row 17 of the data RAM 122 are provided as the corresponding data inputs 207 of the 512 neural processing units 126, and the 512 weight words of row 0 of the weight RAM 124 are provided as the corresponding weight inputs 206 of the 512 neural processing units 126.
The third row, at address 2, specifies a multiply-accumulate rotate instruction with a count whose value is 511, which instructs the 512 neural processing units 126 to perform 511 multiply-accumulate operations. The instruction instructs the 512 neural processing units 126 that, for each of the 511 multiply-accumulate operations, the data word 209 input to the ALU 204 is to be the rotated value 211 from the adjacent neural processing unit 126. That is, the instruction instructs the sequencer 128 to generate a value on the control input 213 that selects the rotated value 211. Additionally, the instruction instructs the 512 neural processing units 126 to load the corresponding weight value for each of the 511 multiply-accumulate operations from the "next" row of the weight RAM 124. That is, the instruction instructs the sequencer 128 to increment the weight RAM address 125 by one relative to its value in the previous clock cycle, which in this example is row 1 in the first clock cycle of the instruction, row 2 in the next clock cycle, row 3 in the next, and so on, through row 511 in the 511th clock cycle. For each of the 511 multiply-accumulate operations, the product of the rotated input 211 and the weight word input 206 is accumulated with the previous value of the accumulator 202. The 512 neural processing units 126 perform the 511 multiply-accumulate operations in 511 clock cycles, in which each neural processing unit 126 performs a multiply-accumulate operation on a different data word of row 17 of the data RAM 122, namely, the data word on which the adjacent neural processing unit 126 operated in the previous clock cycle, as well as the different weight word associated with that data word, which conceptually is a different connection input of the neuron. The example assumes that each neural processing unit 126 (neuron) has 512 connection inputs, hence the processing of 512 data words and 512 weight words. Once the last iteration of the multiply-accumulate rotate instruction of row 2 completes, the accumulator 202 contains the sum of the products of all 512 connection inputs. In one embodiment, rather than having a separate instruction for each type of ALU operation (e.g., multiply-accumulate, maximum of accumulator and weight, and so forth as described above), the instruction set of the neural processing units 126 includes an "execute" instruction that instructs the ALU 204 to perform the ALU operation specified by the initialize NPU instruction, such as the operation specified in the ALU function 2926 of Fig. 29A.
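The net effect of addresses 0 through 2 of the program of Fig. 4 is that accumulator J ends up holding the sum, over all inputs i, of data word i times the weight stored at weight RAM row (J - i) mod N, word J. The following Python sketch models this, under the simplifying assumptions of plain integer arithmetic (rather than the 16-bit words of the embodiment) and N reduced from 512 to 8:

```python
# Software model of the Fig. 4 program (addresses 0-2): N NPUs perform
# an initialize, one multiply-accumulate from the data RAM row and
# weight RAM row 0, then N-1 multiply-accumulate-rotate iterations.
N = 8
data = [j + 1 for j in range(N)]                     # data RAM row 17
weights = [[(r * N + j) % 7 for j in range(N)]       # weight RAM rows 0..N-1
           for r in range(N)]

acc = [0] * N                                        # address 0: initialize
regs = list(data)                                    # address 1: load data row
for r in range(N):                                   # N multiply-accumulates
    for j in range(N):
        acc[j] += regs[j] * weights[r][j]            # weight RAM row r, word j
    regs = [regs[(j - 1) % N] for j in range(N)]     # rotate NPU J -> J+1

# Accumulator J holds the sum over inputs i of data[i] times the weight
# stored at weight RAM row (J - i) mod N, word J.
for j in range(N):
    expected = sum(data[i] * weights[(j - i) % N][j] for i in range(N))
    assert acc[j] == expected
```

The check at the end makes the weight layout explicit: neuron J's weight for connection input i must be placed at weight RAM row (J - i) mod N, word J, which is how the rotating data row and the sequentially advancing weight RAM address combine into a full dot product per neuron.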
The fourth row, at address 3, specifies an activation function instruction. The activation function instruction instructs the activation function unit 212 to perform the specified activation function on the accumulator 202 value to generate the result 133. Activation function embodiments are described in more detail below.
The fifth row, at address 4, specifies a write AFU output instruction that instructs the 512 neural processing units 126 to write back their activation function unit 212 outputs 133 as results to a row of the data RAM 122, which in this example is row 16. That is, the instruction instructs the sequencer 128 to output the value 16 as the data RAM address 123, along with a write command (in contrast to the read command in the case of the multiply-accumulate instruction at address 1). For a preferred embodiment, because of the nature of pipelined execution, the write AFU output instruction may be executed concurrently with other instructions, so that it effectively executes in a single clock cycle.
For a preferred embodiment, each neural processing unit 126 is a pipeline with various functional elements, e.g., the mux-reg 208 (and the mux-reg 705 of Fig. 7), the ALU 204, the accumulator 202, the activation function unit 212, the multiplexer 802 (see Fig. 8), the row buffer 1104 and the activation function units 1112 (see Fig. 11), and so forth, some of which may themselves be pipelined. In addition to the data words 207 and weight words 206, the pipeline receives the instructions from the program memory 129. The instructions flow down the pipeline and control the various functional units. In an alternate embodiment, the activation function instruction is not included in the program; rather, the initialize NPU instruction specifies the activation function to be performed on the accumulator 202 value 217, and a value indicating the specified activation function is saved in a configuration register for later use by the activation function unit 212 portion of the pipeline once the final accumulator 202 value 217 has been generated, i.e., once the last execution of the multiply-accumulate rotate instruction at address 2 has completed. For a preferred embodiment, to save power, the activation function unit 212 portion of the pipeline is inactive until the write AFU output instruction arrives, at which time the activation function unit 212 is powered up and performs the activation function on the accumulator 202 output 217 specified by the initialize instruction.
Fig. 5 is a timing diagram illustrating the execution of the program of Fig. 4 by the neural network unit 121. Each row of the timing diagram corresponds to a successive clock cycle indicated in the first column. Each of the other columns corresponds to a different one of the 512 neural processing units 126 and indicates its operation. For simplicity, only the operations of neural processing units 0, 1 and 511 are shown.
At clock 0, each of the 512 neural processing units 126 performs the initialize instruction of Fig. 4, which is illustrated in Fig. 5 by the assignment of a zero value to the accumulator 202.
At clock 1, each of the 512 neural processing units 126 performs the multiply-accumulate instruction at address 1 of Fig. 4. As shown, neural processing unit 0 accumulates the accumulator 202 value (i.e., zero) with the product of word 0 of row 17 of the data RAM 122 and word 0 of row 0 of the weight RAM 124; neural processing unit 1 accumulates the accumulator 202 value (i.e., zero) with the product of word 1 of row 17 of the data RAM 122 and word 1 of row 0 of the weight RAM 124; and so on, through neural processing unit 511, which accumulates the accumulator 202 value (i.e., zero) with the product of word 511 of row 17 of the data RAM 122 and word 511 of row 0 of the weight RAM 124.
At clock 2, each of the 512 neural processing units 126 performs a first iteration of the multiply-accumulate rotate instruction at address 2 of Fig. 4. As shown, neural processing unit 0 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 511 (i.e., data word 511 received from the data RAM 122) and word 0 of row 1 of the weight RAM 124; neural processing unit 1 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 0 (i.e., data word 0 received from the data RAM 122) and word 1 of row 1 of the weight RAM 124; and so on, through neural processing unit 511, which accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 510 (i.e., data word 510 received from the data RAM 122) and word 511 of row 1 of the weight RAM 124.
At clock 3, each of the 512 neural processing units 126 performs a second iteration of the multiply-accumulate rotate instruction at address 2 of Fig. 4. As shown, neural processing unit 0 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 511 (i.e., data word 510 received from the data RAM 122) and word 0 of row 2 of the weight RAM 124; neural processing unit 1 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 0 (i.e., data word 511 received from the data RAM 122) and word 1 of row 2 of the weight RAM 124; and so on, through neural processing unit 511, which accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 510 (i.e., data word 509 received from the data RAM 122) and word 511 of row 2 of the weight RAM 124. As indicated by the ellipsis in Fig. 5, this continues for each of the following 509 clock cycles, through clock 512.
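The pattern of Fig. 5 can be stated compactly: in clock cycle k (k = 1 through 512 of the multiply-accumulate sequence), neural processing unit J multiplies data word (J - (k - 1)) mod 512 by word J of weight RAM row k - 1. A small Python check of this formula against the entries called out above (illustrative only, derived from the rotation direction of Fig. 3):

```python
# Which data word NPU J consumes in multiply-accumulate clock k of the
# Fig. 4 program (k = 1 is the multiply-accumulate at address 1;
# k = 2..512 are the rotate iterations at address 2).
def data_word(j, k, n=512):
    return (j - (k - 1)) % n

# Clock 1 (Fig. 5): every NPU J sees its own data word J.
assert data_word(0, 1) == 0 and data_word(511, 1) == 511
# Clock 2: NPU 0 sees word 511, NPU 1 sees word 0, NPU 511 sees word 510.
assert data_word(0, 2) == 511 and data_word(1, 2) == 0 and data_word(511, 2) == 510
# Clock 3: NPU 0 sees word 510, NPU 1 sees word 511, NPU 511 sees word 509.
assert data_word(0, 3) == 510 and data_word(1, 3) == 511 and data_word(511, 3) == 509
# Clock 512: NPU 0 sees word 1, NPU 1 sees word 2, NPU 511 sees word 0.
assert data_word(0, 512) == 1 and data_word(1, 512) == 2 and data_word(511, 512) == 0
```

Over the 512 clocks, each neural processing unit therefore sees all 512 data words exactly once, each paired with the weight RAM row read in that clock.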
At clock 512, each of the 512 neural processing units 126 performs a 511th iteration of the multiply-accumulate rotate instruction at address 2 of Fig. 4. As shown, neural processing unit 0 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 511 (i.e., data word 1 received from the data RAM 122) and word 0 of row 511 of the weight RAM 124; neural processing unit 1 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 0 (i.e., data word 2 received from the data RAM 122) and word 1 of row 511 of the weight RAM 124; and so on, through neural processing unit 511, which accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 510 (i.e., data word 0 received from the data RAM 122) and word 511 of row 511 of the weight RAM 124. In one embodiment, multiple clock cycles are required to read the data words and weight words from the data RAM 122 and the weight RAM 124 in order to perform the multiply-accumulate instruction at address 1 of Fig. 4; however, the data RAM 122, the weight RAM 124 and the neural processing units 126 are pipelined such that, once the first multiply-accumulate operation has begun (as shown in clock 1 of Fig. 5), subsequent multiply-accumulate operations (as shown in clocks 2 through 512 of Fig. 5) begin in successive clock cycles. For a preferred embodiment, the neural processing units 126 may be briefly stalled in response to an access of the data RAM 122 and/or weight RAM 124 by an architectural instruction, such as a MTNN or MFNN instruction (described below with respect to Figs. 14 and 15), or a microinstruction into which such an architectural instruction is translated.
At clock 513, the activation function unit 212 of each of the 512 neural processing units 126 performs the activation function instruction at address 3 of Fig. 4. Finally, at clock 514, each of the 512 neural processing units 126 performs the write AFU output instruction at address 4 of Fig. 4 by writing back its result 133 to the corresponding word of row 16 of the data RAM 122; that is, the result 133 of neural processing unit 0 is written to word 0 of the data RAM 122, the result 133 of neural processing unit 1 is written to word 1 of the data RAM 122, and so on, through the result 133 of neural processing unit 511, which is written to word 511 of the data RAM 122. A block diagram corresponding to the operation of Fig. 5 is shown in Fig. 6A.
Fig. 6A is a block diagram illustrating the neural network unit 121 of Fig. 1 executing the program of Fig. 4. The neural network unit 121 includes the 512 neural processing units 126, the data RAM 122 that receives the address input 123, and the weight RAM 124 that receives the address input 125. At clock 0, the 512 neural processing units 126 perform the initialize instruction, which is not shown in the figure. As shown, at clock 1, the 512 16-bit data words of row 17 are read from the data RAM 122 and provided to the 512 neural processing units 126. During clocks 1 through 512, the 512 16-bit weight words of rows 0 through 511, respectively, are read from the weight RAM 124 and provided to the 512 neural processing units 126. At clock 1, the 512 neural processing units 126 perform their respective multiply-accumulate operations on the loaded data words and weight words, which is not shown in the figure. During clocks 2 through 512, the mux-regs 208 of the 512 neural processing units 126 operate as a rotator of 512 16-bit words to rotate the previously loaded data words of row 17 of the data RAM 122 to the adjacent neural processing unit 126, and the neural processing units 126 perform the multiply-accumulate operation on the respective rotated data word and the respective weight word loaded from the weight RAM 124. At clock 513, the 512 activation function units 212 perform the activation instruction, which is not shown in the figure. At clock 514, the 512 neural processing units 126 write back their respective 512 16-bit results 133 to row 16 of the data RAM 122.
As shown, the number of clock cycles required to generate the result words (neuron outputs) and write them back to the data RAM 122 or weight RAM 124 is approximately the square root of the number of data inputs (connections) received by the current layer of the neural network. For example, if the current layer has 512 neurons each with 512 connections from the previous layer, the total number of connections is 256K, and the number of clocks required to generate the results for the current layer is slightly more than 512. Thus, the neural network unit 121 provides extremely high performance for neural network computations.
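The efficiency claim can be verified with simple arithmetic based on the clock-by-clock account of Fig. 5 (an illustrative tally, not part of the patent disclosure):

```python
import math

neurons = 512
inputs_per_neuron = 512                  # connections from the previous layer
links = neurons * inputs_per_neuron      # 262,144 connections, i.e., 256K

# Clocks in the Fig. 5 timing diagram: initialize (1) + first multiply-
# accumulate (1) + 511 rotate iterations + activation (1) + write (1).
clocks = 1 + 1 + (neurons - 1) + 1 + 1

assert links == 256 * 1024
assert clocks == 515                     # "slightly more than 512"
assert math.isqrt(links) == 512          # square root of the link count
```

That is, 256K multiply-accumulates complete in roughly sqrt(256K) = 512 clocks because 512 of them execute in parallel each clock.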
Fig. 6B is a flowchart illustrating the operation of the processor 100 of Fig. 1 to perform an architectural program that uses the neural network unit 121 to perform multiply-accumulate-activation-function computations classically associated with the neurons of hidden layers of an artificial neural network, such as performed by the program of Fig. 4. The example of Fig. 6B assumes four hidden layers (signified by the initialization of the variable NUM_LAYERS at step 602), each having 512 neurons, each of which is connected to all 512 neurons of the previous layer (by use of the program of Fig. 4). However, it should be understood that these numbers of layers and neurons were chosen for illustrative purposes; the neural network unit 121 may be employed to perform similar computations for embodiments with a different number of hidden layers, embodiments with a different number of neurons per layer, and embodiments in which the neurons are not fully connected. In one embodiment, the weight values for nonexistent neurons in a layer, or for nonexistent connections to a neuron, may be set to zero. For a preferred embodiment, the architectural program writes a first set of weights to the weight RAM 124 and starts the neural network unit 121, and while the neural network unit 121 is performing the computations associated with the first layer, the architectural program writes a second set of weights to the weight RAM 124, so that as soon as the neural network unit 121 completes the computations for the first hidden layer, it can start on the second layer. In this fashion, the architectural program alternates between two regions of the weight RAM 124 in order to keep the neural network unit 121 fully utilized. Flow begins at step 602.
At step 602, the processor 100 executing the architectural program writes the input values for the current hidden layer of neurons into the data RAM 122, i.e., into row 17 of the data RAM 122, as described with respect to Fig. 6A. Alternatively, the values may already be in row 17 of the data RAM 122 as results 133 of the operation of the neural network unit 121 for a previous layer (e.g., convolution, pooling, or input layer). The architectural program also initializes a variable N to a value of 1. The variable N denotes the current layer of the hidden layers being processed by the neural network unit 121. Additionally, the architectural program initializes a variable NUM_LAYERS to a value of 4, since there are four hidden layers in the example. Flow proceeds to step 604.
At step 604, the processor 100 writes the weight words for layer 1 to the weight RAM 124, e.g., to rows 0 through 511 as shown in Fig. 6A. Flow proceeds to step 606.
At step 606, the processor 100 writes the multiply-accumulate-activation-function program (as shown in Fig. 4) to the program memory 129 of the neural network unit 121, using MTNN instructions 1400 that specify a function 1432 to write the program memory 129. The processor 100 then starts the neural network unit program using a MTNN instruction 1400 that specifies a function 1432 to start execution of the program. Flow proceeds to step 608.
At decision step 608, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, flow proceeds to step 612; otherwise, flow proceeds to step 614.
At step 612, the processor 100 writes the weight words for layer N+1 to the weight RAM 124, e.g., to rows 512 through 1023. Thus, advantageously, the architectural program writes the weight words for the next layer to the weight RAM 124 while the neural network unit 121 is performing the hidden layer computations for the current layer, so that the neural network unit 121 can immediately start performing the hidden layer computations for the next layer once the computations for the current layer are complete, i.e., written to the data RAM 122. Flow proceeds to step 614.
At step 614, the processor 100 determines whether the currently running neural network unit program (started at step 606 in the case of layer 1, and at step 618 in the case of layers 2 through 4) has completed. For a preferred embodiment, the processor 100 determines this by executing a MFNN instruction 1500 to read the status register 127 of the neural network unit 121. In an alternate embodiment, the neural network unit 121 generates an interrupt to indicate that it has completed the multiply-accumulate-activation-function layer program. Flow proceeds to decision step 616.
At decision step 616, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, flow proceeds to step 618; otherwise, flow proceeds to step 622.
At step 618, the processor 100 updates the multiply-accumulate-activation-function program so that it can perform the hidden layer computations for layer N+1. More specifically, the processor 100 updates the data RAM 122 row value of the multiply-accumulate instruction at address 1 of Fig. 4 to the row of the data RAM 122 to which the previous layer wrote its results (e.g., to row 16) and also updates the output row (e.g., to row 15). The processor 100 then starts the updated neural network unit program. Alternatively, the program of Fig. 4 specifies the same row in the output instruction at address 4 as the row specified by the multiply-accumulate instruction at address 1 (i.e., the row read from the data RAM 122). In this embodiment, the current row of input data words is overwritten, which is acceptable as long as the row of data words is not needed for some other purpose, because the row of data words has already been read into the mux-regs 208 and is being rotated among the neural processing units 126 via the N-word rotator. In this case, no update of the neural network unit program is needed at step 618; it is simply restarted. Flow proceeds to step 622.
At step 622, the processor 100 reads the results of the neural network unit program for layer N from the data RAM 122. However, if the results are simply to be used by the next layer, the architectural program need not read them from the data RAM 122; instead, they may remain in the data RAM 122 for use in the next hidden layer computation. Flow proceeds to decision step 624.
At decision step 624, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, flow proceeds to step 626; otherwise, the flow ends.
At step 626, the architectural program increments N by one. Flow then returns to decision step 608.
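The flow of Fig. 6B amounts to a software pipeline: while the neural network unit computes layer N, the processor stages layer N+1's weights into the other half of the weight RAM. The following Python sketch models only the control flow; FakeNNU and its method names are illustrative stand-ins for the MTNN/MFNN instruction sequences, not an actual API:

```python
# Schematic model of the Fig. 6B architectural program: double-buffer
# the weight RAM so uploading the weights for layer N+1 overlaps the
# NNU's computation of layer N. All names here are hypothetical.
NUM_LAYERS = 4

class FakeNNU:
    """Records the order of operations so the overlap can be inspected."""
    def __init__(self):
        self.log = []
    def write_data_ram(self, row):
        self.log.append(("data", row))          # step 602
    def write_weights(self, layer, region):
        self.log.append(("weights", layer, region))
    def start_program(self):
        self.log.append(("start",))             # step 606: MTNN start
    def wait_done(self):
        self.log.append(("wait",))              # step 614: poll MFNN status
    def update_and_restart(self, layer):
        self.log.append(("restart", layer))     # step 618
    def read_results(self):
        self.log.append(("read",))              # step 622
        return self.log

def run_hidden_layers(nnu, input_row=17):
    nnu.write_data_ram(input_row)               # step 602: input values
    nnu.write_weights(layer=1, region=0)        # step 604: layer 1 weights
    nnu.start_program()                         # step 606
    for n in range(1, NUM_LAYERS + 1):
        if n < NUM_LAYERS:                      # step 612: stage the next
            nnu.write_weights(layer=n + 1, region=n % 2)   # layer's weights
        nnu.wait_done()                         # step 614
        if n < NUM_LAYERS:
            nnu.update_and_restart(layer=n + 1) # step 618
    return nnu.read_results()

log = run_hidden_layers(FakeNNU())
# Layer 2's weights are staged before waiting on layer 1's computation.
assert log.index(("weights", 2, 1)) < log.index(("wait",))
```

The alternating `region` argument models the two halves of the weight RAM (e.g., rows 0 through 511 and rows 512 through 1023) between which the architectural program alternates.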
As may be seen from the example of Fig. 6B, approximately every 512 clock cycles the neural processing units 126 read from and write to the data RAM 122 once (by virtue of the operation of the program of Fig. 4). Additionally, the neural processing units 126 read the weight RAM 124 approximately every clock cycle to read a row of weight words. Thus, substantially the entire bandwidth of the weight RAM 124 is consumed by the neural network unit 121 as it performs the hidden layer operation in this fashion. Additionally, assuming an embodiment that includes a write and read buffer, such as the buffer 1704 of Fig. 17, concurrently with the reads by the neural processing units 126, the processor 100 writes the weight RAM 124 such that the buffer 1704 performs a write to the weight RAM 124 approximately every 16 clock cycles to write the weight words. Thus, in a single-ported embodiment of the weight RAM 124 (as described with respect to Fig. 17), approximately every 16 clock cycles the neural processing units 126 must be stalled from reading the weight RAM 124 so that the buffer 1704 can write the weight RAM 124. However, in an embodiment in which the weight RAM 124 is dual-ported, the neural processing units 126 need not be stalled.
Fig. 7 is a block diagram illustrating another embodiment of the neural processing unit 126 of Fig. 1. The neural processing unit 126 of Fig. 7 is similar to the neural processing unit 126 of Fig. 2. However, the neural processing unit 126 of Fig. 7 additionally includes a second dual-input mux-reg 705. The mux-reg 705 selects one of its inputs 206 or 711 to store in its register and provide on its output 203 in a subsequent clock cycle. The input 206 receives the weight word from the weight RAM 124. The other input 711 receives the output 203 of the second mux-reg 705 of the adjacent neural processing unit 126. For a preferred embodiment, the input 711 of neural processing unit J receives the mux-reg 705 output 203 of the neural processing unit 126 at position J-1, and the output 203 of neural processing unit J is provided to the input 711 of the mux-reg 705 of the neural processing unit 126 at position J+1. In this fashion, the mux-regs 705 of the N neural processing units 126 collectively operate as an N-word rotator, in a manner similar to that described above with respect to Fig. 3, but for the weight words rather than the data words. The control input 213 controls which of the two inputs the mux-reg 705 selects for storage in its register and subsequent provision on its output 203.
By employing the mux-regs 208 and/or mux-regs 705 (as well as the mux-regs of other embodiments, such as those of Figs. 18 and 23) to effectively form a large rotator that rotates a row of data/weights received from the data RAM 122 and/or weight RAM 124, the NNU 121 avoids the need for a very large multiplexer between the data RAM 122 and/or weight RAM 124 to provide the needed data/weight word to the appropriate NPU.
Writing back accumulator values in addition to activation function results
For some applications, it is useful for the processor 100 to receive back (e.g., to the media registers 118 via the MFNN instruction of Fig. 15) the raw accumulator 202 value 217, upon which instructions executing on other execution units 112 can perform computations. For example, in one embodiment, the activation function unit (AFU) 212 is not configured to perform a softmax activation function, in order to reduce the complexity of the AFU 212. Accordingly, the NNU 121 may output the raw accumulator 202 value 217, or a subset thereof, to the data RAM 122 or weight RAM 124, which the architectural program subsequently reads from the data RAM 122 or weight RAM 124 and upon which it performs computations on the raw values. However, use of the raw accumulator 202 value 217 is not limited to performing softmax; other uses are contemplated by the present invention.
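The division of labor described above, in which the NNU writes back raw accumulator values and the architectural program computes softmax on them, can be sketched as follows. This is only an illustration of the arithmetic the architectural program would perform; the real program would retrieve the values via MFNN instructions and operate on the hardware's fixed-point formats rather than Python floats.

```python
import math

def softmax(raw_accumulators):
    """Softmax computed by the architectural program on raw accumulator
    values 217 read back from the data RAM, rather than by the AFU 212."""
    m = max(raw_accumulators)                  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in raw_accumulators]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-12           # probabilities sum to one
assert probs[0] > probs[1] > probs[2]          # ordering of raw sums is preserved
```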
Fig. 8 is a block diagram illustrating yet another embodiment of an NPU 126 of Fig. 1. The NPU 126 of Fig. 8 is similar in many respects to the NPU 126 of Fig. 2. However, the NPU 126 of Fig. 8 includes a multiplexer 802 in its AFU 212, and the AFU 212 has a control input 803. The width (in bits) of the accumulator 202 is greater than the width of a data word. The multiplexer 802 has multiple inputs that receive data-word-width portions of the accumulator 202 output 217. In one embodiment, the width of the accumulator 202 is 41 bits and the NPU 126 is configured to output a 16-bit result word 133; thus, for example, the multiplexer 802 (or the multiplexer 3032 and/or multiplexer 3037 of Fig. 30) has three inputs that receive bits [15:0], bits [31:16], and bits [47:32], respectively, of the accumulator 202 output 217. Preferably, output bits not provided by the accumulator 202 (e.g., bits [47:41]) are forced to zero.
In response to a write accumulator instruction, such as the write accumulator instructions at addresses 3 through 5 of Fig. 9, the sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the words (e.g., 16 bits) of the accumulator 202. Preferably, the multiplexer 802 also has one or more inputs that receive the outputs of activation function circuits (e.g., elements 3022, 3024, 3026, 3018, 3014 and 3016 of Fig. 30), which generate outputs that are the width of a data word. In response to an instruction such as the activation function unit output instruction at address 4 of Fig. 4, the sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the activation function circuit outputs rather than one of the words of the accumulator 202.
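The word selection performed by the multiplexer 802 can be modeled as extracting 16-bit fields of the zero-extended accumulator. The field positions follow the bit ranges quoted above; the function itself is a behavioral sketch, not hardware.

```python
def accumulator_word(acc_41bit, select):
    """Model of multiplexer 802: select one 16-bit portion of the
    accumulator output 217.  select 0/1/2 -> bits [15:0], [31:16],
    [47:32]; bits [47:41] do not exist in a 41-bit accumulator and
    therefore read as zero."""
    padded = acc_41bit & ((1 << 41) - 1)      # non-existent bits [47:41] forced to 0
    return (padded >> (16 * select)) & 0xFFFF

acc = 0x1_2345_6789                           # an example 41-bit accumulator value
assert accumulator_word(acc, 0) == 0x6789     # bits [15:0]
assert accumulator_word(acc, 1) == 0x2345     # bits [31:16]
assert accumulator_word(acc, 2) == 0x0001     # only bits [40:32] carry data
```

Selecting the three fields on three successive clocks is exactly what the write accumulator instructions at addresses 3 through 5 of Fig. 9 do.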
Fig. 9 is a table illustrating a program stored in the program memory 129 of the NNU 121 of Fig. 1 and executed by the NNU 121. The example program of Fig. 9 is similar to the program of Fig. 4. In particular, the instructions at addresses 0 through 2 are identical. However, the instructions at addresses 3 and 4 of Fig. 4 are replaced in Fig. 9 by write accumulator instructions that instruct the 512 NPUs 126 to write back their accumulator 202 output 217 as results 133 to three rows of the data RAM 122, which in this example are rows 16 through 18. That is, the write accumulator instruction instructs the sequencer 128 to output a data RAM address 123 value of 16 and a write command in the first clock cycle, a data RAM address 123 value of 17 and a write command in the second clock cycle, and a data RAM address 123 value of 18 and a write command in the third clock cycle. Preferably, the execution of the write accumulator instruction may be overlapped with the execution of other instructions, such that the write accumulator instruction effectively executes in three clock cycles, one for each row of the data RAM 122 written. In one embodiment, the user specifies values of the activation function 2934 and output command 2956 fields of the control register 127 (of Fig. 29A) to accomplish the writing of the desired portions of the accumulator 202 to the data RAM 122 or weight RAM 124. Alternatively, rather than writing back the entire contents of the accumulator 202, the write accumulator instruction may optionally write back a subset of the accumulator 202. In one embodiment, a canonical form of the accumulator 202 may be written back, as described in more detail below with respect to Figs. 29 through 31.
Fig. 10 is a timing diagram illustrating the execution of the program of Fig. 9 by the NNU 121. The timing diagram of Fig. 10 is similar to the timing diagram of Fig. 5, and clocks 0 through 512 are the same. However, at each of clocks 513 through 515, the AFU 212 of each of the 512 NPUs 126 executes one of the write accumulator instructions at addresses 3 through 5 of Fig. 9. Specifically, at clock 513, each of the 512 NPUs 126 writes back bits [15:0] of its accumulator 202 output 217 as its result 133 to the corresponding word of row 16 of the data RAM 122; at clock 514, each of the 512 NPUs 126 writes back bits [31:16] of its accumulator 202 output 217 as its result 133 to the corresponding word of row 17 of the data RAM 122; and at clock 515, each of the 512 NPUs 126 writes back bits [40:32] of its accumulator 202 output 217 as its result 133 to the corresponding word of row 18 of the data RAM 122. Preferably, bits [47:41] are forced to zero.
Shared activation function units
Fig. 11 is a block diagram illustrating an embodiment of the NNU 121 of Fig. 1. In the embodiment of Fig. 11, a neuron is split into two portions, the activation function unit portion and the ALU portion (the latter of which also includes the shift register portion), and each activation function unit portion is shared by multiple ALU portions. In Fig. 11, the ALU portions are referred to as NPUs 126, and the shared activation function unit portions are referred to as AFUs 1112. This contrasts with the embodiment of, e.g., Fig. 2, in which each neuron includes its own AFU 212. Hence, for example, in one version of the Fig. 11 embodiment, an NPU 126 (ALU portion) includes the accumulator 202, ALU 204, mux-reg 208 and register 205 of Fig. 2, but not the AFU 212. In the embodiment of Fig. 11, the NNU 121 includes 512 NPUs 126, although the invention is not limited thereto. In the example of Fig. 11, the 512 NPUs 126 are grouped into 64 groups of eight NPUs 126 each, denoted groups 0 through 63 in Fig. 11.
The NNU 121 also includes a row buffer 1104 and a plurality of shared AFUs 1112 coupled between the NPUs 126 and the row buffer 1104. The width (in bits) of the row buffer 1104 is the same as a row of the data RAM 122 or weight RAM 124, e.g., 512 words. There is one AFU 1112 per NPU 126 group, i.e., each AFU 1112 corresponds to an NPU 126 group; thus, in the embodiment of Fig. 11, there are 64 AFUs 1112 corresponding to the 64 NPU 126 groups. Each of the eight NPUs 126 of a group shares the AFU 1112 corresponding to the group. Other embodiments with different numbers of AFUs and with different numbers of NPUs per group are contemplated; for example, embodiments in which two, four or sixteen NPUs 126 in each group share an AFU 1112.
Sharing the AFUs 1112 helps reduce the size of the NNU 121. The size reduction sacrifices some performance. That is, depending upon the sharing ratio, additional clock cycles may be required to generate the results 133 for the entire array of NPUs 126; for example, as shown in Fig. 12 below, seven additional clock cycles are required in the case of an 8:1 sharing ratio. However, generally speaking, the additional number of clocks (e.g., 7) is relatively small compared to the number of clocks required to generate the accumulated sums (e.g., 512 clocks for a layer that has 512 connections per neuron). Hence, the relatively small performance impact of sharing the AFUs (e.g., an increase of approximately one percent of computation time) may be a worthwhile cost for the reduced size of the NNU 121.
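The cost/benefit arithmetic of the paragraph above can be checked directly, using the figures quoted in the text (512 accumulation clocks for a 512-connection layer, plus 7 extra clocks at an 8:1 sharing ratio):

```python
accumulate_clocks = 512          # one clock per connection for a 512-input layer
extra_afu_clocks = 7             # added by 8:1 sharing of the activation units

overhead = extra_afu_clocks / (accumulate_clocks + extra_afu_clocks)
print(f"extra time from sharing: {overhead:.2%}")   # roughly 1.3 percent
assert overhead < 0.02           # consistent with the "about one percent" claim
```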
In one embodiment, each NPU 126 includes an AFU 212 that performs relatively simple activation functions; these simple AFUs 212 are small enough to be included within each NPU 126. Conversely, the shared complex AFUs 1112 perform relatively complex activation functions, and their size is significantly greater than that of the simple AFUs 212. In such an embodiment, the additional clock cycles are required only when a complex activation function that must be performed by a shared complex AFU 1112 is specified, but are avoided when an activation function is specified that the simple AFU 212 is configured to perform.
Figs. 12 and 13 are timing diagrams illustrating the execution of the program of Fig. 4 by the NNU 121 of Fig. 11. The timing diagram of Fig. 12 is similar to the timing diagram of Fig. 5, and clocks 0 through 512 are the same. However, the operation at clock 513 is different because the NPUs 126 of Fig. 11 share the AFUs 1112; that is, the NPUs 126 of a group share the AFU 1112 associated with the group, and Fig. 11 illustrates this sharing.
Each row of the timing diagram of Fig. 13 corresponds to a successive clock cycle indicated in the first column. Each of the other columns corresponds to a different one of the 64 AFUs 1112 and indicates its operation. Only the operations of AFUs 0, 1 and 63 are shown to simplify the illustration. The clock cycles of Fig. 13 correspond to the clock cycles of Fig. 12, but illustrate in a different manner the sharing of the AFUs 1112 by the NPUs 126. As shown in Fig. 13, at clocks 0 through 512, each of the 64 AFUs 1112 is inactive while the NPUs 126 execute the initialize NPU instruction, the multiply-accumulate instruction, and the multiply-accumulate rotate instructions.
As shown in both Figs. 12 and 13, at clock 513, AFU 0 (the AFU 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of NPU 0, which is the first NPU 126 in group 0, and the output of the AFU 1112 will be stored to word 0 of the row buffer 1104. Also at clock 513, each of the AFUs 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the first NPU 126 in its corresponding group of NPUs 126. Thus, as shown in Fig. 13, at clock 513, AFU 0 begins to perform the specified activation function on the accumulator 202 of NPU 0 to generate the result that will be stored to word 0 of the row buffer 1104; AFU 1 begins to perform the specified activation function on the accumulator 202 of NPU 8 to generate the result that will be stored to word 8 of the row buffer 1104; and so forth, AFU 63 begins to perform the specified activation function on the accumulator 202 of NPU 504 to generate the result that will be stored to word 504 of the row buffer 1104.
At clock 514, AFU 0 (the AFU 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of NPU 1, which is the second NPU 126 in group 0, and the output of the AFU 1112 will be stored to word 1 of the row buffer 1104. Also at clock 514, each of the AFUs 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the second NPU 126 in its corresponding group of NPUs 126. Thus, as shown in Fig. 13, at clock 514, AFU 0 begins to perform the specified activation function on the accumulator 202 of NPU 1 to generate the result that will be stored to word 1 of the row buffer 1104; AFU 1 begins to perform the specified activation function on the accumulator 202 of NPU 9 to generate the result that will be stored to word 9 of the row buffer 1104; and so forth, AFU 63 begins to perform the specified activation function on the accumulator 202 of NPU 505 to generate the result that will be stored to word 505 of the row buffer 1104. This pattern continues until clock 520, at which AFU 0 (the AFU 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of NPU 7, which is the eighth (and last) NPU 126 in group 0, and the output of the AFU 1112 will be stored to word 7 of the row buffer 1104. Also at clock 520, each of the AFUs 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the eighth NPU 126 in its corresponding group of NPUs 126. Thus, as shown in Fig. 13, at clock 520, AFU 0 begins to perform the specified activation function on the accumulator 202 of NPU 7 to generate the result that will be stored to word 7 of the row buffer 1104; AFU 1 begins to perform the specified activation function on the accumulator 202 of NPU 15 to generate the result that will be stored to word 15 of the row buffer 1104; and so forth, AFU 63 begins to perform the specified activation function on the accumulator 202 of NPU 511 to generate the result that will be stored to word 511 of the row buffer 1104.
At clock 521, once all 512 results of the 512 NPUs 126 have been generated and written to the row buffer 1104, the row buffer 1104 begins to write its contents to the data RAM 122 or weight RAM 124. In this fashion, the AFU 1112 of each group of NPUs 126 performs a portion of the activation function instruction at address 3 of Fig. 4.
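The round-robin schedule of Figs. 12 and 13 — each of the 64 shared AFUs walking through the eight accumulators of its group during clocks 513 through 520 — can be sketched as follows. The function and its clock numbering are an illustrative model of the described behavior, not hardware.

```python
NUM_NPUS, GROUP_SIZE = 512, 8
NUM_AFUS = NUM_NPUS // GROUP_SIZE            # 64 shared AFUs 1112

def afu_schedule(clock):
    """Which NPU each AFU serves at the given clock (513..520), per
    Fig. 13.  Returns a dict mapping AFU index -> NPU index."""
    step = clock - 513                       # 0 on the first post-accumulate clock
    return {afu: afu * GROUP_SIZE + step for afu in range(NUM_AFUS)}

assert afu_schedule(513)[0] == 0             # AFU 0 starts with NPU 0
assert afu_schedule(513)[63] == 504          # AFU 63 starts with NPU 504
assert afu_schedule(514)[1] == 9             # AFU 1 then serves NPU 9
assert afu_schedule(520)[63] == 511          # the last clock reaches NPU 511
```

Over the eight clocks, every one of the 512 accumulators is visited exactly once, which is why an 8:1 sharing ratio costs seven extra clocks relative to the unshared design.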
Sharing an AFU 1112 among a group of ALUs 204, as in the embodiment of Fig. 11, may be particularly advantageous in conjunction with integer ALUs 204, as described in more detail below, e.g., with respect to Figs. 29A through 33.
MTNN and MFNN architectural instructions
Fig. 14 is a block diagram illustrating a move to neural network (MTNN) architectural instruction 1400 and its operation with respect to portions of the NNU 121 of Fig. 1. The MTNN instruction 1400 includes an opcode field 1402, a src1 field 1404, a src2 field 1406, a gpr field 1408, and an immediate field 1412. The MTNN instruction is an architectural instruction, i.e., it is included in the instruction set architecture of the processor 100. Preferably, the instruction set architecture associates a predetermined value of the opcode field 1402 with the MTNN instruction 1400 to distinguish it from the other instructions of the instruction set architecture. The MTNN instruction 1400 opcode 1402 may or may not include prefixes, such as are common in the x86 architecture.
The immediate field 1412 provides a value that specifies a function 1432 to control logic 1434 of the NNU 121. Preferably, the function 1432 is provided as an immediate operand of a microinstruction 105 of Fig. 1. The functions 1432 that may be performed by the NNU 121 include, but are not limited to, writing to the data RAM 122, writing to the weight RAM 124, writing to the program memory 129, writing to the control register 127, starting execution of a program in the program memory 129, pausing the execution of a program in the program memory 129, requesting notification (e.g., an interrupt) of completion of the execution of a program in the program memory 129, and resetting the NNU 121. Preferably, the NNU instruction set includes an instruction whose result indicates that the NNU program is complete. Alternatively, the NNU instruction set includes an explicit generate interrupt instruction. Preferably, resetting the NNU 121 includes effectively forcing the NNU 121 back to a reset state (e.g., internal state machines are cleared and set to an idle state), except that the contents of the data RAM 122, weight RAM 124 and program memory 129 remain intact. Additionally, internal registers such as the accumulator 202 are not affected by the reset function and must be explicitly cleared, e.g., by the initialize NPU instruction at address 0 of Fig. 4. In one embodiment, the function 1432 may include a direct execution function in which the first source register contains a micro-operation (e.g., micro-operation 3418 of Fig. 34). The direct execution function instructs the NNU 121 to directly execute the specified micro-operation. In this manner, an architectural program may directly control the NNU 121 to perform operations, rather than writing instructions to the program memory 129 and subsequently instructing the NNU 121 to execute the instructions in the program memory 129 through the execution of MTNN instructions 1400 (or MFNN instructions 1500 of Fig. 15). Fig. 14 illustrates an example of the function of writing to the data RAM 122.
The gpr field 1408 specifies one of the GPRs in the general purpose register file 116. In one embodiment, each GPR is 64 bits. The general purpose register file 116 provides the value from the selected GPR to the NNU 121, as shown, which uses the value as an address 1422. The address 1422 selects a row of the memory specified in the function 1432. In the case of the data RAM 122 or weight RAM 124, the address 1422 additionally selects a chunk within the selected row that is twice the size of a media register (e.g., 512 bits). Preferably, the location is on a 512-bit boundary. In one embodiment, a multiplexer selects either the address 1422 (or the address 1422 in the case of an MFNN instruction 1500 described below) or the address 123/125/131 from the sequencer 128 for provision to the data RAM 122/weight RAM 124/program memory 129. In one embodiment, the data RAM 122 is dual-ported to allow the NPUs 126 to read/write the data RAM 122 concurrently with the media registers 118 reading/writing the data RAM 122. In one embodiment, the weight RAM 124 is also dual-ported for a similar purpose.
The src1 field 1404 and src2 field 1406 each specify a media register in the media register file 118. In one embodiment, each media register 118 is 256 bits. The media register file 118 provides the concatenated data (e.g., 512 bits) from the selected media registers to the data RAM 122 (or weight RAM 124 or program memory 129) for writing into the selected row 1428 specified by the address 1422 and into the location within the selected row 1428 specified by the address 1422, as shown. Through the execution of a series of MTNN instructions 1400 (and MFNN instructions 1500, described below), an architectural program executing on the processor 100 can populate rows of the data RAM 122 and rows of the weight RAM 124 and write a program to the program memory 129, such as the programs described herein (e.g., of Figs. 4 and 9), to cause the NNU 121 to perform operations on the data and weights at extremely high speeds to accomplish the artificial neural network. In one embodiment, the architectural program directly controls the NNU 121 rather than writing a program into the program memory 129.
In one embodiment, rather than specifying two source registers (as specified by fields 1404 and 1406), the MTNN instruction 1400 specifies a start source register and a number of source registers, Q. This form of the MTNN instruction 1400 instructs the processor 100 to write the media register 118 specified as the start source register, as well as the next Q-1 sequential media registers 118, to the NNU 121, i.e., to the specified data RAM 122 or weight RAM 124. Preferably, the instruction translator 104 translates the MTNN instruction 1400 into as many microinstructions as required to write all Q specified media registers 118. For example, in one embodiment, when the MTNN instruction 1400 specifies register MR4 as the start source register and a Q of 8, then the instruction translator 104 translates the MTNN instruction 1400 into four microinstructions, the first of which writes registers MR4 and MR5, the second of which writes registers MR6 and MR7, the third of which writes registers MR8 and MR9, and the fourth of which writes registers MR10 and MR11. In an alternate embodiment in which the data path from the media registers 118 to the NNU 121 is 1024 bits rather than 512, the instruction translator 104 translates the MTNN instruction 1400 into two microinstructions, the first of which writes registers MR4 through MR7, and the second of which writes registers MR8 through MR11. A similar embodiment is contemplated in which the MFNN instruction 1500 specifies a start destination register and a number of destination registers, so that each MFNN instruction 1500 may read a chunk of a row of the data RAM 122 or weight RAM 124 that is larger than a single media register 118.
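The translation of the start-register/Q form into register-pair microinstructions can be sketched as follows. The register names and path widths follow the examples in the text; the function itself is a simplified model of the instruction translator's behavior.

```python
def expand_mtnn(start_reg, q, regs_per_uop=2):
    """Model of instruction translator 104 splitting one MTNN instruction
    (start source register index + count Q) into microinstructions that
    each write `regs_per_uop` media registers over the data path."""
    assert q % regs_per_uop == 0
    return [tuple(f"MR{start_reg + i + j}" for j in range(regs_per_uop))
            for i in range(0, q, regs_per_uop)]

# MR4 with Q=8 over a 512-bit path: four microinstructions of two registers.
assert expand_mtnn(4, 8) == [("MR4", "MR5"), ("MR6", "MR7"),
                             ("MR8", "MR9"), ("MR10", "MR11")]
# With a 1024-bit path, each microinstruction writes four registers.
assert expand_mtnn(4, 8, regs_per_uop=4) == [("MR4", "MR5", "MR6", "MR7"),
                                             ("MR8", "MR9", "MR10", "MR11")]
```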
Fig. 15 is a block diagram illustrating a move from neural network (MFNN) architectural instruction 1500 and its operation with respect to portions of the NNU 121 of Fig. 1. The MFNN instruction 1500 includes an opcode field 1502, a dst field 1504, a gpr field 1508, and an immediate field 1512. The MFNN instruction is an architectural instruction, i.e., it is included in the instruction set architecture of the processor 100. Preferably, the instruction set architecture associates a predetermined value of the opcode field 1502 with the MFNN instruction 1500 to distinguish it from the other instructions of the instruction set architecture. The MFNN instruction 1500 opcode 1502 may or may not include prefixes, such as are common in the x86 architecture.
The immediate field 1512 provides a value that specifies a function 1532 to the control logic 1434 of the NNU 121. Preferably, the function 1532 is provided as an immediate operand of a microinstruction 105 of Fig. 1. The functions 1532 that may be performed by the NNU 121 include, but are not limited to, reading from the data RAM 122, reading from the weight RAM 124, reading from the program memory 129, and reading from the status register 127. The example of Fig. 15 illustrates the function 1532 of reading from the data RAM 122.
The gpr field 1508 specifies one of the GPRs in the general purpose register file 116. The general purpose register file 116 provides the value from the selected GPR to the NNU 121, as shown, which uses the value as an address 1522 that operates in a manner similar to the address 1422 of Fig. 14 to select a row of the memory specified in the function 1532. In the case of the data RAM 122 or weight RAM 124, the address 1522 additionally selects a chunk within the selected row that is the size of a media register (e.g., 256 bits). Preferably, the location is on a 256-bit boundary.
The dst field 1504 specifies a media register in the media register file 118. As shown, the media register file 118 receives the data (e.g., 256 bits) into the selected media register from the data RAM 122 (or weight RAM 124 or program memory 129), read from the selected row 1528 specified by the address 1522 and from the location within the selected row 1528 specified by the address 1522.
Port configurations of the NNU internal RAMs
Fig. 16 is a block diagram illustrating an embodiment of the data RAM 122 of Fig. 1. The data RAM 122 includes a memory array 1606, a read port 1602 and a write port 1604. The memory array 1606 holds the data words and, preferably, is arranged as D rows of N words, as described above. In one embodiment, the memory array 1606 comprises an array of 64 horizontally arranged static RAM cells in which each cell is 128 bits wide and 64 tall, thus providing a 64KB data RAM 122 that is 8192 bits wide and has 64 rows, and the data RAM 122 occupies approximately 0.2 square millimeters of die area. However, the present invention is not limited to these dimensions.
Preferably, the read port 1602 is coupled, in a multiplexed fashion, to the NPUs 126 and to the media registers 118. (More precisely, the media registers 118 may be coupled to the read port via result buses that may also provide data to a reorder buffer and/or result forwarding buses to the other execution units 112.) The NPUs 126 and the media registers 118 share the read port 1602 to read the data RAM 122. Also preferably, the write port 1604 is coupled, also in a multiplexed fashion, to the NPUs 126 and to the media registers 118. The NPUs 126 and the media registers 118 share the write port 1604 to write the data RAM 122. Thus, advantageously, the media registers 118 can write to the data RAM 122 while the NPUs 126 are reading from the data RAM 122, and the NPUs 126 can write to the data RAM 122 while the media registers 118 are reading from the data RAM 122. This may improve performance. For example, the NPUs 126 can read the data RAM 122 (e.g., to continue to perform computations) while the media registers 118 write more data words to the data RAM 122. For another example, the NPUs 126 can write computation results to the data RAM 122 while the media registers 118 read computation results from the data RAM 122. In one embodiment, the NPUs 126 can write a row of computation results to the data RAM 122 while also reading a row of data words from the data RAM 122. In one embodiment, the memory array 1606 is configured in banks. When the NPUs 126 access the data RAM 122, all of the banks are activated to access an entire row of the memory array 1606; whereas when the media registers 118 access the data RAM 122, only the specified banks are activated. In one embodiment, each bank is 128 bits wide and the media registers 118 are 256 bits wide; hence, for example, two banks are activated per media register 118 access. In one embodiment, one of the ports 1602/1604 is a read/write port. In one embodiment, both of the ports 1602/1604 are read/write ports.
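The bank-activation behavior described for the memory array 1606 can be sketched as follows. The bank count and widths follow the figures quoted in the text; the bank-aligned addressing scheme is an assumption made for illustration.

```python
ROW_BITS, BANK_BITS, MEDIA_BITS = 8192, 128, 256
NUM_BANKS = ROW_BITS // BANK_BITS            # 64 banks of 128 bits each

def banks_for_access(bit_offset, width):
    """Which banks must be activated for an access of `width` bits
    starting at `bit_offset` within a row (assumed bank-aligned)."""
    first = bit_offset // BANK_BITS
    return list(range(first, first + width // BANK_BITS))

# An NPU access touches the entire 8192-bit row: all 64 banks activate.
assert banks_for_access(0, ROW_BITS) == list(range(NUM_BANKS))
# A 256-bit media register access activates only two banks.
assert banks_for_access(512, MEDIA_BITS) == [4, 5]
```

Activating only the needed banks for the narrow media register accesses saves power relative to driving the full row on every access.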
An advantage of the ability of the NPUs 126 to function as a rotator, as described herein, is that it contributes to a reduction in the number of rows of the memory array 1606 of the data RAM 122, and therefore in its size, relative to what might otherwise be needed to ensure that the NPUs 126 are highly utilized — which requires that the architectural program (via the media registers 118) be able to continue providing data to the data RAM 122 and retrieving results from it while the NPUs 126 are performing computations.
Internal RAM buffer
Fig. 17 is a block diagram illustrating an embodiment of the weight RAM 124 of Fig. 1 and a buffer 1704. The weight RAM 124 includes a memory array 1706 and a port 1702. The memory array 1706 holds the weight words and, preferably, is arranged as W rows of N words, as described above. In one embodiment, the memory array 1706 comprises an array of 128 horizontally arranged static RAM cells in which each cell is 64 bits wide and 2048 tall, thus providing a 2MB weight RAM 124 that is 8192 bits wide and has 2048 rows, and the weight RAM 124 occupies approximately 2.4 square millimeters of die area. However, the present invention is not limited to these dimensions.
In a preferred embodiment, the port 1702 is coupled to the neural processing units 126 and to the buffer 1704, and the neural processing units 126 and the buffer 1704 read and write the weight random access memory 124 through the port 1702. The buffer 1704 is also coupled to the media registers 118 of Fig. 1, so that the media registers 118 read and write the weight random access memory 124 through the buffer 1704. The advantage of this arrangement is that while the neural processing units 126 are reading or writing the weight random access memory 124, the media registers 118 can concurrently write to or read from the buffer 1704 (although, if the neural processing units 126 are executing, they are preferably stalled to avoid accessing the weight random access memory 124 while the buffer 1704 is accessing it). This arrangement can improve performance, particularly because the reads and writes of the weight random access memory 124 by the media registers 118 are relatively much smaller than the reads and writes by the neural processing units 126. For example, in one embodiment the media registers 118 are only 256 bits wide and each MTNN instruction 1400 writes two media registers 118, i.e., 512 bits. Thus, in the case in which the architectural program executes sixteen MTNN instructions 1400 to fill the buffer 1704, a conflict between the neural processing units 126 and the architectural program for access to the weight random access memory 124 occurs less than approximately six percent of the time.
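One plausible reading of the six-percent figure is that a single weight-RAM access (the buffer flush) occurs for every sixteen buffer-filling MTNN instructions, i.e., roughly one conflict opportunity in seventeen instruction slots. The short Python check below follows that reading; the derivation itself is an assumption, since the text does not spell it out:

```python
# Rough check of the "< ~6%" figure; the 1-in-17 reading is an assumption.
media_register_bits = 256
mtnn_write_bits = 2 * media_register_bits        # each MTNN writes two registers
buffer_bits = 8192                               # one weight-RAM row

mtnn_per_fill = buffer_bits // mtnn_write_bits   # 16 instructions to fill the buffer
conflict_fraction = 1 / (mtnn_per_fill + 1)      # one flush per 17 slots
print(f"{conflict_fraction:.1%}")                # → 5.9%
```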
In another embodiment, the instruction translator 104 translates an MTNN instruction 1400 into two microinstructions 105, each of which writes a single media register 118 into the buffer 1704; in this case, the frequency of conflicts between the neural processing units 126 and the architectural program for access to the weight random access memory 124 is reduced even further.
In an embodiment that includes the buffer 1704, writing the weight random access memory 124 from the architectural program requires multiple MTNN instructions 1400. One or more MTNN instructions 1400 first specify a function 1432 to write specified data blocks of the buffer 1704, and then an MTNN instruction 1400 specifies a function 1432 that instructs the neural network unit 121 to write the contents of the buffer 1704 into a selected row of the weight random access memory 124. The size of a single data block is twice the number of bits of a media register 118, and the data blocks are naturally aligned within the buffer 1704. In one embodiment, each MTNN instruction 1400 that specifies a function 1432 to write specified data blocks of the buffer 1704 includes a bitmask having a bit corresponding to each data block of the buffer 1704. The data from the two specified source media registers 118 is written into each data block of the buffer 1704 whose corresponding bit in the bitmask is set. This embodiment is helpful when a row of the weight random access memory 124 contains repeated data values. For example, to zero out the buffer 1704 (and subsequently a row of the weight random access memory 124), the programmer can load zero into the source registers and set all the bits of the bitmask. Additionally, the bitmask enables the programmer to write only selected data blocks of the buffer 1704, leaving the other data blocks with their previous data.
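The bitmasked write can be modeled in a few lines of Python. This is a hypothetical software sketch (the block count and the values are invented), not the hardware behavior itself; it shows how one masked write replicates a value into many blocks while leaving unmasked blocks untouched:

```python
# Hypothetical model of the bitmasked buffer write; sizes and values invented.
NUM_BLOCKS = 16

def write_buffer(buffer, block_data, bitmask):
    """Write block_data into every buffer block whose mask bit is set."""
    for i in range(NUM_BLOCKS):
        if bitmask & (1 << i):
            buffer[i] = block_data
    return buffer

buf = [0xAA] * NUM_BLOCKS                         # previous buffer contents
write_buffer(buf, 0x00, (1 << NUM_BLOCKS) - 1)    # all bits set: zero every block
write_buffer(buf, 0x55, 0b0000_0000_0000_0011)    # update only blocks 0 and 1
```

The second call illustrates the replication case (one source value, all mask bits set); the third illustrates the partial-update case, where blocks 2 through 15 keep their previous (zeroed) data.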
In an embodiment that includes the buffer 1704, reading the weight random access memory 124 from the architectural program requires multiple MFNN instructions 1500. An initial MFNN instruction 1500 specifies a function 1532 to load a specified row of the weight random access memory 124 into the buffer 1704, and then one or more MFNN instructions 1500 specify a function 1532 to read a specified data block of the buffer 1704 into a destination register. The size of a single data block is the number of bits of a media register 118, and the data blocks are naturally aligned within the buffer 1704. The technical features of the present invention apply equally to other embodiments, for example embodiments in which the weight random access memory 124 has multiple buffers 1704; these further reduce the conflicts between the neural processing units 126 and the architectural program for access to the weight random access memory 124 by increasing the number of accesses the architectural program can perform while the neural processing units 126 are executing, thereby increasing the likelihood that the buffer 1704 accesses can be performed during clock cycles in which the neural processing units 126 do not need to access the weight random access memory 124.
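The two-step read path can likewise be sketched in Python (the row number, block count, and contents are invented for illustration):

```python
# Hypothetical model of the two-step MFNN read path; sizes/contents invented.
BLOCKS_PER_ROW = 32                    # e.g. 8192-bit row / 256-bit media register

weight_ram = {5: list(range(BLOCKS_PER_ROW))}   # pretend row 5 holds these blocks

buffer = list(weight_ram[5])           # first MFNN 1500: load row 5 into the buffer
block7 = buffer[7]                     # later MFNN 1500: read block 7 to a register
```

The point of the split is that only the first step touches the weight RAM; the per-block reads that follow hit only the buffer.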
Figure 16 describes a dual-ported data random access memory 122; however, the invention is not limited to this embodiment. The technical features of the present invention apply equally to embodiments in which the weight random access memory 124 also has a dual-ported design. Additionally, Figure 17 describes a buffer used in conjunction with the weight random access memory 124, but the invention is likewise not limited to this embodiment; the technical features of the present invention apply equally to embodiments in which the data random access memory 122 has a corresponding buffer similar to the buffer 1704.
Dynamically Configurable Neural Processing Units
Figure 18 is a block diagram illustrating the dynamically configurable neural processing unit 126 of Fig. 1. The neural processing unit 126 of Figure 18 is similar to the neural processing unit 126 of Fig. 2; however, the neural processing unit 126 of Figure 18 is dynamically configurable to operate in one of two different configurations. In the first configuration, the neural processing unit 126 of Figure 18 operates similarly to the neural processing unit 126 of Fig. 2. That is, in the first configuration, denoted here the "wide" configuration or "single" configuration, the arithmetic logic unit 204 of the neural processing unit 126 performs operations on a single wide data word and a single wide weight word (e.g., 16 bits) to produce a single wide result. In contrast, in the second configuration, denoted here the "narrow" configuration or "dual" configuration, the neural processing unit 126 performs operations on two narrow data words and two respective narrow weight words (e.g., 8 bits) to produce two respective narrow results. In one embodiment, the configuration (wide or narrow) of the neural processing unit 126 is established by the initialize neural processing unit instruction (e.g., the instruction at address 0 of Figure 20, described below). Alternatively, the configuration can be established by an MTNN instruction whose function 1432 specifies that the neural processing unit is to be set to the configuration (wide or narrow). In a preferred embodiment, the program memory 129 instruction or the MTNN instruction that determines the configuration (wide or narrow) populates a configuration register. For example, the output of the configuration register is provided to the arithmetic logic unit 204, to the activation function unit 212, and to the logic that generates the multiplexed register control signal 213. Generally speaking, the elements of the neural processing unit 126 of Figure 18 that have the same numbering as elements of Fig. 2 perform similar functions, and reference may be made thereto in order to understand the embodiment of Figure 18. The embodiment of Figure 18, including its differences from Fig. 2, is described below.
The neural processing unit 126 of Figure 18 includes two registers 205A and 205B, two three-input multiplexed registers 208A and 208B, an arithmetic logic unit 204, two accumulators 202A and 202B, and two activation function units 212A and 212B. Each of the registers 205A/205B has half the width (e.g., 8 bits) of the register 205 of Fig. 2. Each of the registers 205A/205B receives a respective narrow weight word 206A/206B (e.g., 8 bits) from the weight random access memory 124 and provides its output 203A/203B on a subsequent clock cycle to the operand selection logic 1898 of the arithmetic logic unit 204. When the neural processing unit 126 is in the wide configuration, the registers 205A/205B effectively operate together to receive a wide weight word 206A/206B (e.g., 16 bits) from the weight random access memory 124, similarly to the register 205 of the embodiment of Fig. 2; when the neural processing unit 126 is in the narrow configuration, the registers 205A/205B effectively operate individually, each receiving a narrow weight word 206A/206B (e.g., 8 bits) from the weight random access memory 124, such that the neural processing unit 126 is effectively two separate narrow neural processing units. Nevertheless, regardless of the configuration of the neural processing unit 126, the same output bits of the weight random access memory 124 are coupled to and provided to the registers 205A/205B. For example, the register 205A of neural processing unit 0 receives byte 0, the register 205B of neural processing unit 0 receives byte 1, the register 205A of neural processing unit 1 receives byte 2, the register 205B of neural processing unit 1 receives byte 3, and so forth, such that the register 205B of neural processing unit 511 receives byte 1023.
Each of the multiplexed registers 208A/208B has half the width (e.g., 8 bits) of the register 208 of Fig. 2. The multiplexed register 208A selects one of its inputs 207A, 211A, and 1811A to store in its register and then provides it on its output 209A on a subsequent clock cycle, and the multiplexed register 208B selects one of its inputs 207B, 211B, and 1811B to store in its register and then provides it on its output 209B on a subsequent clock cycle to the operand selection logic 1898. The input 207A receives a narrow data word (e.g., 8 bits) from the data random access memory 122, and the input 207B likewise receives a narrow data word from the data random access memory 122. When the neural processing unit 126 is in the wide configuration, the multiplexed registers 208A/208B effectively operate together to receive a wide data word 207A/207B (e.g., 16 bits) from the data random access memory 122, similarly to the multiplexed register 208 of the embodiment of Fig. 2; when the neural processing unit 126 is in the narrow configuration, the multiplexed registers 208A/208B effectively operate individually, each receiving a narrow data word 207A/207B (e.g., 8 bits) from the data random access memory 122, such that the neural processing unit 126 is effectively two separate narrow neural processing units. Nevertheless, regardless of the configuration of the neural processing unit 126, the same output bits of the data random access memory 122 are coupled to and provided to the multiplexed registers 208A/208B. For example, the multiplexed register 208A of neural processing unit 0 receives byte 0, the multiplexed register 208B of neural processing unit 0 receives byte 1, the multiplexed register 208A of neural processing unit 1 receives byte 2, the multiplexed register 208B of neural processing unit 1 receives byte 3, and so forth, such that the multiplexed register 208B of neural processing unit 511 receives byte 1023.
The input 211A receives the output 209A of the multiplexed register 208A of the adjacent neural processing unit 126, and the input 211B receives the output 209B of the multiplexed register 208B of the adjacent neural processing unit 126. The input 1811A receives the output 209B of the multiplexed register 208B of the adjacent neural processing unit 126, and the input 1811B receives the output 209A of the multiplexed register 208A of the instant neural processing unit 126. The neural processing unit 126 shown in Figure 18 is one of the N neural processing units 126 of Fig. 1 and is denoted neural processing unit J; that is, neural processing unit J is a representative instance of the N neural processing units. In a preferred embodiment, the input 211A of the multiplexed register 208A of neural processing unit J receives the output 209A of the multiplexed register 208A of neural processing unit 126 instance J-1, the input 1811A of the multiplexed register 208A of neural processing unit J receives the output 209B of the multiplexed register 208B of neural processing unit 126 instance J-1, and the output 209A of the multiplexed register 208A of neural processing unit J is provided both to the input 211A of the multiplexed register 208A of neural processing unit 126 instance J+1 and to the input 1811B of the multiplexed register 208B of neural processing unit 126 instance J; the input 211B of the multiplexed register 208B of neural processing unit J receives the output 209B of the multiplexed register 208B of neural processing unit 126 instance J-1, the input 1811B of the multiplexed register 208B of neural processing unit J receives the output 209A of the multiplexed register 208A of neural processing unit 126 instance J, and the output 209B of the multiplexed register 208B of neural processing unit J is provided both to the input 1811A of the multiplexed register 208A of neural processing unit 126 instance J+1 and to the input 211B of the multiplexed register 208B of neural processing unit 126 instance J+1.
The control input 213 controls each of the multiplexed registers 208A/208B to select one of its three inputs to store in its respective register and subsequently provide on the respective output 209A/209B. When the neural processing unit 126 is instructed to load a row from the data random access memory 122 (e.g., by the multiply-accumulate instruction at address 1 of Figure 20, described below), regardless of whether the neural processing unit 126 is in the wide configuration or the narrow configuration, the control input 213 controls each of the multiplexed registers 208A/208B to select its respective narrow data word 207A/207B (e.g., 8 bits) from the corresponding narrow word of the selected row of the data random access memory 122.
When the neural processing unit 126 receives an indication that it is to rotate the previously received data row values (e.g., from the multiply-accumulate rotate instruction at address 2 of Figure 20, described below), if the neural processing unit 126 is in the narrow configuration, the control input 213 controls each of the multiplexed registers 208A/208B to select its respective input 1811A/1811B. In that case, the multiplexed registers 208A/208B effectively operate individually, such that the neural processing unit 126 effectively operates as two separate narrow neural processing units. In this manner, the multiplexed registers 208A and 208B of the N neural processing units 126 collectively operate as a rotator of 2N narrow words, as described in more detail with respect to Figure 19 below.
When the neural processing unit 126 receives an indication that it is to rotate the previously received data row values, if the neural processing unit 126 is in the wide configuration, the control input 213 controls each of the multiplexed registers 208A/208B to select its respective input 211A/211B. In that case, the multiplexed registers 208A/208B operate together, effectively as if the neural processing unit 126 were a single wide neural processing unit 126. In this manner, the multiplexed registers 208A and 208B of the N neural processing units 126 collectively operate as a rotator of N wide words, in a manner similar to that described with respect to Fig. 3.
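The two rotation modes can be modeled in Python. In the sketch below (scaled down to four neural processing units, with invented word values), `a` and `b` stand for the outputs 209A and 209B of the multiplexed registers 208A and 208B; `rotate_wide` models selecting inputs 211A/211B and `rotate_narrow` models selecting inputs 1811A/1811B:

```python
# Scaled-down model of the two rotation modes; values are invented.
def rotate_wide(a, b):
    """Inputs 211A/211B: each (a, b) pair moves intact to the next unit."""
    return a[-1:] + a[:-1], b[-1:] + b[:-1]

def rotate_narrow(a, b):
    """Inputs 1811A/1811B: each 208B takes the 209A output of its own unit,
    and each 208A takes the 209B output of the previous unit, so all 2N
    narrow words circulate around one ring."""
    return b[-1:] + b[:-1], a[:]

# Four units; a[j]/b[j] model the 209A/209B outputs of unit j:
a = [0, 2, 4, 6]
b = [1, 3, 5, 7]
na, nb = rotate_narrow(a, b)
ring = [w for pair in zip(na, nb) for w in pair]  # → [7, 0, 1, 2, 3, 4, 5, 6]
```

Interleaving the rotated lists shows that `rotate_narrow` shifts all 2N narrow words by one position around a single ring, whereas `rotate_wide` moves the N pairs (0,1), (2,3), ... intact to the next unit.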
The arithmetic logic unit 204 includes the operand selection logic 1898, a wide multiplier 242A, a narrow multiplier 242B, a wide two-input multiplexer 1896A, a narrow two-input multiplexer 1896B, a wide adder 244A, and a narrow adder 244B. Effectively, the arithmetic logic unit 204 may be regarded as comprising the operand selection logic 1898, a wide arithmetic logic unit 204A (comprising the wide multiplier 242A, the wide multiplexer 1896A, and the wide adder 244A), and a narrow arithmetic logic unit 204B (comprising the narrow multiplier 242B, the narrow multiplexer 1896B, and the narrow adder 244B). In a preferred embodiment, the wide multiplier 242A multiplies two wide words, similarly to the multiplier 242 of Fig. 2, e.g., a 16-bit by 16-bit multiplier. The narrow multiplier 242B multiplies two narrow words, e.g., an 8-bit by 8-bit multiplier that produces a 16-bit result. When the neural processing unit 126 is in the narrow configuration, the wide multiplier 242A, with the assistance of the operand selection logic 1898, is fully utilized as a narrow multiplier to multiply two narrow words, so that the neural processing unit 126 effectively functions as two narrow neural processing units. In a preferred embodiment, the wide adder 244A adds the output of the wide multiplexer 1896A and the output 217A of the wide accumulator 202A to produce a sum 215A for the wide accumulator 202A, and its operation is similar to that of the adder 244 of Fig. 2. The narrow adder 244B adds the output of the narrow multiplexer 1896B and the output 217B of the narrow accumulator 202B to produce a sum 215B for the narrow accumulator 202B. In one embodiment, the narrow accumulator 202B is 28 bits wide, to avoid loss of precision when accumulating up to 1024 16-bit products. When the neural processing unit 126 is in the wide configuration, the narrow multiplier 242B, the narrow accumulator 202B, and the narrow activation function unit 212B are preferably inactive to reduce power consumption.
The operand selection logic 1898 selects operands from 209A, 209B, 203A, and 203B to provide to the other elements of the arithmetic logic unit 204, as described below. In a preferred embodiment, the operand selection logic 1898 also has other functions, for example performing sign extension of signed data words and weight words. For example, if the neural processing unit 126 is in the narrow configuration, the operand selection logic 1898 sign-extends a narrow data word and a narrow weight word to the width of a wide word before providing them to the wide multiplier 242A. Similarly, if the arithmetic logic unit 204 is instructed to pass through a narrow data/weight word (skipping the wide multiplier 242A via the wide multiplexer 1896A), the operand selection logic 1898 sign-extends the narrow data word or weight word to the width of a wide word before providing it to the wide adder 244A. In a preferred embodiment, this sign-extension logic is also present within the arithmetic logic unit 204 of the neural processing unit 126 of Fig. 2.
The wide multiplexer 1896A receives the output of the wide multiplier 242A and an operand from the operand selection logic 1898 and selects one of these inputs to provide to the wide adder 244A, and the narrow multiplexer 1896B receives the output of the narrow multiplier 242B and an operand from the operand selection logic 1898 and selects one of these inputs to provide to the narrow adder 244B.
The operand selection logic 1898 provides operands depending upon the configuration of the neural processing unit 126 and upon the arithmetic and/or logical operations to be performed by the arithmetic logic unit 204, which are determined by the function specified by the instruction being executed by the neural processing unit 126. For example, if the instruction instructs the arithmetic logic unit 204 to perform a multiply-accumulate operation and the neural processing unit 126 is in the wide configuration, the operand selection logic 1898 provides the wide word formed by the concatenation of outputs 209A and 209B to one input of the wide multiplier 242A and the wide word formed by the concatenation of outputs 203A and 203B to the other input, and the narrow multiplier 242B is inactive; thus, the neural processing unit 126 operates as a single wide neural processing unit 126 similar to the neural processing unit 126 of Fig. 2. Whereas if the instruction instructs the arithmetic logic unit 204 to perform a multiply-accumulate operation and the neural processing unit 126 is in the narrow configuration, the operand selection logic 1898 provides an extended, or widened, version of the narrow data word 209A to one input of the wide multiplier 242A and an extended version of the narrow weight word 203A to the other input; additionally, the operand selection logic 1898 provides the narrow data word 209B to one input of the narrow multiplier 242B and the narrow weight word 203B to the other input. To extend, or widen, a narrow word as described above, if the narrow word is signed, the operand selection logic 1898 sign-extends the narrow word, whereas if the narrow word is unsigned, the operand selection logic 1898 pads the narrow word with upper bits of zero value.
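The widening step can be illustrated with a small Python helper; this is a sketch of the behavior described (sign extension for signed words, zero padding for unsigned words), not the actual hardware logic:

```python
# Sketch of the widening step: an 8-bit word is sign-extended if signed,
# zero-extended if unsigned, before entering the 16-bit functional units.
def widen(narrow8: int, signed: bool) -> int:
    if signed and (narrow8 & 0x80):
        return narrow8 | 0xFF00       # replicate the sign bit into the upper byte
    return narrow8                    # upper byte stays zero

assert widen(0xFE, signed=True) == 0xFFFE    # -2 stays -2 at 16 bits
assert widen(0xFE, signed=False) == 0x00FE   # 254 stays 254 at 16 bits
```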
For another example, if the neural processing unit 126 is in the wide configuration and the instruction instructs the arithmetic logic unit 204 to perform an accumulate operation on a weight word, the wide multiplier 242A is skipped, and the operand selection logic 1898 provides the concatenation of outputs 203A and 203B to the wide multiplexer 1896A for provision to the wide adder 244A. Whereas if the neural processing unit 126 is in the narrow configuration and the instruction instructs the arithmetic logic unit 204 to perform an accumulate operation on a weight word, the wide multiplier 242A is skipped, and the operand selection logic 1898 provides an extended version of the output 203A to the wide multiplexer 1896A for provision to the wide adder 244A; additionally, the narrow multiplier 242B is skipped, and the operand selection logic 1898 provides an extended version of the output 203B to the narrow multiplexer 1896B for provision to the narrow adder 244B.
For another example, if the neural processing unit 126 is in the wide configuration and the instruction instructs the arithmetic logic unit 204 to perform an accumulate operation on a data word, the wide multiplier 242A is skipped, and the operand selection logic 1898 provides the concatenation of outputs 209A and 209B to the wide multiplexer 1896A for provision to the wide adder 244A. Whereas if the neural processing unit 126 is in the narrow configuration and the instruction instructs the arithmetic logic unit 204 to perform an accumulate operation on a data word, the wide multiplier 242A is skipped, and the operand selection logic 1898 provides an extended version of the output 209A to the wide multiplexer 1896A for provision to the wide adder 244A; additionally, the narrow multiplier 242B is skipped, and the operand selection logic 1898 provides an extended version of the output 209B to the narrow multiplexer 1896B for provision to the narrow adder 244B. The accumulation of weight or data words is useful for performing averaging operations, which are used in the pooling layers of some artificial neural network applications, such as image processing.
In a preferred embodiment, the neural processing unit 126 also includes a second wide multiplexer (not shown) for skipping the wide adder 244A, to facilitate loading the wide accumulator 202A with a wide data/weight word in the wide configuration or an extended narrow data/weight word in the narrow configuration, and a second narrow multiplexer (not shown) for skipping the narrow adder 244B, to facilitate loading the narrow accumulator 202B with a narrow data/weight word in the narrow configuration. In a preferred embodiment, the arithmetic logic unit 204 also includes wide and narrow comparator/multiplexer combinations (not shown) that receive the respective accumulator values 217A/217B and the respective outputs of the multiplexers 1896A/1896B in order to select the maximum value between the accumulator value 217A/217B and a data/weight word 209A/209B/203A/203B, an operation used by the pooling layers of some artificial neural network applications, as described in more detail below, for example with respect to Figures 27 and 28. Additionally, the operand selection logic 1898 is configured to provide operands of zero value (for addition with zero or for clearing the accumulator) and to provide operands of one value (for multiplication by one).
The narrow activation function unit 212B receives the output 217B of the narrow accumulator 202B and performs an activation function on it to produce the narrow result 133B, and the wide activation function unit 212A receives the output 217A of the wide accumulator 202A and performs an activation function on it to produce the wide result 133A. When the neural processing unit 126 is in the narrow configuration, the wide activation function unit 212A accordingly takes the output 217A of the accumulator 202A and performs an activation function on it to produce a narrow result, e.g., 8 bits, as described in more detail below, for example with respect to Figures 29A through 30.
As described above, a single neural processing unit 126, when in the narrow configuration, effectively functions as two narrow neural processing units and therefore, for smaller words, generally provides up to approximately twice the throughput of the wide configuration. For example, assume a neural network layer of 1024 neurons in which each neuron receives 1024 narrow inputs from the previous layer (and has narrow weight words), resulting in one million connections. For a neural network unit 121 with 512 neural processing units 126, in the narrow configuration (equivalent to 1024 narrow neural processing units), although it processes narrow words rather than wide words, it can process four times the number of connections of the wide configuration (one million connections versus 256K connections) in approximately twice the time (approximately 1026 clock cycles versus 514 clock cycles), i.e., approximately half the time per connection, or roughly twice the throughput.
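The throughput arithmetic can be checked directly; the clock counts below are those quoted above, and the ratio computation is an illustration of the claim rather than part of the specification:

```python
# Checking the narrow-vs-wide scaling claim with the figures quoted above.
wide_connections = 512 * 512          # 512 neurons x 512 inputs = 256K
narrow_connections = 1024 * 1024      # 1024 neurons x 1024 inputs = 1M

wide_clocks = 514                     # wide configuration, per the text
narrow_clocks = 1026                  # narrow configuration, per the text

connections_ratio = narrow_connections // wide_connections   # 4x the connections
time_ratio = narrow_clocks / wide_clocks                     # ~2x the clocks
throughput_ratio = connections_ratio / time_ratio            # ~2x the throughput
```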
In one embodiment, the dynamically configurable neural processing unit 126 of Figure 18 includes three-input multiplexed registers similar to 208A and 208B in place of the registers 205A and 205B, in order to form a rotator for a row of weight words received from the weight random access memory 124, somewhat in the manner described with respect to the embodiment of Fig. 7 but applied to the dynamic configuration described with respect to Figure 18.
Figure 19 is a block diagram illustrating, according to the embodiment of Figure 18, the arrangement of the 2N multiplexed registers 208A/208B of the N neural processing units 126 of the neural network unit 121 of Fig. 1, to illustrate their operation as a rotator for a row of data words 207 received from the data random access memory 122 of Fig. 1. In the embodiment of Figure 19, N is 512, and the neural network unit 121 has 1024 multiplexed registers 208A/208B, denoted 0 through 511, corresponding to the 512 neural processing units 126 and, effectively, to 1024 narrow neural processing units. The two narrow neural processing units within a neural processing unit 126 are denoted A and B, respectively, and within each of the multiplexed registers 208, the designation of the corresponding narrow neural processing unit is shown. More specifically, the multiplexed register 208A of neural processing unit 126 0 is denoted 0-A, the multiplexed register 208B of neural processing unit 126 0 is denoted 0-B, the multiplexed register 208A of neural processing unit 126 1 is denoted 1-A, the multiplexed register 208B of neural processing unit 126 1 is denoted 1-B, the multiplexed register 208A of neural processing unit 126 511 is denoted 511-A, and the multiplexed register 208B of neural processing unit 126 511 is denoted 511-B; these designations also correspond to the narrow neural processing units of Figure 21 described below.
Each multiplexed register 208A receives its corresponding narrow data word 207A of one of the D rows of the data random access memory 122, and each multiplexed register 208B receives its corresponding narrow data word 207B of one of the D rows of the data random access memory 122. That is, multiplexed register 0-A receives narrow data word 0 of the data random access memory 122 row, multiplexed register 0-B receives narrow data word 1, multiplexed register 1-A receives narrow data word 2, multiplexed register 1-B receives narrow data word 3, and so forth, such that multiplexed register 511-A receives narrow data word 1022 and multiplexed register 511-B receives narrow data word 1023. Additionally, multiplexed register 1-A receives the output 209A of multiplexed register 0-A as its input 211A, multiplexed register 1-B receives the output 209B of multiplexed register 0-B as its input 211B, and so forth, such that multiplexed register 511-A receives the output 209A of multiplexed register 510-A as its input 211A and multiplexed register 511-B receives the output 209B of multiplexed register 510-B as its input 211B; and multiplexed register 0-A receives the output 209A of multiplexed register 511-A as its input 211A, and multiplexed register 0-B receives the output 209B of multiplexed register 511-B as its input 211B. Finally, multiplexed register 1-A receives the output 209B of multiplexed register 0-B as its input 1811A, multiplexed register 1-B receives the output 209A of multiplexed register 1-A as its input 1811B, and so forth, such that multiplexed register 511-A receives the output 209B of multiplexed register 510-B as its input 1811A and multiplexed register 511-B receives the output 209A of multiplexed register 511-A as its input 1811B; and multiplexed register 0-A receives the output 209B of multiplexed register 511-B as its input 1811A, and multiplexed register 0-B receives the output 209A of multiplexed register 0-A as its input 1811B. Each of the multiplexed registers 208A/208B receives the control input 213, which controls whether it selects its data word 207A/207B, its rotated input 211A/211B, or its rotated input 1811A/1811B. In one mode of operation, on a first clock cycle, the control input 213 controls each of the multiplexed registers 208A/208B to select its data word 207A/207B for storage in its register and subsequent provision to the arithmetic logic unit 204; and on subsequent clock cycles (e.g., the M-1 clock cycles described above), the control input 213 controls each of the multiplexed registers 208A/208B to select its rotated input 1811A/1811B for storage in its register and subsequent provision to the arithmetic logic unit 204, as described in more detail below.
Figure 20 is a table illustrating a program for storage in the program memory 129 of the neural network unit 121 of Fig. 1 and execution by that neural network unit 121, which has neural processing units 126 according to the embodiment of Figure 18. The example program of Figure 20 is similar to the program of Fig. 4; its differences are described below. The initialize neural processing unit instruction at address 0 specifies that the neural processing units 126 are to be in the narrow configuration. Additionally, as shown, the multiply-accumulate rotate instruction at address 2 specifies a count value of 1023 and requires 1023 clock cycles. This is because the example of Figure 20 effectively assumes a layer of 1024 narrow (e.g., 8-bit) neurons (i.e., narrow neural processing units), each of which has 1024 connection inputs from the 1024 neurons of the previous layer, for a total of 1024K connections. Each neuron receives an 8-bit data value from each connection input and multiplies that 8-bit data value by an appropriate 8-bit weight value.
Figure 21 is a timing diagram showing the neural network unit 121 executing the program of Figure 20, where the neural network unit 121 has neural processing units 126 as shown in Figure 18 operating in the narrow configuration. The timing diagram of Figure 21 is similar to that of Figure 5; the differences are described below.
In the timing diagram of Figure 21, the neural processing units 126 are in the narrow configuration because the initialize-neural-processing-unit instruction at address 0 initialized them into the narrow configuration. Consequently, the 512 neural processing units 126 effectively operate as 1024 narrow neural processing units (or neurons), which are denoted in the columns as neural processing unit 0-A and neural processing unit 0-B (the two narrow neural processing units of the neural processing unit 126 denoted 0), neural processing unit 1-A and neural processing unit 1-B (the two narrow neural processing units of the neural processing unit 126 denoted 1), and so on through neural processing unit 511-A and neural processing unit 511-B (the two narrow neural processing units of the neural processing unit 126 denoted 511). To simplify the illustration, only the operations of narrow neural processing units 0-A, 0-B and 511-B are shown. Because the multiply-accumulate rotate instruction at address 2 has a count value of 1023 and thus requires 1023 clock cycles to complete, the timing diagram of Figure 21 extends through clock cycle 1026.
At clock cycle 0, each of the 1024 neural processing units executes the initialize instruction of Figure 4, shown in Figure 5 as the assignment of a zero value to the accumulator 202.
At clock cycle 1, each of the 1024 narrow neural processing units executes the multiply-accumulate instruction at address 1 of Figure 20. As shown, narrow neural processing unit 0-A adds to the accumulator 202A value (i.e., zero) the product of narrow word 0 of row 17 of the data random access memory 122 and narrow word 0 of row 0 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value (i.e., zero) the product of narrow word 1 of row 17 of the data random access memory 122 and narrow word 1 of row 0 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B adds to the accumulator 202B value (i.e., zero) the product of narrow word 1023 of row 17 of the data random access memory 122 and narrow word 1023 of row 0 of the weight random access memory 124.
At clock cycle 2, each of the 1024 narrow neural processing units executes the first iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow neural processing unit 0-A adds to the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the multiplexed register 208B output 209B of narrow neural processing unit 511-B (i.e., narrow data word 1023 received from the data random access memory 122) and narrow word 0 of row 1 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multiplexed register 208A output 209A of narrow neural processing unit 0-A (i.e., narrow data word 0 received from the data random access memory 122) and narrow word 1 of row 1 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multiplexed register 208A output 209A of narrow neural processing unit 511-A (i.e., narrow data word 1022 received from the data random access memory 122) and narrow word 1023 of row 1 of the weight random access memory 124.
At clock cycle 3, each of the 1024 narrow neural processing units executes the second iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow neural processing unit 0-A adds to the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the multiplexed register 208B output 209B of narrow neural processing unit 511-B (i.e., narrow data word 1022 received from the data random access memory 122) and narrow word 0 of row 2 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multiplexed register 208A output 209A of narrow neural processing unit 0-A (i.e., narrow data word 1023 received from the data random access memory 122) and narrow word 1 of row 2 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multiplexed register 208A output 209A of narrow neural processing unit 511-A (i.e., narrow data word 1021 received from the data random access memory 122) and narrow word 1023 of row 2 of the weight random access memory 124. As shown in Figure 21, this operation continues for the next 1021 clock cycles, through clock cycle 1024 described below.
At clock cycle 1024, each of the 1024 narrow neural processing units executes the 1023rd iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow neural processing unit 0-A adds to the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the multiplexed register 208B output 209B of narrow neural processing unit 511-B (i.e., narrow data word 1 received from the data random access memory 122) and narrow word 0 of row 1023 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multiplexed register 208A output 209A of narrow neural processing unit 0-A (i.e., narrow data word 2 received from the data random access memory 122) and narrow word 1 of row 1023 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B adds to the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multiplexed register 208A output 209A of narrow neural processing unit 511-A (i.e., narrow data word 0 received from the data random access memory 122) and narrow word 1023 of row 1023 of the weight random access memory 124.
At clock cycle 1025, the activation function units 212A/212B of each of the 1024 narrow neural processing units execute the activation function instruction at address 3 of Figure 20. Finally, at clock cycle 1026, each of the 1024 narrow neural processing units writes its narrow result 133A/133B back to the corresponding narrow word of row 16 of the data random access memory 122 to execute the write-activation-function-unit instruction at address 4 of Figure 20. That is, the narrow result 133A of neural processing unit 0-A is written to narrow word 0 of the data random access memory 122, the narrow result 133B of neural processing unit 0-B is written to narrow word 1 of the data random access memory 122, and so on, until the narrow result 133B of neural processing unit 511-B is written to narrow word 1023 of the data random access memory 122. Figure 22 illustrates in block diagram form the operation just described with respect to Figure 21.
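The net effect of clock cycles 0 through 1026 is a rotating dot product: each narrow neuron accumulates the same data row against its own column of weights, with the data words rotating one position per cycle. A behavioral Python sketch (names and structure are my own, not from the patent) of the accumulation, omitting the activation function:

```python
# Behavioral model of the layer computed in Figures 20-21: W narrow neurons,
# each accumulating a W-term dot product. After r rotations, neural processing
# unit j holds data word (j - r) mod W, matching the text (e.g., at weight row
# 1, unit 0-A holds word 1023; at weight row 1023, unit 0-A holds word 1).

def narrow_layer(data_row, weights):
    # data_row: W narrow data words (one data RAM row, e.g. row 17)
    # weights[r][j]: narrow weight word j of weight RAM row r, r = 0..W-1
    W = len(data_row)
    acc = [0] * W                       # accumulators 202A/202B, cleared at cycle 0
    for r in range(W):                  # r = 0 is the MULT-ACCUM; r > 0 the rotations
        for j in range(W):
            acc[j] += data_row[(j - r) % W] * weights[r][j]
    return acc

# Example: 4 narrow neurons; only weight row 0 is non-zero, so each neuron
# accumulates exactly its own initially loaded data word.
assert narrow_layer([5, 6, 7, 8], [[1, 1, 1, 1], [0] * 4, [0] * 4, [0] * 4]) == [5, 6, 7, 8]
```

With W = 1024 this performs the 1024K multiply-accumulates of the example in 1024 passes, which is what the hardware achieves in 1024 clock cycles.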
Figure 22 is a block diagram showing the neural network unit 121 of Figure 1 with neural processing units 126 as shown in Figure 18 to execute the program of Figure 20. The neural network unit 121 includes the 512 neural processing units 126, i.e., 1024 narrow neural processing units; the data random access memory 122, which receives its address input 123; and the weight random access memory 124, which receives its address input 125. Although not shown in the figure, at clock cycle 0 all 1024 narrow neural processing units execute the initialize instruction of Figure 20. As shown, at clock cycle 1, the 1024 8-bit data words of row 17 are read from the data random access memory 122 and provided to the 1024 narrow neural processing units. During clock cycles 1 through 1024, the 1024 8-bit weight words of rows 0 through 1023, respectively, are read from the weight random access memory 124 and provided to the 1024 narrow neural processing units. Although not shown, at clock cycle 1 the 1024 narrow neural processing units perform their respective multiply-accumulate operations on the loaded data words and weight words. During clock cycles 2 through 1024, the multiplexed registers 208A/208B of the 1024 narrow neural processing units operate as a 1024 8-bit word rotator that rotates the previously loaded data words of row 17 of the data random access memory 122 to the adjacent narrow neural processing unit, and the narrow neural processing units perform the multiply-accumulate operation on the respective rotated data word and the respective narrow weight word loaded from the weight random access memory 124. Although not shown, at clock cycle 1025 the 1024 narrow activation function units 212A/212B execute the activation instruction. At clock cycle 1026, the 1024 narrow neural processing units write their respective 1024 8-bit results 133A/133B back to row 16 of the data random access memory 122.
It may thus be observed that, compared to the embodiment of Figure 2, the embodiment of Figure 18 gives the programmer the flexibility to perform computations with either wide data and weight words (e.g., 16-bit) or narrow data and weight words (e.g., 8-bit), in response to the precision demands of the particular application. Viewed one way, for narrow-data applications the embodiment of Figure 18 provides twice the throughput of the embodiment of Figure 2, at the cost of the additional narrow elements (e.g., multiplexed register 208B, register 205B, narrow arithmetic logic unit 204B, narrow accumulator 202B, narrow activation function unit 212B), which increase the area of the neural processing unit 126 by approximately 50%.
Tri-mode neural processing units
Figure 23 is a block diagram showing another embodiment of the dynamically configurable neural processing unit 126 of Figure 1. The neural processing unit 126 of Figure 23 is configurable not only into the wide configuration and the narrow configuration, but also into a third configuration, referred to herein as the "funnel" configuration. The neural processing unit 126 of Figure 23 is similar to the neural processing unit 126 of Figure 18. However, the wide adder 244A of Figure 18 is replaced in the neural processing unit 126 of Figure 23 by a three-input wide adder 2344A that receives a third addend 2399, which is an extended version of the output of the narrow multiplexer 1896B. The program executed by a neural network unit having the neural processing units of Figure 23 is similar to the program of Figure 20. However, the initialize-neural-processing-unit instruction at address 0 initializes the neural processing units 126 into the funnel configuration rather than the narrow configuration, and the count value of the multiply-accumulate rotate instruction at address 2 is 511 rather than 1023.
When in the funnel configuration, the neural processing unit 126 operates similarly to when in the narrow configuration: when executing a multiply-accumulate instruction such as that at address 1 of Figure 20, the neural processing unit 126 receives two narrow data words 207A/207B and two narrow weight words 206A/206B; the wide multiplier 242A multiplies data word 209A and weight word 203A to produce the product 246A selected by wide multiplexer 1896A; and the narrow multiplier 242B multiplies data word 209B and weight word 203B to produce the product 246B selected by narrow multiplexer 1896B. However, the wide adder 2344A adds both product 246A (selected by wide multiplexer 1896A) and product 246B/2399 (selected by wide multiplexer 1896B) to the wide accumulator 202A output 217A, while the narrow adder 244B and narrow accumulator 202B are inactive. Furthermore, when in the funnel configuration and executing a multiply-accumulate rotate instruction such as that at address 2 of Figure 20, the control input 213 causes the multiplexed registers 208A/208B to rotate by two narrow words (e.g., 16 bits); that is, the multiplexed registers 208A/208B select their respective inputs 211A/211B, just as in the wide configuration. However, the wide multiplier 242A multiplies data word 209A and weight word 203A to produce the product 246A selected by wide multiplexer 1896A; the narrow multiplier 242B multiplies data word 209B and weight word 203B to produce the product 246B selected by narrow multiplexer 1896B; and the wide adder 2344A adds both product 246A (selected by wide multiplexer 1896A) and product 246B/2399 (selected by wide multiplexer 1896B) to the wide accumulator 202A output 217A, while the narrow adder 244B and narrow accumulator 202B, as described above, are inactive. Finally, when in the funnel configuration and executing an activation function instruction such as that at address 3 of Figure 20, the wide activation function unit 212A performs the activation function on the resulting sum 215A to produce a narrow result 133A, while the narrow activation function unit 212B is inactive. Consequently, only the narrow neural processing units denoted A produce a narrow result 133A; the narrow results 133B produced by the narrow neural processing units denoted B are invalid. Therefore, the row of written-back results (row 16, as indicated by the instruction at address 4 of Figure 20) contains holes, since only the narrow results 133A are valid and the narrow results 133B are invalid. Conceptually, then, in each clock cycle each neuron (neural processing unit of Figure 23) processes two connection data inputs, i.e., it multiplies two narrow data words by their respective weights and adds the two products, whereas the embodiments of Figures 2 and 18 each process only one connection data input per clock cycle.
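The funnel accumulation can be sketched behaviorally as follows. This Python model is my own interpretation of the description above (names are illustrative): each wide neural processing unit folds two narrow products per step into its single wide accumulator, and the rotator advances by one wide word (two narrow words) per step, so only the A results are produced.

```python
# Behavioral sketch of the funnel configuration: N wide neural processing
# units, 2*N narrow data words, rotation by one wide word per step.

def funnel_layer(data_row, weights):
    # data_row: 2*N narrow data words; weights[r]: 2*N narrow weights of
    # weight RAM row r, r = 0..N-1 (a count of 511 corresponds to N = 512)
    N = len(data_row) // 2
    acc = [0] * N                          # wide accumulators 202A only
    for r in range(N):
        for j in range(N):
            i = (j - r) % N                # wide-word index after r rotations
            acc[j] += data_row[2 * i] * weights[r][2 * j]          # product 246A
            acc[j] += data_row[2 * i + 1] * weights[r][2 * j + 1]  # product 246B/2399
    return acc                             # one narrow result 133A per wide unit
```

Note how each of the N steps consumes two connection inputs per neuron, halving the number of rotate iterations relative to the narrow configuration.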
It may be observed with respect to the embodiment of Figure 23 that the number of result words (neuron outputs) produced and written back to the data random access memory 122 or the weight random access memory 124 is half the square root of the number of data inputs (connections) received, and that the written-back row of results has holes, i.e., every other narrow word result is invalid; more precisely, the results of the narrow neural processing units denoted B are meaningless. The embodiment of Figure 23 is therefore particularly efficient for neural networks with two successive layers in which the first layer has twice as many neurons as the second (e.g., a first layer of 1024 neurons fully connected to a second layer of 512 neurons). Furthermore, the other execution units of the processor 100 (e.g., media units, such as x86 advanced vector extension units) may, when necessary, perform a pack operation on a dispersed result row (i.e., one having holes) to make it compact (i.e., without holes). The neural network unit 121 can then use the processed data row in subsequent computations involving other rows of the data random access memory 122 and/or the weight random access memory 124.
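For illustration only, the pack operation mentioned above can be sketched in a few lines of Python (the actual hardware would use SIMD pack instructions; this function name is an assumption, not an API from the patent):

```python
# Sketch of packing a funnel-configuration result row: only the A results
# (even narrow-word indices) are valid; the B words are holes to be dropped.

def pack_row(result_row):
    return [w for i, w in enumerate(result_row) if i % 2 == 0]

sparse = [10, 0, 20, 0, 30, 0, 40, 0]   # 133A values interleaved with invalid 133B
assert pack_row(sparse) == [10, 20, 30, 40]
```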
Hybrid neural network unit operation: convolution and pooling capabilities
An advantage of the neural network unit 121 of the embodiments described herein is that it can operate concurrently in a fashion similar to a coprocessor, executing its own internal program, and in a fashion similar to an execution unit of a processor, executing architectural instructions issued to it (or microinstructions translated from architectural instructions). The architectural instructions are part of an architectural program being performed by the processor that includes the neural network unit 121. In this manner, the neural network unit 121 operates in a hybrid fashion, and high utilization of the neural network unit 121 can be maintained. For example, Figures 24 through 26 show the operation of the neural network unit 121 performing a convolution operation, in which the neural network unit is fully utilized, and Figures 27 through 28 show the operation of the neural network unit 121 performing a pooling operation. Convolution layers, pooling layers, and other applications of digital data computation, such as image processing (e.g., edge detection, sharpening, blurring, recognition/classification), need these operations. However, the hybrid operation of the neural network unit 121 is not limited to performing convolution or pooling operations; the hybrid feature may also be used to perform other operations, such as the classic neural network multiply-accumulate and activation function operations described above with respect to Figures 4 through 13. That is, the processor 100 (more precisely, the reservation station 108) issues MTNN instructions 1400 and MFNN instructions 1500 to the neural network unit 121, in response to which the neural network unit 121 writes data into the memories 122/124/129 and reads results from the memories 122/124 that the neural network unit 121 wrote there, while at the same time the neural network unit 121 reads and writes the memories 122/124/129 in order to execute the program that the processor 100 wrote (via MTNN instructions 1400) into the program memory 129.
Figure 24 is a block diagram showing an example of data structures used by the neural network unit 121 of Figure 1 to perform a convolution operation. The block diagram includes a convolution kernel 2402, a data array 2404, and the data random access memory 122 and weight random access memory 124 of Figure 1. In one preferred embodiment, the data array 2404 (e.g., of image pixels) is held in system memory (not shown) attached to the processor 100 and is loaded by the processor 100 into the weight random access memory 124 of the neural network unit 121 through the execution of MTNN instructions 1400. A convolution operation convolves a first array with a second array, the second array being the convolution kernel described herein. As described herein, a convolution kernel is a matrix of coefficients, which may also be referred to as weights, parameters, elements, or values. In one preferred embodiment, the convolution kernel 2402 is static data of the architectural program being executed by the processor 100.
The data array 2404 is a two-dimensional array of data values, each data value (e.g., an image pixel value) being the size of a word of the data random access memory 122 or the weight random access memory 124 (e.g., 16 bits or 8 bits). In this example, the data values are 16-bit words, and the neural network unit 121 is configured as 512 wide-configuration neural processing units 126. Additionally, in this embodiment, the neural processing units 126 include a multiplexed register for receiving the weight words 206 from the weight random access memory 124, such as the multiplexed register 705 of Figure 7, in order to perform a collective rotator operation on a row of data values received from the weight random access memory 124, as described in more detail below. In this example, the data array 2404 is a 2560-column by 1600-row pixel array. As shown, when the architectural program convolves the data array 2404 with the convolution kernel 2402, it divides the data array 2404 into 20 data blocks, each data block being a 512 x 400 data matrix 2406.
In this example, the convolution kernel 2402 is a 3x3 array of coefficients, weights, parameters, or elements. The first row of coefficients is denoted C0,0; C0,1; and C0,2; the second row is denoted C1,0; C1,1; and C1,2; and the third row is denoted C2,0; C2,1; and C2,2. For example, a convolution kernel with the following coefficients may be used to perform edge detection: 0, 1, 0, 1, -4, 1, 0, 1, 0. In another embodiment, a convolution kernel with the following coefficients may be used to perform a Gaussian blur operation: 1, 2, 1, 2, 4, 2, 1, 2, 1. In this case, a divide is typically performed on the final accumulated value, where the divisor is the sum of the absolute values of the elements of the convolution kernel 2402, which in this example is 16. In another example, the divisor is the number of elements of the convolution kernel 2402. In yet another example, the divisor is a value used to compress the result of the convolution back into a desired target range of values, and is determined from the element values of the convolution kernel 2402, the target range, and the range of the input value array on which the convolution is performed.
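The divisor arithmetic just described can be checked with a short example. The following sketch (plain Python, not the neural network unit program) computes the absolute-value-sum divisor for the Gaussian blur kernel given above and applies it to a constant patch:

```python
# Illustrative computation of the post-accumulation divisor described above.

gaussian = [1, 2, 1,
            2, 4, 2,
            1, 2, 1]
divisor = sum(abs(c) for c in gaussian)
assert divisor == 16                      # matches the value given in the text

# Convolving a constant 3x3 patch and normalizing leaves the value unchanged:
patch = [[10, 10, 10],
         [10, 10, 10],
         [10, 10, 10]]
acc = sum(gaussian[3 * i + j] * patch[i][j] for i in range(3) for j in range(3))
assert acc // divisor == 10
```

The same computation with the edge-detection kernel (0, 1, 0, 1, -4, 1, 0, 1, 0) yields an absolute-value sum of 8.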
Referring now to Figure 24, and to Figure 25 in which details are described, the architectural program writes the coefficients of the convolution kernel 2402 into the data random access memory 122. In one preferred embodiment, all of the words of each of nine (the number of elements of the convolution kernel 2402) consecutive rows of the data random access memory 122 are written with a different element of the convolution kernel 2402 in row-major order. That is, as shown, each word of one row is written with the first coefficient C0,0; the next row with the second coefficient C0,1; the next row with the third coefficient C0,2; the next row with the fourth coefficient C1,0; and so on, until each word of the ninth row is written with the ninth coefficient C2,2. To convolve the data matrices 2406 of the data blocks into which the data array 2404 is partitioned, the neural processing units 126 repeatedly read, in order, the nine rows of the data random access memory 122 that hold the coefficients of the convolution kernel 2402, as described in more detail below, particularly with respect to Figure 26A.
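The coefficient layout just described (one coefficient replicated across each of nine consecutive rows) can be sketched as follows; this is an illustrative model of the memory image, not code from the patent:

```python
# Build the nine data RAM rows holding the 3x3 kernel coefficients, each row
# filled entirely with one coefficient, in row-major coefficient order.

def kernel_rows(kernel, width):
    # kernel: 3x3 coefficient matrix; width: words per data RAM row (512 here)
    return [[kernel[i][j]] * width for i in range(3) for j in range(3)]

rows = kernel_rows([[0, 1, 0], [1, -4, 1], [0, 1, 0]], 4)   # edge-detection kernel
assert len(rows) == 9
assert rows[0] == [0, 0, 0, 0]        # every word of the first row is C0,0
assert rows[4] == [-4, -4, -4, -4]    # the fifth row is C1,1
```

Replicating each coefficient across the whole row is what lets all 512 neural processing units multiply by the same kernel element in a single instruction.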
Referring again to Figures 24 and 25, the architectural program writes the values of a data matrix 2406 into the weight random access memory 124. As the neural network unit program performs the convolution, it writes the result array back to the weight random access memory 124. In one preferred embodiment, the architectural program writes a first data matrix 2406 into the weight random access memory 124 and starts the neural network unit 121; then, while the neural network unit 121 is performing the convolution of the first data matrix 2406 with the convolution kernel 2402, the architectural program writes a second data matrix 2406 into the weight random access memory 124, so that as soon as the neural network unit 121 completes the convolution of the first data matrix 2406 it can begin the convolution of the second data matrix 2406, as described in more detail below with respect to Figure 25. In this manner, the architectural program alternates back and forth between the two regions of the weight random access memory 124 in order to keep the neural network unit 121 fully utilized. Thus, the example of Figure 24 shows a first data matrix 2406A, corresponding to a first data block occupying rows 0 through 399 of the weight random access memory 124, and a second data matrix 2406B, corresponding to a second data block occupying rows 500 through 899 of the weight random access memory 124. Furthermore, as shown, the neural network unit 121 writes the results of the convolutions back to rows 900-1299 and 1300-1699 of the weight random access memory 124, from which the architectural program subsequently reads them. The data values of a data matrix 2406 held in the weight random access memory 124 are denoted "Dx,y", where "x" is the weight random access memory 124 row number and "y" is the word, or column, number of the weight random access memory. For example, the data word 511 in row 399 is denoted D399,511 in Figure 24, and is received by the multiplexed register 705 of neural processing unit 511.
Figure 25 is a flowchart showing the processor 100 of Figure 1 executing an architectural program that uses the neural network unit 121 to perform the convolution of the convolution kernel 2402 with the data array 2404 of Figure 24. The flow begins at step 2502.
At step 2502, the processor 100, i.e., the architectural program running on the processor 100, writes the convolution kernel 2402 of Figure 24 into the data random access memory 122 in the manner shown and described with respect to Figure 24. Additionally, the architectural program initializes a variable N to a value of 1. The variable N denotes the data block of the data array 2404 currently being processed by the neural network unit 121. Additionally, the architectural program initializes a variable NUM_CHUNKS to a value of 20. The flow then proceeds to step 2504.
At step 2504, as shown in Figure 24, the processor 100 writes the data matrix 2406 for data block 1 into the weight random access memory 124 (e.g., the data matrix 2406A of data block 1). The flow then proceeds to step 2506.
At step 2506, the processor 100 writes a convolution program into the program memory 129 of the neural network unit 121 using MTNN instructions 1400 whose function 1432 specifies writing the program memory 129. The processor 100 then starts the neural network unit convolution program using an MTNN instruction 1400 whose function 1432 specifies starting execution of the program. An example of a neural network unit convolution program is described in more detail with respect to Figure 26A. The flow then proceeds to step 2508.
At decision step 2508, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, the flow proceeds to step 2512; otherwise it proceeds to step 2514.
At step 2512, as shown in Figure 24, the processor 100 writes the data matrix 2406 for data block N+1 into the weight random access memory 124 (e.g., the data matrix 2406B of data block 2). Thus, while the neural network unit 121 is performing the convolution on the current data block, the architectural program writes the data matrix 2406 for the next data block into the weight random access memory 124, so that the neural network unit 121 can immediately begin the convolution of the next data block once the convolution of the current data block is complete, i.e., once the results have been written to the weight random access memory 124.
At step 2514, the processor 100 determines whether the currently running neural network unit program (started at step 2506 in the case of data block 1, and at step 2518 in the case of data blocks 2 through 20) has completed execution. In one preferred embodiment, the processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of the neural network unit 121. In an alternative embodiment, the neural network unit 121 generates an interrupt to indicate that it has completed the convolution program. The flow then proceeds to decision step 2516.
At decision step 2516, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, the flow proceeds to step 2518; otherwise it proceeds to step 2522.
At step 2518, the processor 100 updates the convolution program so that it can convolve data block N+1. More precisely, the processor 100 updates the weight random access memory 124 row value of the initialize-neural-processing-unit instruction at address 0 to the first row of the data matrix 2406 (e.g., to row 0 for data matrix 2406A or to row 500 for data matrix 2406B) and updates the output row (e.g., to row 900 or 1300). The processor 100 then starts the updated neural network unit convolution program. The flow then proceeds to step 2522.
At step 2522, the processor 100 reads from the weight random access memory 124 the results of the neural network unit convolution program for data block N. The flow then proceeds to decision step 2524.
At decision step 2524, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, the flow proceeds to step 2526; otherwise the flow ends.
At step 2526, the architectural program increments N by one. The flow then returns to decision step 2508.
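The double-buffered flow of steps 2502 through 2526 can be sketched as a loop. In the following Python sketch the neural network unit interactions are replaced by caller-supplied stand-in functions; the function names and the callback decomposition are my own assumptions, not an API from the patent.

```python
# Sketch of the architectural-program flow of Figure 25: while the NNU
# convolves data block n, the next block n+1 is prefetched into the other
# weight RAM region, keeping the NNU fully utilized.

NUM_CHUNKS = 20

def convolve_all(write_weights, start_program, update_program,
                 wait_done, read_results):
    results = []
    write_weights(1)                     # step 2504: data matrix for block 1
    start_program(1)                     # step 2506: write and start the program
    n = 1
    while True:
        if n < NUM_CHUNKS:               # steps 2508/2512: prefetch next block
            write_weights(n + 1)
        wait_done()                      # step 2514: poll status register / interrupt
        if n < NUM_CHUNKS:               # steps 2516/2518: retarget rows, restart
            update_program(n + 1)
        results.append(read_results(n))  # step 2522: read block n's results
        if n < NUM_CHUNKS:               # steps 2524/2526
            n += 1
        else:
            return results
```

Note that the write of block N+1 (step 2512) happens before the wait (step 2514), which is precisely what overlaps the data transfer with the convolution of block N.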
Figure 26A is a program listing of a neural network unit program that performs the convolution of the data matrix 2406 with the convolution kernel 2402 of Figure 24 and writes the results back to the weight random access memory 124. The program loops a number of times through a loop body made up of the instructions at addresses 1 through 9. The initialize-neural-processing-unit instruction at address 0 specifies the number of times each neural processing unit 126 executes the loop body; in the example of Figure 26A it has a loop count value of 400, corresponding to the number of rows in the data matrix 2406 of Figure 24, and the loop instruction at the end of the loop (at address 10) decrements the current loop count value and, if the result is non-zero, causes control to return to the top of the loop body (i.e., to the instruction at address 1). The initialize-neural-processing-unit instruction also clears the accumulator 202 to zero. In one preferred embodiment, the loop instruction at address 10 also clears the accumulator 202 to zero. Alternatively, as described above, the multiply-accumulate instruction at address 1 may specify that the accumulator 202 be cleared to zero.
For each execution of the loop body of the program, the 512 neural processing units 126 concurrently perform 512 convolutions of the 3x3 convolution kernel with 512 respective 3x3 submatrices of the data matrix 2406. The convolution is the sum of the nine products of an element of the convolution kernel 2402 and the corresponding element of the respective submatrix. In the embodiment of Figure 26A, the origin (center element) of each of the 512 respective 3x3 submatrices is the data word Dx+1,y+1 of Figure 24, where y (the column number) is the neural processing unit 126 number, and x (the row number) is the weight random access memory 124 row number currently read by the multiply-accumulate instruction at address 1 of the program of Figure 26A (this row number is initialized by the initialize-neural-processing-unit instruction at address 0, is incremented upon execution of the multiply-accumulate instructions at addresses 3 and 5, and is updated by the decrement instruction at address 9). Thus, for each iteration of the loop of the program, the 512 neural processing units 126 compute 512 convolutions and write the 512 convolution results back to the instructed row of the weight random access memory 124. Edge handling is omitted here to simplify the explanation, although it should be noted that the use of the collective rotate feature of the neural processing units 126 causes wrapping, for two of the columns of the data (e.g., of the image data matrix, in the case of an image processing application), from one vertical edge to the other vertical edge (e.g., from the left edge to the right edge, or vice versa). The loop body is now described.
The instruction at address 1 is a multiply-accumulate instruction that specifies row 0 of the data RAM 122 and implicitly uses the current weight RAM 124 row, which is preferably held in the sequencer 128 (and which is initialized to zero by the instruction at address 0 for the first pass through the instruction loop). That is, the instruction at address 1 causes each neural processing unit 126 to read its corresponding word from row 0 of the data RAM 122, read its corresponding word from the current weight RAM 124 row, and perform a multiply-accumulate operation on the two words. Thus, for example, neural processing unit 5 multiplies C0,0 by Dx,5 (where "x" is the current weight RAM 124 row), adds the result to the accumulator 202 value 217, and writes the sum back to the accumulator 202.
The instruction at address 2 is a multiply-accumulate instruction that specifies that the data RAM 122 row be incremented (i.e., to 1) and that the row then be read at the incremented address of the data RAM 122. The instruction also specifies that the value in the multiplexed register (mux-reg) 705 of each neural processing unit 126 be rotated to the adjacent neural processing unit 126, which in this example is the row of data matrix 2406 values read from the weight RAM 124 in response to the instruction at address 1. In the embodiment of Figures 24 through 26, the neural processing units 126 rotate the mux-reg 705 values to the left, i.e., from neural processing unit J to neural processing unit J-1, rather than from neural processing unit J to neural processing unit J+1 as described above with respect to Figures 3, 7 and 19. It should be noted that in an embodiment in which the neural processing units 126 rotate to the right, the architectural program may write the convolution kernel 2402 coefficient values to the data RAM 122 in a different order (e.g., rotated around its center column) in order to accomplish a similar convolution result. Furthermore, the architectural program may perform additional pre-processing of the convolution kernel 2402 as needed (e.g., transposition). Additionally, the instruction specifies a count value of 2. Thus, the instruction at address 2 causes each neural processing unit 126 to read its corresponding word from row 1 of the data RAM 122, receive the rotated word into its mux-reg 705, and perform a multiply-accumulate operation on the two words. Because the count value is 2, the instruction also causes each neural processing unit 126 to repeat the foregoing operation; that is, the sequencer 128 increments the data RAM 122 row address (i.e., to 2), and each neural processing unit 126 reads its corresponding word from row 2 of the data RAM 122, receives the rotated word into its mux-reg 705, and performs a multiply-accumulate operation on the two words. Thus, for example, assuming the current weight RAM 124 row is 27, after executing the instruction at address 2, neural processing unit 5 will have accumulated into its accumulator 202 the product of C0,1 and D27,6 and the product of C0,2 and D27,7. Thus, after completing the instructions at addresses 1 and 2, the product of C0,0 and D27,5, the product of C0,1 and D27,6, and the product of C0,2 and D27,7 will have been accumulated into the accumulator 202, along with all other accumulated values from previous passes through the instruction loop.
The instructions at addresses 3 and 4 perform a similar operation to the instructions at addresses 1 and 2, but, by virtue of the weight RAM 124 row-increment indicator, they operate on the next row of the weight RAM 124, and they operate on the next three rows of the data RAM 122, namely rows 3 through 5. That is, taking neural processing unit 5 as an example, after completing the instructions at addresses 1 through 4, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, and the product of C1,2 and D28,7 will have been accumulated into the accumulator 202, along with all other accumulated values from previous passes through the instruction loop.
The instructions at addresses 5 and 6 perform a similar operation to the instructions at addresses 3 and 4, but they operate on the next row of the weight RAM 124 and on the next three rows of the data RAM 122, namely rows 6 through 8. That is, taking neural processing unit 5 as an example, after completing the instructions at addresses 1 through 6, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, the product of C1,2 and D28,7, the product of C2,0 and D29,5, the product of C2,1 and D29,6, and the product of C2,2 and D29,7 will have been accumulated into the accumulator 202, along with all other accumulated values from previous passes through the instruction loop. That is, after completing the instructions at addresses 1 through 6, and assuming the weight RAM 124 row at the beginning of the instruction loop was 27, neural processing unit 5, for example, will have used the convolution kernel 2402 to convolve the following 3x3 submatrix:

D27,5  D27,6  D27,7
D28,5  D28,6  D28,7
D29,5  D29,6  D29,7
More generally, after completing the instructions at addresses 1 through 6, each of the 512 neural processing units 126 will have used the convolution kernel 2402 to convolve the following 3x3 submatrix:

Dr,n    Dr,n+1    Dr,n+2
Dr+1,n  Dr+1,n+1  Dr+1,n+2
Dr+2,n  Dr+2,n+1  Dr+2,n+2

where r is the weight RAM 124 row address value at the beginning of the instruction loop, and n is the neural processing unit 126 number.
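The rotate-based accumulation pattern of addresses 1 through 6 can be checked with a small software model. The rotation direction and counts follow the description above, but the sizes and values are illustrative, and the model simply wraps at the array edge rather than modeling the patent's edge handling:

```python
# Speculative model of addresses 1-6: every NPU n multiply-accumulates
# kernel row i against data row r+i, with a left rotation between steps
# so NPU n successively sees the words first read by NPUs n, n+1, n+2.
N = 8                                         # small NPU count for illustration (patent: 512)
C = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]         # arbitrary 3x3 kernel
D = [[(r + 1) * 100 + c for c in range(N)] for r in range(4)]
r = 0                                         # weight-RAM row at loop start

acc = [0] * N
for i in range(3):                            # data rows r .. r+2
    words = [D[r + i][n] for n in range(N)]   # each NPU reads its own column
    for j in range(3):                        # 1 fresh MAC + count-2 rotated MACs
        for n in range(N):
            acc[n] += C[i][j] * words[n]
        words = words[1:] + words[:1]         # rotate left: NPU n takes NPU n+1's word

# Direct formula from the text: NPU n accumulates sum_{i,j} C[i][j] * D[r+i][n+j]
direct = [sum(C[i][j] * D[r + i][(n + j) % N] for i in range(3) for j in range(3))
          for n in range(N)]
```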
The instruction at address 7 passes through the accumulator 202 value 217 through the activation function unit 212. The pass-through function passes through a word whose size (in bits) is equal to that of the words read from the data RAM 122 and the weight RAM 124 (i.e., 16 bits in this example). Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail below. Alternatively, rather than specifying a pass-through activation function, the instruction may specify a divide activation function that divides the accumulator 202 value 217 by a divisor, as described herein with respect to Figures 29A and 30, e.g., using one of the "dividers" 3014/3016 of Figure 30. For example, in the case of a convolution kernel 2402 with coefficients, such as the aforementioned Gaussian blur kernel with one-sixteenth coefficients, the instruction at address 7 may specify a divide activation function (e.g., divide by 16) rather than a pass-through function. Alternatively, the architectural program may perform the divide-by-16 on the convolution kernel 2402 coefficients before writing them to the data RAM 122 and adjust the location of the binary point of the convolution kernel 2402 values accordingly, e.g., using the data binary point 2922 of Figure 29, described below.
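The two normalization options just described (a divide activation function versus pre-scaled kernel coefficients) are numerically equivalent, as this small sketch suggests. Plain floating-point arithmetic is used here; the hardware instead works in fixed point with an adjusted binary point:

```python
# Two equivalent normalizations for a Gaussian blur kernel whose integer
# coefficients sum to 16 (the 1-2-1 / 2-4-2 / 1-2-1 kernel):
# (a) convolve with integer coefficients, then apply a divide-by-16
#     activation to the accumulator; or
# (b) pre-divide the kernel coefficients by 16 before writing them,
#     and use a plain pass-through activation.
C_int = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]            # coefficient sum = 16
D = [[(r * 3 + c) % 7 for c in range(3)] for r in range(3)]

acc = sum(C_int[i][j] * D[i][j] for i in range(3) for j in range(3))
out_a = acc / 16                                     # (a) divide activation

C_pre = [[c / 16 for c in row] for row in C_int]     # (b) architectural pre-scaling
out_b = sum(C_pre[i][j] * D[i][j] for i in range(3) for j in range(3))
```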
The instruction at address 8 writes the output of the activation function unit 212 to the row of the weight RAM 124 specified by the current value of the output row buffer. The current value is initialized by the instruction at address 0 and is incremented each pass through the loop by virtue of the increment indicator in the instruction.
As may be determined from the example of Figures 24 through 26 with a 3x3 convolution kernel 2402, the neural processing units 126 read the weight RAM 124 approximately every three clock cycles to read a row of the data matrix 2406 and write the weight RAM 124 approximately every twelve clock cycles to write a row of the convolution result matrix. Additionally, assuming an embodiment that includes a write and read buffer such as the buffer 1704 of Figure 17, concurrently with the neural processing unit 126 reads and writes, the processor 100 may read and write the weight RAM 124 such that the buffer 1704 performs approximately one read and one write of the weight RAM every sixteen clock cycles, to read the data matrix and to write the convolution result matrix, respectively. Thus, approximately half of the bandwidth of the weight RAM 124 is consumed by the hybrid fashion in which the neural network unit 121 performs the convolution kernel operation. Although this example involves a 3x3 convolution kernel 2402, the invention is not limited thereto; convolution kernels of other sizes, such as 2x2, 4x4, 5x5, 6x6, 7x7, 8x8 and so forth, may be employed with different neural network unit programs. In the case of a larger convolution kernel, the rotate versions of the multiply-accumulate instructions (such as the instructions at addresses 2, 4 and 6 of Figure 26A, which a larger convolution kernel would require) have larger count values, so the neural processing units 126 read the weight RAM 124 a smaller percentage of the time; consequently, the proportion of the weight RAM 124 bandwidth consumed is smaller.
Alternatively, rather than having the neural network unit program write the convolution results back to different rows of the weight RAM 124 (e.g., rows 900-1299 and 1300-1699), the architectural program may configure the neural network unit program to overwrite rows of the input data matrix 2406 once they are no longer needed. For example, in the case of a 3x3 convolution kernel, the architectural program may write the data matrix 2406 to rows 2-401 of the weight RAM 124 rather than to rows 0-399, and the neural network unit program may be configured to write the convolution results starting at row 0 of the weight RAM 124 and to increment the row number each pass through the instruction loop. In this fashion, the neural network unit program overwrites only rows that are no longer needed. For example, after the first pass through the instruction loop (or, more precisely, after the execution of the instruction at address 1, which loads in the row 0 of the weight RAM 124), the data in row 0 may be overwritten, although the data in rows 1-3 is needed for the second pass through the instruction loop and must not be overwritten; similarly, after the second pass through the instruction loop, the data in row 1 may be overwritten, although the data in rows 2-4 is needed for the third pass through the instruction loop and must not be overwritten; and so forth. In such an embodiment, the height of each data matrix 2406 (data block) may be increased (e.g., to 800 rows), allowing fewer data blocks to be used.
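The safety of the overwrite-in-place scheme can be checked with a trivial model: the row written by each loop pass must not be read again by any later pass. The row offsets here are simplified relative to the text's rows-2-401 example:

```python
# Illustrative check: with a 3x3 kernel, pass p reads three consecutive
# input rows and then writes its result over a row that no later pass
# reads again, so each write clobbers only dead data.
KERNEL_H, PASSES = 3, 400

def rows_read(p):
    """Input rows consumed by loop pass p (simplified offsets)."""
    return set(range(p, p + KERNEL_H))

def row_written(p):
    """Result row written at the end of loop pass p."""
    return p

# A pass is unsafe if the row it overwrites is still needed later.
unsafe = [p for p in range(PASSES)
          if any(row_written(p) in rows_read(q) for q in range(p + 1, PASSES))]
```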
Alternatively, rather than writing the convolution results back to the weight RAM 124, the architectural program may configure the neural network unit program to write the results back to rows of the data RAM 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program may read the results from the data RAM 122 as the neural network unit 121 writes them (e.g., using the data RAM 122 most-recently-written row 2606 address of Figure 26). This alternative may be advantageous in an embodiment with a single-ported weight RAM 124 and a dual-ported data RAM.
As may be determined from the operation of the neural network unit 121 according to the embodiment of Figures 24 through 26A, each execution of the program of Figure 26A takes approximately 5000 clock cycles; consequently, the convolution of the entire 2560x1600 data array 2404 of Figure 24 takes approximately 100,000 clock cycles, considerably fewer than the number of clock cycles required to perform the same task by conventional methods.
Figure 26B is a block diagram illustrating certain fields of the status register 127 of the neural network unit 121 of Figure 1 according to one embodiment. The status register 127 includes a field 2602 that indicates the address of the row of the weight RAM 124 most recently written by the neural processing units 126; a field 2606 that indicates the address of the row of the data RAM 122 most recently written by the neural processing units 126; a field 2604 that indicates the address of the row of the weight RAM 124 most recently read by the neural processing units 126; and a field 2608 that indicates the address of the row of the data RAM 122 most recently read by the neural processing units 126. This enables the architectural program executing on the processor 100 to determine the progress of the neural network unit 121 as it reads and/or writes the data RAM 122 and/or the weight RAM 124. Employing this capability, along with the choice to overwrite the input data matrix as described above (or to write the results to the data RAM 122 as described above), the data array 2404 of Figure 24 may be processed as 5 data blocks of 512x1600 rather than 20 data blocks of 512x400, as follows. The processor 100 writes the first 512x1600 data block into the weight RAM 124 starting at row 2 and starts the neural network unit program (which has a loop count of 1600 and an initialized weight RAM 124 output row of 0). As the neural network unit 121 executes the neural network unit program, the processor 100 monitors the output location/address of the weight RAM 124 in order to (1) read (using MFNN instructions 1500) the rows of the weight RAM 124 that contain valid convolution results written by the neural network unit 121 (beginning at row 0), and (2) write the second 512x1600 data matrix 2406 (beginning at row 2) over the valid convolution results once they have already been read, so that when the neural network unit 121 completes the neural network unit program on the first 512x1600 data block, the processor 100 can, if necessary, immediately update the neural network unit program and restart it on the second 512x1600 data block. This procedure is repeated three more times for the remaining three 512x1600 data blocks so that the neural network unit 121 is kept fully utilized.
In one embodiment, the activation function unit 212 includes the ability to efficiently perform an effective division of the accumulator 202 value 217, as described in more detail below, particularly with respect to Figures 29A, 29B and 30. For example, an activation function neural network unit instruction that divides the accumulator 202 value by 16 may be used for the Gaussian blur matrix described below.
Whereas the convolution kernel 2402 used in the example of Figure 24 is a small static convolution kernel applied to the entire data matrix 2404, the invention is not limited thereto; the convolution kernel may instead be a large matrix with weights specific to the different data values of the data array 2404, such as is common in convolutional neural networks. When the neural network unit 121 is used in this manner, the architectural program may swap the locations of the data matrix and the convolution kernel, i.e., place the data matrix in the data RAM 122 and the convolution kernel in the weight RAM 124, and the number of rows that must be processed by a given execution of the neural network unit program may be relatively smaller.
Figure 27 is a block diagram illustrating an example of the weight RAM 124 of Figure 1 populated with input data upon which a pooling operation is performed by the neural network unit 121 of Figure 1. A pooling operation, performed by a pooling layer of an artificial neural network, reduces the size (dimensions) of a matrix of input data (e.g., an image or a convolved image) by taking subregions, or submatrices, of the input matrix and computing either the maximum or the average value of each submatrix; the resulting matrix of maxima or averages is the pooled matrix. In the example of Figures 27 and 28, the pooling operation computes the maximum value of each submatrix. Pooling operations are particularly useful in artificial neural networks that, for example, perform object classification or detection. Generally, a pooling operation effectively reduces the size of the input matrix by a factor of the number of elements in the submatrix examined; in particular, it reduces each dimension of the input matrix by the number of elements in the corresponding dimension of the submatrix. In the example of Figure 27, the input data is a 512x1600 matrix of wide words (e.g., 16 bits) stored in rows 0 through 1599 of the weight RAM 124. In Figure 27, the words are denoted by their row and column location; e.g., the word in row 0, column 0 is denoted D0,0; the word in row 0, column 1 is denoted D0,1; the word in row 0, column 2 is denoted D0,2; and so forth, such that the word in row 0, column 511 is denoted D0,511. Similarly, the word in row 1, column 0 is denoted D1,0; the word in row 1, column 1 is denoted D1,1; the word in row 1, column 2 is denoted D1,2; and so forth, such that the word in row 1, column 511 is denoted D1,511; and so forth through the word in row 1599, column 0, denoted D1599,0; the word in row 1599, column 1, denoted D1599,1; the word in row 1599, column 2, denoted D1599,2; and so forth, such that the word in row 1599, column 511 is denoted D1599,511.
Figure 28 is a program listing of a neural network unit program that performs a pooling operation on the input data matrix of Figure 27 and writes it back to the weight RAM 124. In the example of Figure 28, the pooling operation computes the maximum value of each respective 4x4 submatrix of the input data matrix. The program loops a number of times through an instruction loop formed by the instructions at addresses 1 through 10. The initialize NPU instruction at address 0 specifies the number of times each neural processing unit 126 executes the instruction loop, which in the example of Figure 28 has a loop count of 400, and the loop instruction at the end of the loop (at address 11) decrements the current loop count and, if the result is non-zero, transfers control back to the top of the instruction loop (i.e., to the instruction at address 1). The input data matrix in the weight RAM 124 is effectively treated by the neural network unit program as 400 mutually exclusive groups of four adjacent rows, namely rows 0-3, rows 4-7, rows 8-11, and so forth through rows 1596-1599. Each group of four adjacent rows includes 128 4x4 submatrices, namely the 4x4 submatrices formed at the intersection of the group's four rows and four adjacent columns, namely columns 0-3, columns 4-7, columns 8-11, and so forth through columns 508-511. Of the 512 neural processing units 126, every fourth one (i.e., 128 in total) performs a pooling operation on a respective 4x4 submatrix, while the other three out of every four neural processing units 126 are unused. More precisely, neural processing units 0, 4, 8, and so forth through neural processing unit 508 each perform a pooling operation on their respective 4x4 submatrix whose leftmost column number corresponds to the neural processing unit number and whose lower row corresponds to the current weight RAM 124 row value, which is initialized to zero by the initialize instruction at address 0 and is incremented by 4 upon each iteration of the instruction loop, as described in more detail below. The 400 iterations of the instruction loop correspond to the number of groups of 4x4 submatrices in the input data matrix of Figure 27 (i.e., the 1600 rows of the input data matrix divided by 4). The initialize NPU instruction also clears the accumulator 202 to zero. Preferably, the loop instruction at address 11 likewise clears the accumulator 202 to zero. Alternatively, the maxwacc instruction at address 1 may specify that the accumulator 202 be cleared to zero.
On each iteration of the instruction loop of the program, the 128 used neural processing units 126 concurrently perform 128 pooling operations on the 128 respective 4x4 submatrices of the current four-row group of the input data matrix. More specifically, each pooling operation determines the maximum-valued element of the sixteen elements of its 4x4 submatrix. In the embodiment of Figure 28, for each neural processing unit y of the 128 used neural processing units 126, the lower-left element of its 4x4 submatrix is the element Dx,y of Figure 27, where x is the current weight RAM 124 row number at the beginning of the instruction loop, which is read by the maxwacc instruction at address 1 of the program of Figure 28 (the row number is also initialized by the initialize NPU instruction at address 0 and is incremented each time the maxwacc instructions at addresses 3, 5 and 7 are executed). Thus, for each pass of the loop, the 128 used neural processing units 126 write back to the specified row of the weight RAM 124 the maximum-valued elements of the respective 128 4x4 submatrices of the current row group. The instruction loop is now described.
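The net effect of the loop, taken over all passes, is ordinary 4x4 max pooling; the following is a direct functional sketch with scaled-down sizes and arbitrary values, not a model of the NPU hardware:

```python
# Direct model of the pooling the text describes: the input is split
# into 4x4 submatrices and each is reduced to its maximum, shrinking
# both dimensions by 4 (patent sizes: 512x1600 -> 128x400).
def max_pool_4x4(D):
    rows, cols = len(D), len(D[0])
    return [[max(D[r + i][c + j] for i in range(4) for j in range(4))
             for c in range(0, cols, 4)]
            for r in range(0, rows, 4)]

D = [[r * 100 + c for c in range(8)] for r in range(8)]
P = max_pool_4x4(D)        # 8x8 input -> 2x2 pooled matrix
```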
The maxwacc instruction at address 1 implicitly uses the current weight RAM 124 row, which is preferably held in the sequencer 128 (and which is initialized to zero by the instruction at address 0 for the first pass through the instruction loop). The instruction at address 1 causes each neural processing unit 126 to read its corresponding word from the current row of the weight RAM 124, compare the word with the accumulator 202 value 217, and store in the accumulator 202 the maximum of the two values. Thus, for example, neural processing unit 8 determines the maximum of the accumulator 202 value 217 and data word Dx,8 (where "x" is the current weight RAM 124 row) and writes it back to the accumulator 202.
The instruction at address 2 is a maxwacc instruction that specifies that the value in the mux-reg 705 of each neural processing unit 126 be rotated to the adjacent neural processing unit 126, which here is the row of input data matrix values just read from the weight RAM 124 in response to the instruction at address 1. In the embodiment of Figures 27 through 28, the neural processing units 126 rotate the mux-reg 705 values to the left, i.e., from neural processing unit J to neural processing unit J-1, as described above with respect to Figures 24 through 26. Additionally, the instruction specifies a count value of 3. Thus, the instruction at address 2 causes each neural processing unit 126 to receive the rotated word into its mux-reg 705, determine the maximum of the rotated word and the accumulator 202 value, and then repeat this operation two more times. That is, each neural processing unit 126 three times receives the rotated word into its mux-reg 705 and determines the maximum of the rotated word and the accumulator 202 value. Thus, for example, assuming the current weight RAM 124 row at the beginning of the instruction loop is 36, taking neural processing unit 8 as an example, after executing the instructions at addresses 1 and 2, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the beginning of the loop and the four weight RAM 124 words D36,8, D36,9, D36,10 and D36,11.
The maxwacc instructions at addresses 3 and 4 perform a similar operation to the instructions at addresses 1 and 2, but, by virtue of the weight RAM 124 row-increment indicator, they operate on the next row of the weight RAM 124. That is, assuming the current weight RAM 124 row at the beginning of the instruction loop is 36, taking neural processing unit 8 as an example, after completing the instructions at addresses 1 through 4, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the beginning of the loop and the eight weight RAM 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10 and D37,11.
The maxwacc instructions at addresses 5 through 8 perform a similar operation to the instructions at addresses 1 through 4, but they operate on the next two rows of the weight RAM 124. That is, assuming the current weight RAM 124 row at the beginning of the instruction loop is 36, taking neural processing unit 8 as an example, after completing the instructions at addresses 1 through 8, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the beginning of the loop and the sixteen weight RAM 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10, D37,11, D38,8, D38,9, D38,10, D38,11, D39,8, D39,9, D39,10 and D39,11. That is, after completing the instructions at addresses 1 through 8, neural processing unit 8 will have determined the maximum of the following 4x4 submatrix:

D36,8  D36,9   D36,10  D36,11
D37,8  D37,9   D37,10  D37,11
D38,8  D38,9   D38,10  D38,11
D39,8  D39,9   D39,10  D39,11
More generally, after completing the instructions at addresses 1 through 8, each of the 128 used neural processing units 126 will have determined the maximum of the following 4x4 submatrix:

Dr,n    Dr,n+1    Dr,n+2    Dr,n+3
Dr+1,n  Dr+1,n+1  Dr+1,n+2  Dr+1,n+3
Dr+2,n  Dr+2,n+1  Dr+2,n+2  Dr+2,n+3
Dr+3,n  Dr+3,n+1  Dr+3,n+2  Dr+3,n+3

where r is the weight RAM 124 row address value at the beginning of the instruction loop, and n is the neural processing unit 126 number.
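A small software model of the maxwacc/rotate sequence confirms that it produces the maximum given by the formula above. Sizes and values are illustrative; the model wraps at the array edge and starts the accumulators at negative infinity rather than modeling the cleared fixed-point accumulator:

```python
# Speculative model of the address 1-8 maxwacc pattern: for each of four
# data rows, every NPU takes the max of its accumulator and its word,
# then three rotations bring in the three neighboring columns. Only
# every fourth NPU's accumulator is meaningful, as the text notes.
N = 16                                        # small NPU count (patent: 512)
D = [[(r * 31 + c * 7) % 97 for c in range(N)] for r in range(4)]
r0 = 0                                        # weight-RAM row at loop start

acc = [float("-inf")] * N
for i in range(4):                            # weight-RAM rows r0 .. r0+3
    words = [D[r0 + i][n] for n in range(N)]
    for j in range(4):                        # 1 fresh max + count-3 rotated maxes
        acc = [max(acc[n], words[n]) for n in range(N)]
        words = words[1:] + words[:1]         # rotate left, as for convolution

# Text's formula: NPU n ends with the max of the 4x4 submatrix
# D[r0..r0+3][n..n+3] (with wraparound in this toy model).
expected = [max(D[r0 + i][(n + j) % N] for i in range(4) for j in range(4))
            for n in range(N)]
```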
The instruction at address 9 passes through the accumulator 202 value 217 through the activation function unit 212. The pass-through function passes through a word whose size (in bits) is equal to that of the words read from the weight RAM 124 (i.e., 16 bits in this example). Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail below.
The instruction at address 10 writes the accumulator 202 value 217 to the row of the weight RAM 124 specified by the current value of the output row buffer, which is initialized by the instruction at address 0 and is incremented each pass through the loop by virtue of the increment indicator in the instruction. More specifically, the instruction at address 10 writes a wide word (e.g., 16 bits) of the accumulator 202 to the weight RAM 124. Preferably, the instruction writes the 16 bits as specified by the output binary point 2916, as described in more detail below with respect to Figures 29A and 29B.
As described above, the rows written to the weight RAM 124 by an iteration of the instruction loop include holes that contain invalid data. That is, wide words 1 through 3, 5 through 7, 9 through 11, and so forth through wide words 509 through 511 of the result 133 are invalid, or unused. In one embodiment, the activation function unit 212 includes a multiplexer that enables merging of the results into adjacent words of a row buffer, such as the row buffer 1104 of Figure 11, for writing back to the output weight RAM 124 row. Preferably, the activation function instruction specifies the number of words in each hole, and the number of hole words controls the multiplexer to merge the results. In one embodiment, the number of hole words may be specified as a value from 2 to 6 in order to merge the outputs of pooled 3x3, 4x4, 5x5, 6x6 or 7x7 submatrices. Alternatively, the architectural program executing on the processor 100 reads the resulting sparse (i.e., having holes) result rows from the weight RAM 124 and performs the merging function using other execution units 112, such as a media unit using architectural merge instructions, e.g., x86 Streaming SIMD Extensions (SSE) instructions. Advantageously, in a concurrent fashion similar to those described above that exploit the hybrid nature of the neural network unit 121, the architectural program executing on the processor 100 may read the status register 127 to monitor the most recently written row of the weight RAM 124 (e.g., field 2602 of Figure 26B) in order to read a resulting sparse row, merge it, and write it back to the same row of the weight RAM 124, so that it is ready to be used as an input data matrix for a next layer of the neural network, such as a convolution layer or a classic neural network layer (i.e., a multiply-accumulate layer). Furthermore, whereas the embodiments described herein perform the pooling operation on 4x4 submatrices, the invention is not limited thereto; the neural network unit program of Figure 28 may be modified to perform the pooling operation on submatrices of other sizes, such as 3x3, 5x5, 6x6 or 7x7.
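Host-side merging of a sparse result row amounts to a strided gather. The patent's SSE-based merge would operate on packed 16-bit words, but the selection logic is the same as this plain-Python sketch:

```python
# The result row described above is sparse: only words 0, 4, 8, ... hold
# valid pooling results; words 1-3, 5-7, ... are holes. A host-side
# pass can compact the row by keeping one word out of every (hole + 1).
def compact_sparse_row(row, hole):
    """Keep one valid word out of every hole + 1 words (hole = 3 for 4x4 pooling)."""
    return row[:: hole + 1]

sparse = [10, -1, -1, -1, 20, -1, -1, -1, 30, -1, -1, -1]   # -1 marks a hole
dense = compact_sparse_row(sparse, 3)
```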
As may also be observed from the above, the number of result rows written to the weight RAM 124 is one-quarter the number of rows of the input data matrix. Finally, in this example, the data RAM 122 is not used. However, alternatively, the data RAM 122, rather than the weight RAM 124, may be used to perform the pooling operation.
In the embodiment of Figures 27 and 28, the pooling operation computes the maximum value of the subregion. However, the program of Figure 28 may be modified to compute the average value of the subregion by, for example, replacing the maxwacc instructions with sumwacc instructions (which sum the weight word with the accumulator 202 value 217) and changing the activation function instruction at address 9 to divide the accumulated results by the number of elements of each subregion (preferably via a reciprocal multiply, as described below), which is sixteen in this example.
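The average-pooling variant just described can be sketched as follows, with the reciprocal multiply standing in for the division as the text suggests; the values and sizes are illustrative:

```python
# Average pooling per the text: accumulate with sumwacc instead of
# maxwacc, then multiply by the reciprocal of the element count
# (1/16 for a 4x4 subregion) instead of dividing.
def avg_pool_4x4_region(D, r, c):
    total = sum(D[r + i][c + j] for i in range(4) for j in range(4))   # sumwacc
    return total * (1.0 / 16)     # reciprocal multiply rather than a divide

D = [[r * 4 + c for c in range(4)] for r in range(4)]   # values 0..15
mean = avg_pool_4x4_region(D, 0, 0)
```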
As may be observed from the operation of the neural network unit 121 according to Figures 27 and 28, each execution of the program of Figure 28 takes approximately 6000 clock cycles to perform a pooling operation on the entire 512x1600 data matrix of Figure 27, which is considerably fewer clock cycles than required by conventional approaches to perform a similar task.
Alternatively, rather than writing the results of the pooling operation back to the weight RAM 124, the architectural program configures the NNU program to write the results back to rows of the data RAM 122, and the architectural program reads the results from the data RAM 122 as the neural network unit 121 writes them (e.g., using the most-recently-written data RAM 122 row 2606 address of Figure 26B). This alternative may be advantageous in an embodiment with a single-ported weight RAM 124 and a dual-ported data RAM 122.
Fixed-point arithmetic with user-supplied binary points, full-precision fixed-point accumulation, user-specified reciprocal value, stochastic rounding of the accumulator value, and selectable activation/output functions
Generally speaking, hardware units that perform arithmetic in digital computing systems are commonly divided into "integer" units and "floating-point" units, according to whether the objects of the arithmetic are integers or floating-point numbers. A floating-point number has a magnitude (or mantissa) and an exponent, and typically a sign. The exponent is an indication of the location of the radix point (typically the binary point) relative to the magnitude. In contrast, an integer has no exponent, but only a magnitude, and typically a sign. A floating-point unit enables a programmer to obtain the numbers it works with from an enormous range of different values, and the hardware takes care of adjusting the exponent values of the numbers as needed, without the programmer having to do so. For example, assume the two floating-point numbers 0.111 x 10^29 and 0.81 x 10^31 are multiplied. (A decimal, or base-10, example is used here, although floating-point units most commonly work with base-2 floating-point numbers.) The floating-point unit automatically takes care of multiplying the mantissas, adding the exponents, and then normalizing the result back to a value of .8991 x 10^59. For another example, assume the same two floating-point numbers are added. The floating-point unit automatically takes care of aligning the binary points of the mantissas before adding them, to produce a resulting sum with a value of .81111 x 10^31.
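The two decimal examples can be checked with a small sketch of what the floating-point unit does automatically: multiply mantissas and add exponents (then renormalize), and align radix points before adding. This is illustrative only; real floating-point units work in binary and additionally handle rounding, signs and special values.

```python
# Decimal (base-10) sketch of floating-point multiply and add, matching the
# text's examples. Mantissas are kept normalized into the range [0.1, 1).

def fp_mul(m1, e1, m2, e2):
    m, e = m1 * m2, e1 + e2
    while m < 0.1:                 # renormalize mantissa into [0.1, 1)
        m, e = m * 10, e - 1
    return m, e

def fp_add(m1, e1, m2, e2):
    if e1 < e2:                    # make operand 1 the larger-exponent one
        m1, e1, m2, e2 = m2, e2, m1, e1
    return m1 + m2 / 10 ** (e1 - e2), e1   # align radix points, then add

print(fp_mul(0.111, 29, 0.81, 31))  # ~ (0.8991, 59)
print(fp_add(0.111, 29, 0.81, 31))  # ~ (0.81111, 31)
```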
However, it is well known that such complexity leads to increased floating-point unit size, increased power consumption, increased clock cycles per instruction and/or lengthened cycle times. For this reason, many devices (e.g., embedded processors, microcontrollers, and relatively low-cost and/or low-power microprocessors) do not include a floating-point unit. As may be observed from the examples above, the complex structure of a floating-point unit includes logic that performs the exponent calculations associated with floating-point addition and multiplication/division (adders that perform addition/subtraction on the exponents of the operands to produce the resulting exponent value of a floating-point multiplication/division, and subtractors that subtract the operand exponents to determine the binary point alignment shift amount for a floating-point addition), shifters that accomplish the binary point alignment of the mantissas for floating-point addition, and shifters that normalize floating-point results. Furthermore, additional logic is typically also required: logic to perform rounding of floating-point results, logic to convert between integer and floating-point formats and between different floating-point formats (e.g., extended precision, double precision, single precision, half precision), leading-zero and leading-one detectors, and logic to handle special floating-point numbers, such as denormals, NaNs and infinities.
Furthermore, correctness verification of a floating-point unit adds significant complexity to the design, due to the increased number space over which the design must be verified, which can lengthen the product development cycle and time to market. Still further, as described above, floating-point arithmetic requires the separate storage and use of a mantissa field and an exponent field for each floating-point number involved in the computation, which increases the amount of storage required and/or reduces precision for a given amount of storage relative to the storage of integers. Many of these disadvantages are avoided by performing arithmetic operations in an integer unit.
Frequently, programmers need to write programs that process fractional numbers, i.e., numbers that are not whole numbers. The programs may need to run on processors that do not have a floating-point unit or, even if the processor has one, the integer instructions executed by the processor's integer units may be faster. To take advantage of the performance benefits of integer units, programmers employ what is commonly known as fixed-point arithmetic on fixed-point numbers. Such programs include instructions that execute on integer units to process integers, or integer data. The software is aware that the data is fractional, and it includes instructions that perform operations on the integer data to deal with the fact that the data is actually fractional, e.g., alignment shifts. Essentially, the fixed-point software manually performs some or all of the functions that a floating-point unit performs.
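The fixed-point software technique described can be sketched as follows: fractional values are stored as plain integers scaled by 2 to the power of a shared fractional-bit count, and the program itself shifts to keep the implied binary point aligned. The scale of 8 fractional bits and the helper names are arbitrary choices for illustration.

```python
# Fixed-point arithmetic performed entirely with integer operations; the
# software, not the hardware, tracks the implied binary point.

FRAC = 8                              # shared fractional-bit count

def to_fixed(x):
    return int(round(x * (1 << FRAC)))

def to_real(q):
    return q / (1 << FRAC)

a, b = to_fixed(1.5), to_fixed(0.25)  # stored as integers 384 and 64
total = a + b                         # add: binary points already aligned
prod = (a * b) >> FRAC                # multiply: product has 2*FRAC bits
print(to_real(total), to_real(prod))  # 1.75 0.375
```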
As used herein, a "fixed-point" number (or value or operand or input or output) is a number whose bits of storage are understood to include bits that represent a fractional portion of the number, referred to here as "fractional bits". The bits of storage of the fixed-point number are held in a memory or register, e.g., as an 8-bit or 16-bit word in a memory or register. Furthermore, the bits of storage of the fixed-point number are all used to express a magnitude, and in some cases one of the bits is used to express a sign; however, none of the bits of storage of the fixed-point number are used to express an exponent of the number. Furthermore, the number of fractional bits, or binary point location, of the fixed-point number is specified in storage that is distinct from the bits of storage of the fixed-point number, and it indicates the number of fractional bits, or binary point location, in a shared, or global, fashion for a set of fixed-point numbers to which the fixed-point number belongs, such as the set of input operands, accumulated values, or output results of an array of processing units.
In the embodiments described herein, the arithmetic logic units are integer units, but the activation function units include fixed-point arithmetic hardware assist, or acceleration. This enables the arithmetic logic unit portions to be smaller and faster, which facilitates having more arithmetic logic units on a given chip space. This implies more neurons per unit of chip space, which is particularly advantageous in a neural network unit.
Furthermore, in contrast to floating-point numbers, each of which requires its own exponent storage bits, the fixed-point numbers of the embodiments described herein are expressed with an indication of the number of storage bits that are fractional bits for the entire set of numbers to which they belong; however, the indication is held in a single, shared storage that globally indicates the number of fractional bits for all the numbers of the entire set, e.g., the set of inputs to a series of operations, the set of accumulated values of a series of operations, the set of outputs. Preferably, the user of the NNU is enabled to specify the number of fractional storage bits for the set of numbers. Thus, it should be understood that although in many contexts (e.g., common mathematics) the term "integer" refers to a signed whole number, i.e., a number without a fractional portion, in the present context the term "integer" may refer to numbers that have a fractional portion. Furthermore, in the present context, the term "integer" is intended to distinguish from floating-point numbers, for which a portion of the bits of their individual storage is used to express an exponent of the floating-point number. Similarly, integer arithmetic operations, such as the integer multiplies, adds or compares performed by an integer unit, assume the operands do not have an exponent; therefore, the integer elements of an integer unit, e.g., an integer multiplier, integer adder, integer comparator, do not include logic to deal with exponents, e.g., they do not shift mantissas to align binary points for addition or compare operations, and they do not add exponents for multiply operations.
Additionally, embodiments described herein include a large hardware integer accumulator to accumulate a large series of integer operations (e.g., on the order of 1000 multiply-accumulates) without loss of precision. This enables the neural network unit to avoid dealing with floating-point numbers while at the same time retaining full precision in the accumulated values, without saturating them or incurring inaccurate results due to overflow. Once the series of integer operations has accumulated a result into the full-precision accumulator, the fixed-point hardware assist performs the necessary scaling and saturating to convert the full-precision accumulated value to an output value, using the user-specified indication of the number of fractional bits of the accumulated value and the desired number of fractional bits of the output value, as described in more detail below.
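The accumulate-then-convert flow described can be sketched as follows: products are summed at full precision in a wide integer accumulator, and only the final value is scaled (by the user-specified fractional-bit counts) and saturated to the output width. The function name is illustrative, and a plain truncating shift stands in for the configurable rounding the hardware additionally provides.

```python
# Full-precision integer accumulation followed by a single scale-and-saturate
# conversion to a narrow signed output; a simplified model of the flow above.

def mac_and_convert(data, weights, acc_frac, out_frac, out_bits):
    acc = 0
    for d, w in zip(data, weights):          # full-precision integer MACs
        acc += d * w
    shifted = acc >> (acc_frac - out_frac)   # drop excess fractional bits
    hi = (1 << (out_bits - 1)) - 1           # signed saturation bounds
    lo = -(1 << (out_bits - 1))
    return max(lo, min(hi, shifted))

# 1000 products of 100*100 overflow a 16-bit output, so the result saturates.
print(mac_and_convert([100] * 1000, [100] * 1000, 8, 0, 16))  # 32767
```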
When the accumulated value needs to be compressed from its full-precision form for use as an input to an activation function or to be passed through, preferably the activation function unit can selectively perform stochastic rounding on the accumulated value, as described in more detail below. Finally, the neural processing unit may selectively accept an indication to apply different activation functions to, and/or output various different forms of, the accumulated value, as dictated by the different needs of a given layer of a neural network.
Figure 29A is a block diagram illustrating an embodiment of the control register 127 of Figure 1. The control register 127 may comprise a plurality of registers 127. As shown, the control register 127 includes the following fields: configuration 2902, signed data 2912, signed weight 2914, data binary point 2922, weight binary point 2924, ALU function 2926, round control 2932, activation function 2934, reciprocal 2942, shift amount 2944, output RAM 2952, output binary point 2954, and output command 2956. The control register 127 values may be written by both an MTNN instruction 1400 and an instruction of an NNU program, such as an initiate instruction.
The configuration 2902 value specifies whether the neural network unit 121 is in a narrow, wide, or funnel configuration, as described above. The configuration 2902 also implies the size of the input words received from the data RAM 122 and the weight RAM 124. In the narrow and funnel configurations, the size of the input words is narrow (e.g., 8 bits or 9 bits), whereas in the wide configuration the size of the input words is wide (e.g., 12 bits or 16 bits). Furthermore, the configuration 2902 implies the size of the output result 133, which is the same as the input word size.
The signed data value 2912, if true, indicates that the data words received from the data RAM 122 are signed values; if false, it indicates they are unsigned values. The signed weight value 2914, if true, indicates that the weight words received from the weight RAM 124 are signed values; if false, it indicates they are unsigned values.
The data binary point 2922 value indicates the location of the binary point for the data words received from the data RAM 122. Preferably, the data binary point 2922 value indicates the number of bit positions from the right for the location of the binary point. Stated alternatively, the data binary point 2922 indicates how many of the least significant bits of the data word are fractional bits, i.e., to the right of the binary point. Similarly, the weight binary point 2924 value indicates the location of the binary point for the weight words received from the weight RAM 124. Preferably, when the ALU function 2926 is a multiply-accumulate or output accumulator, then the neural processing unit 126 determines the number of bits to the right of the binary point for the value held in the accumulator 202 as the sum of the data binary point 2922 and the weight binary point 2924. Thus, for example, if the value of the data binary point 2922 is 5 and the value of the weight binary point 2924 is 3, then the value in the accumulator 202 has 8 bits to the right of the binary point. When the ALU function 2926 is a sum/maximum of accumulator and data/weight word, or a pass-through of data/weight word, the neural processing unit 126 determines the number of bits to the right of the binary point for the value held in the accumulator 202 as the data/weight binary point 2922/2924, respectively. In an alternate embodiment, described below with respect to Figure 29B, a single accumulator binary point 2923 is specified instead, rather than an individual data binary point 2922 and weight binary point 2924.
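The binary-point rule just stated can be verified numerically: for a multiply-accumulate, the accumulator's implied binary point is the sum of the data and weight binary points. The operand values (1.5 and 0.625) are illustrative.

```python
# A data word with 5 fractional bits times a weight word with 3 fractional
# bits yields a product (and accumulator value) with 8 fractional bits.

data_bp, weight_bp = 5, 3             # as in the example in the text
d = int(1.5 * (1 << data_bp))         # data word: 1.5 -> integer 48
w = int(0.625 * (1 << weight_bp))     # weight word: 0.625 -> integer 5
acc = d * w                           # plain integer product: 240
acc_bp = data_bp + weight_bp          # accumulator binary point: 8
print(acc / (1 << acc_bp))            # 0.9375 (= 1.5 * 0.625)
```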
The ALU function 2926 specifies the function performed by the ALU 204 of the neural processing unit 126. As described above, the ALU functions 2926 may include, but are not limited to: multiply the data word 209 and the weight word 203 and accumulate the product with the accumulator 202; sum the accumulator 202 and the weight word 203; sum the accumulator 202 and the data word 209; maximum of the accumulator 202 and the data word 209; maximum of the accumulator 202 and the weight word 203; output the accumulator 202; pass through the data word 209; pass through the weight word 203; output zero. In one embodiment, the ALU function 2926 is specified by an NNU initiate instruction and used by the ALU 204 in response to an execute instruction (not shown). In another embodiment, the ALU function 2926 is specified by individual NNU instructions, such as the multiply-accumulate and maxwacc instructions described above.
The round control 2932 specifies which form of rounding is used by the rounder 3004 (of Figure 30). In one embodiment, the specifiable rounding modes include, but are not limited to: no rounding, round to nearest, and stochastic rounding. Preferably, the processor 100 includes a random bit source 3003 (of Figure 30) that generates random bits 3005 that are sampled and used to perform the stochastic rounding, in order to reduce the likelihood of a rounding bias. In one embodiment, when the round bit is one and the sticky bit is zero, the neural processing unit 126 rounds up if the sampled random bit 3005 is true and does not round up if the random bit 3005 is false. In one embodiment, the random bit source 3003 generates the random bits 3005 by sampling random electrical characteristics of the processor 100, such as thermal noise across a semiconductor diode or resistor, although the invention is not limited in this respect.
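The stochastic rounding embodiment described can be sketched as follows: when dropping low-order bits, a value that is exactly halfway (round bit one, sticky bits zero) is rounded up or down based on a sampled random bit, which removes the systematic upward bias of always rounding halves up. The function name and the software random source are illustrative stand-ins for the hardware random bit source 3003.

```python
# Right-shift with stochastic rounding of the exactly-halfway case.
import random

def stochastic_round_shift(value, shift, rng=random):
    round_bit = (value >> (shift - 1)) & 1
    sticky = value & ((1 << (shift - 1)) - 1)
    result = value >> shift
    if round_bit and sticky:          # more than halfway: always round up
        result += 1
    elif round_bit:                   # exactly halfway: random tie-break
        result += rng.getrandbits(1)
    return result

rng = random.Random(0)
total = sum(stochastic_round_shift(5, 1, rng) for _ in range(1000))
print(total)  # close to 2500: 5/2 = 2.5 rounds to 2 or 3 with equal chance
```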
The activation function 2934 specifies the function applied to the accumulator 202 value 217 to generate the output 133 of the neural processing unit 126. As described herein, the activation functions 2934 include, but are not limited to: sigmoid; hyperbolic tangent; softplus; rectify; divide by a specified power of two; multiply by a user-specified reciprocal value to accomplish an effective division; pass through the full accumulator; and pass through the accumulator in a canonical size, as described in more detail below. In one embodiment, the activation function is specified by an NNU activation function instruction. Alternatively, the activation function is specified by the initiate instruction and applied in response to an output instruction, e.g., the activation function unit output instruction at address 4 of Figure 4, in which embodiment the activation function instruction at address 3 of Figure 4 is subsumed by the output instruction.
The reciprocal 2942 value specifies a value that is multiplied by the accumulator 202 value 217 to accomplish a division of the accumulator 202 value 217. That is, the user specifies the reciprocal 2942 value as the reciprocal of the actually desired divisor. This is useful, e.g., in conjunction with convolution or pooling operations, as described herein. Preferably, the user specifies the reciprocal 2942 value in two parts, as described in more detail below with respect to Figure 29C. In one embodiment, the control register 127 includes a field (not shown) that enables the user to specify division by one of a plurality of built-in divisor values, the sizes of which are the equivalent of commonly used convolution kernel sizes, e.g., 9, 25, 36 or 49. In such an embodiment, the activation function unit 212 may store the reciprocals of the built-in divisors for multiplication by the accumulator 202 value 217.
The shift amount 2944 specifies the number of bits by which a shifter of the activation function unit 212 right-shifts the accumulator 202 value 217 to accomplish a division by a power of two. This is also useful in conjunction with convolution kernels whose size is a power of two.
The output RAM 2952 value specifies which of the data RAM 122 and the weight RAM 124 is to receive the output result 133.
The output binary point 2954 value indicates the location of the binary point for the output result 133. Preferably, the output binary point 2954 value indicates the number of bit positions from the right for the location of the binary point of the output result 133. Stated alternatively, the output binary point 2954 indicates how many of the least significant bits of the output result 133 are fractional bits, i.e., to the right of the binary point. The activation function unit 212 performs rounding, compression, saturation and size conversion based on the value of the output binary point 2954 (as well as, in most cases, based on the value of the data binary point 2922, the weight binary point 2924, the activation function 2934, and/or the configuration 2902).
The output command 2956 controls various aspects of the output result 133. In one embodiment, the activation function unit 212 employs the notion of a canonical size, which is twice the size (in bits) of the width specified by the configuration 2902. Thus, for example, if the configuration 2902 implies the size of the input words received from the data RAM 122 and the weight RAM 124 is 8 bits, then the canonical size is 16 bits; for another example, if the configuration 2902 implies the size of the input words is 16 bits, then the canonical size is 32 bits. As described herein, the size of the accumulator 202 is larger (e.g., the narrow accumulator 202B is 28 bits and the wide accumulator 202A is 41 bits) in order to preserve full precision of the intermediate computations, e.g., the 1024 and 512 NNU multiply-accumulate instructions, respectively. Consequently, the accumulator 202 value 217 is larger (in bits) than the canonical size, and for most values of the activation function 2934 (except the pass-through of the full accumulator), the activation function unit 212 (e.g., the canonical size compressor 3008 described below with respect to Figure 30) compresses the accumulator 202 value 217 down to a value that is the canonical size. A first predetermined value of the output command 2956 instructs the activation function unit 212 to perform the specified activation function 2934 to generate an internal result and output it as the output result 133, whose size equals the original input word size, i.e., half the canonical size. A second predetermined value of the output command 2956 instructs the activation function unit 212 to perform the specified activation function 2934 to generate an internal result whose size is twice the original input word size, i.e., the canonical size, and output the lower half of the internal result as the output result 133; and a third predetermined value of the output command 2956 instructs the activation function unit 212 to output the upper half of the canonical-size internal result as the output result 133. A fourth predetermined value of the output command 2956 instructs the activation function unit 212 to output the raw least-significant word of the accumulator 202 as the output result 133; a fifth predetermined value instructs the activation function unit 212 to output the raw middle-significant word of the accumulator 202 as the output result 133; and a sixth predetermined value instructs the activation function unit 212 to output the raw most-significant word of the accumulator 202 (whose width is specified by the configuration 2902) as the output result 133, as described in more detail above with respect to the sections of Figures 8 through 10. As described above, outputting the full accumulator 202 size or the canonical-size internal result is advantageous, e.g., for enabling other execution units 112 of the processor 100 to perform activation functions, such as the softmax activation function.
Although the fields described with respect to Figure 29A (and Figures 29B and 29C) reside in the control register 127, the invention is not limited in this respect; one or more of the fields may reside in other parts of the neural network unit 121. Preferably, many of the fields may be included in the NNU instructions themselves and decoded by the sequencer 128 to generate micro-instructions 3416 (of Figure 34) that control the ALUs 204 and/or activation function units 212. Additionally, the fields may be included in micro-operations 3414 (of Figure 34) stored in the media registers 118, which control the ALUs 204 and/or activation function units 212. Such embodiments minimize the use of the initialize NNU instruction, and in other embodiments the initialize NNU instruction is eliminated.
As described above, an NNU instruction is able to specify that an ALU operation be performed on memory operands (e.g., words from the data RAM 122 and/or the weight RAM 124) or on a rotated operand (e.g., from the mux-regs 208/705). In one embodiment, an NNU instruction may also specify an operand as the registered output of an activation function (e.g., the output of the register 3038 of Figure 30). Additionally, as described above, an NNU instruction is able to specify that a current row address of the data RAM 122 or the weight RAM 124 be incremented. In one embodiment, the NNU instruction may specify an immediate signed integer delta value that is added to the current row, to accomplish incrementing or decrementing by a value other than one.
Figure 29B is a block diagram illustrating another embodiment of the control register 127 of Figure 1. The control register 127 of Figure 29B is similar to the control register 127 of Figure 29A; however, the control register 127 of Figure 29B includes an accumulator binary point 2923. The accumulator binary point 2923 indicates the location of the binary point of the accumulator 202. Preferably, the accumulator binary point 2923 value indicates the number of bit positions from the right for the location of the binary point. Stated alternatively, the accumulator binary point 2923 indicates how many of the least significant bits of the accumulator 202 are fractional bits, i.e., to the right of the binary point. In this embodiment, the accumulator binary point 2923 is specified explicitly, rather than determined implicitly as in the embodiment of Figure 29A.
Figure 29C is a block diagram illustrating an embodiment in which the reciprocal 2942 of Figure 29A is stored in two parts. A first part 2962 is a shift value that indicates the number 2962 of suppressed leading zeroes in the true reciprocal value that the user wants multiplied by the accumulator 202 value 217. The number of leading zeroes is the number of consecutive zeroes immediately to the right of the binary point. A second part 2964 is the leading-zero-suppressed reciprocal value, i.e., the true reciprocal value with all the leading zeroes removed. In one embodiment, the suppressed leading-zero count 2962 is stored as 4 bits and the leading-zero-suppressed reciprocal value 2964 as an 8-bit unsigned value.
To illustrate by example, assume the user wants the accumulator 202 value 217 multiplied by the reciprocal of the value 49. The binary representation of the reciprocal of 49 represented with 13 fractional bits is 0.0000010100111, which has five leading zeroes. In that case, the user fills the suppressed leading-zero count 2962 with a value of 5 and the leading-zero-suppressed reciprocal value 2964 with a value of 10100111. After the reciprocal multiplier ("divider A") 3014 (of Figure 30) multiplies the accumulator 202 value 217 and the leading-zero-suppressed reciprocal value 2964, it right-shifts the resulting product by the suppressed leading-zero count 2962. Such an embodiment advantageously achieves high precision while expressing the reciprocal 2942 value with a relatively small number of bits.
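The example can be worked through in code: 1/49 with 13 fractional bits is 0.0000010100111 in binary, stored as 5 suppressed leading zeroes (field 2962) plus the 8 significant bits 10100111 (field 2964); the multiply-then-shift below mimics the reciprocal multiplier 3014 and the subsequent right shift. The function name is illustrative.

```python
# Two-part reciprocal encoding of 1/49, per the Figure 29C example.

nzeroes = 5                  # suppressed leading-zero count (field 2962)
recip = 0b10100111           # leading-zero-suppressed reciprocal (2964), = 167

def divide_by_49(acc):
    # acc * (167 / 2**13) ~= acc / 49; shifting by nzeroes + 8 restores the
    # 13 fractional bits of the encoded reciprocal (5 zeroes + 8 value bits)
    return (acc * recip) >> (nzeroes + 8)

print(divide_by_49(49 * 100))  # 99, i.e. 4900/49 = 100 to within truncation
```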
Figure 30 is a block diagram illustrating an embodiment of the activation function unit 212 of Figure 2. The activation function unit 212 includes: the control register 127 of Figure 1; a positive form converter (PFC) and output binary point aligner (OBPA) 3002 that receives the accumulator 202 value 217; a rounder 3004 that receives the accumulator 202 value 217 and an indication of the number of bits shifted out by the OBPA 3002; a random bit source 3003, as described above, that generates random bits 3005; a first multiplexer 3006 that receives the outputs of the PFC and OBPA 3002 and of the rounder 3004; a canonical size compressor (CSC) and saturator 3008 that receives the output of the first multiplexer 3006; a bit selector and saturator 3012 that receives the output of the CSC and saturator 3008; a rectifier 3018 that receives the output of the CSC and saturator 3008; a reciprocal multiplier 3014 that receives the output of the CSC and saturator 3008; a right shifter 3016 that receives the output of the CSC and saturator 3008; a hyperbolic tangent (tanh) module 3022 that receives the output of the bit selector and saturator 3012; a sigmoid module 3024 that receives the output of the bit selector and saturator 3012; a softplus module 3026 that receives the output of the bit selector and saturator 3012; a second multiplexer 3032 that receives the outputs of the tanh module 3022, the sigmoid module 3024, the softplus module 3026, the rectifier 3018, the reciprocal multiplier 3014 and the right shifter 3016, as well as the canonical-size passed-through output 3028 of the CSC and saturator 3008; a sign restorer 3034 that receives the output of the second multiplexer 3032; a size converter and saturator 3036 that receives the output of the sign restorer 3034; a third multiplexer 3037 that receives the output of the size converter and saturator 3036 and the accumulator output 217; and an output register 3038 that receives the output of the multiplexer 3037, and whose output is the result 133 of Figure 1.
The PFC and OBPA 3002 receive the accumulator 202 value 217. Preferably, as described above, the accumulator 202 value 217 is a full-precision value. That is, the accumulator 202 has a sufficient number of bits of storage to hold an accumulated value that is the sum, generated by the integer adder 244, of a series of products generated by the integer multiplier 242, without discarding any bits of the individual products of the multiplier 242 or sums of the adder 244, so that no precision is lost. Preferably, the accumulator 202 has at least a sufficient number of bits to hold the maximum number of accumulations of the products that the neural network unit 121 is programmable to perform. For example, referring to the program of Figure 4, the maximum number of product accumulations the neural network unit 121 is programmable to perform when in the wide configuration is 512, and the accumulator 202 bit width is 41. For another example, referring to the program of Figure 20, the maximum number of product accumulations the neural network unit 121 is programmable to perform when in the narrow configuration is 1024, and the accumulator 202 bit width is 28. Generally speaking, the full-precision accumulator 202 includes at least Q bits, where Q is the sum of M and log2(P), where M is the bit width of the integer product of the multiplier 242 (e.g., 18 bits for the narrow multiplier 242, or 32 bits for the wide multiplier 242) and P is the maximum permissible number of products that may be accumulated into the accumulator 202. Preferably, the maximum number of product accumulations is specified via a programming specification to the programmer of the neural network unit 121. In one embodiment, the sequencer 128 enforces a maximum value of the count of a multiply-accumulate NNU instruction (e.g., the instruction at address 2 of Figure 4) of, for example, 511, on the assumption of one previous multiply-accumulate instruction that loads the row of data/weight words 206/207 from the data/weight RAM 122/124 (e.g., the instruction at address 1 of Figure 4).
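The width rule just stated, Q = M + log2(P), can be checked against both configurations. A product width of M = 18 bits is assumed here for the narrow multiplier (consistent with 9-bit narrow operands), which is what makes the stated 28-bit narrow accumulator work out; the function name is illustrative.

```python
# Minimum full-precision accumulator width: product width plus the number of
# doublings implied by the maximum count of accumulated products.
import math

def accumulator_bits(product_bits, max_products):
    return product_bits + int(math.log2(max_products))

print(accumulator_bits(32, 512))   # 41: wide config, 512 accumulations
print(accumulator_bits(18, 1024))  # 28: narrow config, 1024 accumulations
```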
Advantageously, by employing an accumulator 202 that has a bit width large enough to accumulate a full precision value of the maximum permissible number of accumulations, the design of the ALU 204 portion of the NPU 126 is simplified. In particular, it alleviates the need for logic to saturate sums generated by the integer adder 244 that would overflow a smaller accumulator, and for logic to keep track of the binary point location of the accumulator to determine whether an overflow has occurred in order to know whether a saturating operation is needed. To illustrate by example a problem with a design that included a non-full precision accumulator and instead included saturating logic to handle overflows of the non-full precision accumulator, assume the following.
(1) The range of the data word values is between 0 and 1 and all of the storage bits are used to store fractional bits. The range of the weight word values is between -8 and +8 and all but three of the storage bits are used to store fractional bits. And the range of the accumulated values for input to a hyperbolic tangent activation function is between -8 and +8 and all but three of the storage bits are used to store fractional bits.
(2) The bit width of the accumulator is not full precision (e.g., is only the bit width of the products).
(3) The final accumulated value would be somewhere between -8 and +8 (e.g., +4.2) if the accumulator were full precision; however, the products before "point A" in the series tend to be positive much more frequently, whereas the products after point A tend to be negative much more frequently.
In such a situation, an incorrect result (i.e., a result other than +4.2) might be obtained. This is because at some point before point A, when the accumulator should have reached a value that exceeds its saturation maximum of +8, e.g., +8.2, the excess 0.2 would be lost. The accumulator could even remain at the saturated value for the remaining product accumulations, which would result in the loss of even more positive value. Thus, the final value of the accumulator could be a smaller value (i.e., less than +4.2) than it would have been if the accumulator had a full precision bit width.
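The precision-loss scenario above can be reproduced with a minimal simulation; the values are worked in tenths so the arithmetic is exact, and the function is an illustrative sketch of the behavior, not the hardware logic.

```python
def accumulate_tenths(products, saturating, sat_max=80, sat_min=-80):
    """Accumulate products expressed in tenths (+8.0 is 80). The saturating
    variant clamps the running sum after each add, as a non-full-precision
    accumulator with saturation logic would."""
    acc = 0
    for p in products:
        acc += p
        if saturating:
            acc = max(sat_min, min(sat_max, acc))
    return acc

# Positive products first push the sum past +8 (to +8.2); a later negative
# product brings it back down, as in the point-A scenario above.
products = [40, 42, -40]                              # +4.0, +4.2, -4.0
print(accumulate_tenths(products, saturating=False))  # 42 -> +4.2, correct
print(accumulate_tenths(products, saturating=True))   # 40 -> +4.0, the 0.2 overflow was lost
```

The saturating run ends at +4.0 rather than +4.2, exactly the kind of silent error the full-precision accumulator avoids.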
The positive form converter 3004 converts the accumulator 202 value 217 to a positive form when the value is negative, and generates an additional bit that indicates the original sign of the value, which bit is passed down the activation function unit 212 pipeline along with the value. Converting negative values to positive form simplifies subsequent operations of the activation function unit 212. For example, after this conversion only positive values enter the tanh module 3022 and the sigmoid module 3024, so the design of these modules may be simplified. Additionally, the rounder 3004 and the saturator 3008 may also be simplified.
The output binary point aligner 3002 shifts right, or scales, the positive form value to align it with the output binary point 2954 specified in the control register 127. Preferably, the output binary point aligner 3002 calculates, as the shift amount, the difference between the number of fractional bits of the accumulator 202 value 217 (e.g., as specified by the accumulator binary point 2923, or as the sum of the data binary point 2922 and the weight binary point 2924) and the number of fractional bits of the output (e.g., as specified by the output binary point 2954). Thus, for example, if the accumulator 202 binary point 2923 is 8 (as in the embodiment above) and the output binary point 2954 is 3, the output binary point aligner 3002 shifts the positive form value right 5 bits to generate the result provided to the mux 3006 and to the rounder 3004.
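The alignment computation can be sketched as follows; the function name and the choice to return the shifted-out bits separately (so they can feed a rounder) are assumptions for illustration.

```python
def align_to_output_point(acc_value, acc_frac_bits, out_frac_bits):
    """Right-shift a (positive-form) fixed-point value so its binary point
    matches the output binary point; returns the aligned value and the
    shifted-out low-order bits, which would feed the rounder."""
    shift = acc_frac_bits - out_frac_bits  # assumes acc_frac_bits >= out_frac_bits
    aligned = acc_value >> shift
    shifted_out = acc_value & ((1 << shift) - 1)
    return aligned, shifted_out

# Example from the text: accumulator binary point 8, output binary point 3,
# so the value is shifted right 5 bits.
aligned, lost = align_to_output_point(0b1011_10110, 8, 3)
print(bin(aligned), bin(lost))  # 0b1011 0b10110
```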
The rounder 3004 performs a rounding operation on the accumulator 202 value 217. Preferably, the rounder 3004 generates a rounded version of the positive form value generated by the positive form converter and output binary point aligner 3002 and provides the rounded version to the mux 3006. The rounder 3004 rounds according to the round control 2932 described above, which, as described herein, may include stochastic rounding using the random bit 3005. The mux 3006 selects one of its inputs, namely either the positive form value from the positive form converter and output binary point aligner 3002 or the rounded version thereof from the rounder 3004, according to the round control 2932 (which, as described herein, may include stochastic rounding), and provides the selected value to the standard size compressor and saturator 3008. Preferably, if the round control 2932 specifies no rounding, the mux 3006 selects the output of the positive form converter and output binary point aligner 3002, and otherwise selects the output of the rounder 3004. In other embodiments, additional rounding operations may be performed by the activation function unit 212. For example, in one embodiment, the bit selector 3012 rounds based on lost low-order bits when it compresses the output bits of the standard size compressor and saturator 3008 (described below). For another example, the product of the reciprocal multiplier 3014 (described below) may be rounded. For another example, the size converter 3036 may need to round when converting to the proper output size (described below), which may involve losing low-order bits relevant to the rounding determination.
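A plausible reading of the rounder's round-bit/sticky-bit behavior (the policy is an assumption inferred from the worked example given later for Figure 31, where the round bit is one, the sticky bit is zero and the random bit decides) can be sketched as:

```python
def stochastic_round(value, shift, random_bit):
    """Round a fixed-point value being right-shifted by `shift` bits.
    The round bit is the most significant shifted-out bit and the sticky
    bit is the OR of the remaining shifted-out bits; a set round bit with
    a clear sticky bit (an exact tie) is broken by the random bit."""
    if shift == 0:
        return value
    dropped = value & ((1 << shift) - 1)
    kept = value >> shift
    round_bit = (dropped >> (shift - 1)) & 1
    sticky = (dropped & ((1 << (shift - 1)) - 1)) != 0
    if round_bit and (sticky or random_bit):
        kept += 1
    return kept

# Figure 31-style example: fraction bits ...1101010100 shifted right 3; the
# shifted-out bits are 100 (round=1, sticky=0), random bit true -> round up.
print(bin(stochastic_round(0b1_1101010100, 3, random_bit=True)))  # 0b11101011
```

With the random bit false the same value truncates to 0b11101010 instead, which is the essence of the stochastic behavior.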
The standard size compressor 3008 compresses the mux 3006 output value to a standard size. Thus, for example, if the NPU 126 is in a narrow or funnel configuration 2902, the standard size compressor 3008 compresses the 28-bit mux 3006 output value to 16 bits; whereas if the NPU 126 is in a wide configuration 2902, the standard size compressor 3008 compresses the 41-bit mux 3006 output value to 32 bits. However, before compressing to the standard size, if the pre-compressed value is larger than the maximum value expressible in the standard form, the saturator 3008 saturates the pre-compressed value to the maximum value expressible in the standard form. For example, if any of the bits of the pre-compressed value to the left of the most significant standard form bit has a value of 1, the saturator 3008 saturates to the maximum value (e.g., to all 1s).
Preferably, the tanh module 3022, the sigmoid module 3024 and the softplus module 3026 each comprise a lookup table, e.g., a programmable logic array (PLA), a read-only memory (ROM), combinational logic gates, or the like. In one embodiment, in order to simplify these modules 3022/3024/3026 and reduce their size, the input value provided to them has a 3.4 form, i.e., three whole bits and four fractional bits; that is, the input value has four bits to the right of the binary point and three bits to the left of the binary point. These values are chosen because at the extremes of the input value range (-8, +8) of the 3.4 form, the output values asymptotically approach their minimum/maximum values. However, the invention is not limited in this respect, and other embodiments are contemplated that place the binary point at a different location, e.g., in a 4.3 form or a 2.5 form. The bit selector 3012 selects the bits of the standard size compressor and saturator 3008 output that satisfy the 3.4 form criteria, which involves compression, i.e., some bits are lost, since the standard form has a larger number of bits. However, prior to selecting/compressing the standard size compressor and saturator 3008 output value, if the pre-compressed value is greater than the maximum value expressible in the 3.4 form, the saturator 3012 saturates the pre-compressed value to the maximum value expressible in the 3.4 form. For example, if any of the bits of the pre-compressed value to the left of the most significant 3.4 form bit has a value of 1, the saturator 3012 saturates to the maximum value (e.g., to all 1s).
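The select-and-saturate step performed by the bit selector 3012 can be sketched on unsigned positive-form values; this is a simplified model under the assumption that the standard form has at least four fractional bits.

```python
def select_3dot4(standard_value, std_frac_bits):
    """Select the seven bits (three whole, four fractional) around the
    binary point of a standard-form value, saturating to the 3.4-form
    maximum when any higher-order bit is set."""
    shifted = standard_value >> (std_frac_bits - 4)  # keep 4 fractional bits
    if shifted >= (1 << 7):                          # a bit left of the 3.4 range is set
        return (1 << 7) - 1                          # saturate to all ones
    return shifted

# Standard-form value 000000001.1101011 (7 fractional bits) -> 001.1101,
# the 3.4-form value used in the Figure 31 example below.
print(f"{select_3dot4(0b000000001_1101011, 7):07b}")  # 0011101
```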
The tanh module 3022, the sigmoid module 3024 and the softplus module 3026 perform their respective activation functions (described above) on the 3.4 form value output by the standard size compressor and saturator 3008 to generate a result. Preferably, the result of the tanh module 3022 and of the sigmoid module 3024 is a 7-bit result in a 0.7 form, i.e., zero whole bits and seven fractional bits; that is, the output value has seven bits to the right of the binary point. Preferably, the result of the softplus module 3026 is a 7-bit result in a 3.4 form, i.e., in the same form as the input to the module 3026. Preferably, the outputs of the tanh module 3022, the sigmoid module 3024 and the softplus module 3026 are extended to standard form (e.g., with leading zeroes added as necessary) and aligned so as to have the binary point specified by the output binary point 2954 value.
The rectifier 3018 generates a rectified version of the output value of the standard size compressor and saturator 3008. That is, if the output value of the standard size compressor and saturator 3008 is negative (its sign having been piped down as described above), the rectifier 3018 outputs a value of zero; otherwise, the rectifier 3018 outputs its input value. Preferably, the output of the rectifier 3018 is in standard form and has the binary point specified by the output binary point 2954 value.
The reciprocal multiplier 3014 multiplies the output of the standard size compressor and saturator 3008 by the user-specified reciprocal value specified in the reciprocal value 2942 to generate its standard size product, which is effectively the quotient of the output of the standard size compressor and saturator 3008 and the divisor that is the reciprocal of the reciprocal value 2942. Preferably, the output of the reciprocal multiplier 3014 is in standard form and has the binary point specified by the output binary point 2954 value.
The right shifter 3016 shifts the output of the standard size compressor and saturator 3008 right by the user-specified number of bits specified in the shift amount value 2944 to generate its standard size quotient. Preferably, the output of the right shifter 3016 is in standard form and has the binary point specified by the output binary point 2954 value.
The mux 3032 selects the appropriate input specified by the activation function 2934 value and provides its selection to the sign restorer 3034, which converts the positive form output of the mux 3032 to a negative form if the original accumulator 202 value 217 was a negative value, e.g., converts it to two's-complement form.
The size converter 3036 converts the output of the sign restorer 3034 to the proper size based on the value of the output command 2956, which is described above with respect to Figure 29A. Preferably, the output of the sign restorer 3034 has a binary point specified by the output binary point 2954 value. Preferably, for the first predetermined value of the output command, the size converter 3036 discards the bits of the upper half of the sign restorer 3034 output. Furthermore, if the output of the sign restorer 3034 is positive and exceeds the maximum value expressible in the word size specified by the configuration 2902, or is negative and is less than the minimum value expressible in the word size, the saturator 3036 saturates the output to the maximum/minimum value expressible in the word size, respectively. For the second and third predetermined values, the size converter 3036 passes the sign restorer 3034 output through.
The mux 3037 selects either the size converter and saturator 3036 output or the accumulator 202 output 217, based on the output command 2956, for provision to the output register 3038. More specifically, for the first and second predetermined values of the output command 2956, the mux 3037 selects the lower word (whose size is specified by the configuration 2902) of the size converter and saturator 3036 output. For the third predetermined value, the mux 3037 selects the upper word of the size converter and saturator 3036 output. For the fourth predetermined value, the mux 3037 selects the lower word of the raw accumulator 202 value 217; for the fifth predetermined value, the mux 3037 selects the middle word of the raw accumulator 202 value 217; and for the sixth predetermined value, the mux 3037 selects the upper word of the raw accumulator 202 value 217. As described above, preferably the activation function unit 212 pads the upper bits of the upper word of the raw accumulator 202 value 217 with zeroes.
Figure 31 is an example of the operation of the activation function unit 212 of Figure 30. As shown, the configuration 2902 of the NPU 126 is set to a narrow configuration. Additionally, the signed data 2912 and signed weight 2914 values are true. Additionally, the data binary point 2922 value indicates that the binary point for the data RAM 122 words is located such that there are 7 bits to its right, and an example value of the first data word received by the NPU 126 is shown as 0.1001110. Additionally, the weight binary point 2924 value indicates that the binary point for the weight RAM 124 words is located such that there are 3 bits to its right, and an example value of the first weight word received by the NPU 126 is shown as 00001.010.
The 16-bit product of the first data and weight words (which product is accumulated with the initial zero value of the accumulator 202) is shown as 000000.1100001100. Because the data binary point 2912 is 7 and the weight binary point 2914 is 3, the implied accumulator 202 binary point is located such that there are 10 bits to its right. In the case of a narrow configuration, the accumulator 202 is 28 bits wide, as in the present embodiment. In the example, after all the arithmetic logic operations are performed (e.g., all 1024 multiply-accumulates of Figure 20), the accumulator 202 value 217 is 000000000000000001.1101010100.
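The first product of the Figure 31 example can be checked directly: in this fixed-point scheme a value is an integer plus a fractional-bit count, and multiplying two values adds their fractional-bit counts (7 + 3 = 10 here). A minimal sketch:

```python
# Fixed-point operands from the Figure 31 example.
data = 0b0_1001110     # 0.1001110: 7 fractional bits
weight = 0b00001_010   # 00001.010: 3 fractional bits

# Integer multiply; the product has 7 + 3 = 10 fractional bits.
product = data * weight
print(f"{product:016b}")  # 0000001100001100 -> 000000.1100001100
```

The printed 16-bit pattern matches the product shown in the text once the implied binary point (10 bits from the right) is drawn in.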
The output binary point 2954 value indicates that there are 7 bits to the right of the binary point of the output. Therefore, after passing through the output binary point aligner 3002 and the standard size compressor 3008, the accumulator 202 value 217 is scaled, rounded and compressed to the standard form value, namely 000000001.1101011. In this example, the output binary point location indicates 7 fractional bits, and the accumulator 202 binary point location indicates 10 fractional bits. Therefore, the output binary point aligner 3002 calculates a difference of 3 and scales the accumulator 202 value 217 by shifting it right 3 bits. This is indicated in Figure 31 by the loss of the 3 least significant bits (binary 100) of the accumulator 202 value 217. Further in this example, the round control 2932 value indicates that stochastic rounding is to be used, and in this example it is assumed that the sampled random bit 3005 is true. Consequently, as described above, the least significant bit was rounded up, because the round bit of the accumulator 202 value 217 (the most significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) was one, and the sticky bit (the Boolean OR of the 2 least significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) was zero.
In this example, the activation function 2934 indicates that a sigmoid function is to be used. Consequently, the bit selector 3012 selects the bits of the standard form value such that the input to the sigmoid module 3024 has three whole bits and four fractional bits, as described above, i.e., the value 001.1101, as shown. The sigmoid module 3024 output value is put into standard form, yielding the value shown, 000000000.1101110.
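The sigmoid result in this example can be reproduced numerically; the snippet below mimics what one entry of a lookup-table implementation would hold (the function name and the final clamp are assumptions for illustration, since a 0.7-form result cannot represent 1.0).

```python
import math

def sigmoid_0dot7(x_3dot4):
    """Evaluate a sigmoid on a 3.4-form input (4 fractional bits) and
    return a 0.7-form result (7 fractional bits)."""
    x = x_3dot4 / 16.0                # remove the 4 fractional bits
    y = 1.0 / (1.0 + math.exp(-x))
    return min(round(y * 128), 127)   # quantize to 7 fractional bits, clamp

# Input 001.1101 (= 1.8125) -> 0.1101110, as in the Figure 31 example.
print(bin(sigmoid_0dot7(0b001_1101)))  # 0b1101110
```

sigmoid(1.8125) ≈ 0.8597, and 110/128 = 0.859375 is the closest 0.7-form value, matching the 0.1101110 shown in the text.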
The output command 2956 of this example specifies the first predetermined value, i.e., to output in the word size indicated by the configuration 2902, in this case a narrow word (8 bits). Consequently, the size converter 3036 converts the standard sigmoid output value to an 8-bit quantity having an implied binary point with 7 bits to its right, yielding the output value 01101110, as shown.
Figure 32 is a second example of the operation of the activation function unit 212 of Figure 30. The example of Figure 32 illustrates the operation of the activation function unit 212 when the activation function 2934 indicates that the accumulator 202 value 217 is to be passed through in the standard size. As shown, the configuration 2902 is set to a narrow configuration of the NPU 126.
In this example, the accumulator 202 is 28 bits wide, and the accumulator 202 binary point is located such that there are 10 bits to its right (either because, in one embodiment, the sum of the data binary point 2912 and the weight binary point 2914 is 10, or because, in another embodiment, the accumulator binary point 2923 is explicitly specified as having a value of 10). For example, after all the arithmetic logic operations are performed, the accumulator 202 value 217 shown in Figure 32 is 000001100000011011.1101111010.
In this example, the output binary point 2954 value indicates that, for the output, there are 4 bits to the right of the binary point. Therefore, after passing through the output binary point aligner 3002 and the standard size compressor 3008, the accumulator 202 value 217 is saturated and compressed to the standard form value shown, 111111111111.1111, which is received by the mux 3032 as the standard size pass-through value 3028.
Two output commands 2956 are shown in this example. The first output command 2956 specifies the second predetermined value, i.e., to output the lower word of the standard form size. Since the size indicated by the configuration 2902 is a narrow word (8 bits), the standard size is 16 bits, and the size converter 3036 selects the lower 8 bits of the standard size pass-through value 3028 to yield the 8-bit value 11111111, as shown. The second output command 2956 specifies the third predetermined value, i.e., to output the upper word of the standard form size. Consequently, the size converter 3036 selects the upper 8 bits of the standard size pass-through value 3028 to yield the 8-bit value 11111111, as shown.
Figure 33 is a third example of the operation of the activation function unit 212 of Figure 30. The example of Figure 33 illustrates the operation of the activation function unit 212 when the activation function 2934 indicates that the full raw accumulator 202 value 217 is to be passed through. As shown, the configuration 2902 is set to a wide configuration of the NPU 126 (e.g., 16-bit input words).
In this example, the accumulator 202 is 41 bits wide, and the accumulator 202 binary point is located such that there are 8 bits to its right (either because, in one embodiment, the sum of the data binary point 2912 and the weight binary point 2914 is 8, or because, in another embodiment, the accumulator binary point 2923 is explicitly specified as having a value of 8). For example, after all the arithmetic logic operations are performed, the accumulator 202 value 217 shown in Figure 33 is 001000000000000000001100000011011.11011110.
Three output commands 2956 are shown in this example. The first output command specifies the fourth predetermined value, i.e., to output the lower word of the raw accumulator 202 value; the second output command specifies the fifth predetermined value, i.e., to output the middle word of the raw accumulator 202 value; and the third output command specifies the sixth predetermined value, i.e., to output the upper word of the raw accumulator 202 value. Since the size indicated by the configuration 2902 is a wide word (16 bits), as shown in Figure 33, in response to the first output command 2956 the mux 3037 selects the 16-bit value 0001101111011110; in response to the second output command 2956 the mux 3037 selects the 16-bit value 0000000000011000; and in response to the third output command 2956 the mux 3037 selects the 16-bit value 0000000001000000.
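The three word selections above can be checked by slicing the 41-bit raw accumulator value from the text into 16-bit words, with the upper word zero-padded on the left as described; a minimal sketch:

```python
# 41-bit raw accumulator value from the Figure 33 example
# (001000000000000000001100000011011.11011110 with the point removed).
acc = 0b00100000000000000000110000001101111011110

def word(value, index, width=16):
    """Extract the index-th width-bit word (0 = lower) of a raw value;
    upper bits beyond the value's width come out as zeroes."""
    return (value >> (index * width)) & ((1 << width) - 1)

print(f"{word(acc, 0):016b}")  # 0001101111011110  (fourth predetermined value)
print(f"{word(acc, 1):016b}")  # 0000000000011000  (fifth predetermined value)
print(f"{word(acc, 2):016b}")  # 0000000001000000  (sixth predetermined value)
```

All three printed words match the mux 3037 selections listed in the text.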
As discussed above, advantageously the neural network unit 121 operates on integer data rather than floating-point data. This has the advantage of simplifying each NPU 126, or at least the ALU 204 portion thereof. For example, the ALU 204 need not include an adder that would be needed in a floating-point implementation to add the exponents of the multiplicands for the multiplier 242. Similarly, the ALU 204 need not include a shifter that would be needed in a floating-point implementation to align the binary points of the addends for the adder 234. As one skilled in the art will appreciate, floating-point units are generally very complex; thus, these are only examples of simplifications to the ALU 204, and other simplifications are enjoyed by the instant integer embodiments with hardware fixed-point assist that enables the user to specify the relevant binary points. The fact that the ALUs 204 are integer units may advantageously result in a smaller (and faster) NPU 126 than a floating-point counterpart, which further advantageously facilitates the incorporation of a large array of NPUs 126 into the neural network unit 121. The activation function unit 212 portion deals with scaling and saturating the accumulator 202 value 217 based on the, preferably user-specified, number of fractional bits desired in the accumulated value and the number of fractional bits desired in the output value. Advantageously, any additional complexity, and accompanying increase in size, energy and/or time consumption, in the fixed-point hardware assist of the activation function units 212 may be amortized by sharing the activation function units 212 among the ALUs 204, since the number of activation function units 1112 may be reduced in an embodiment employing the sharing scheme, as illustrated with respect to the embodiment of Figure 11.
Advantageously, the embodiments described herein enjoy many of the benefits associated with the reduced hardware complexity of integer arithmetic units relative to floating-point arithmetic units, while still providing arithmetic operations on fractional numbers, i.e., numbers with a binary point. An advantage of floating-point arithmetic is that it accommodates arithmetic on data whose individual values may fall anywhere within a very wide range (which is effectively limited only by the size of the exponent range, which may be very large). That is, each floating-point number has its own potentially unique exponent value. However, the embodiments described herein recognize and take advantage of the fact that there are certain applications in which the input data is highly parallelized and falls within a relatively narrow range such that all the parallelized values can have the same "exponent". Therefore, the embodiments enable the user to specify the binary point location once for all the input values and/or accumulated values. Similarly, recognizing and taking advantage of similar range characteristics of the parallelized outputs, the embodiments enable the user to specify the binary point location once for all the output values. An artificial neural network is an example of such an application, although the embodiments of the invention may also be employed to perform computations for other applications. By specifying the binary point location once for the inputs rather than for each individual input number, the embodiments of the invention provide more efficient use of memory space (e.g., require less memory) than a floating-point implementation and/or provide an increase in precision for a similar amount of memory, since the bits that would be used for an exponent in a floating-point implementation can instead be used to provide more precision in the magnitude.
Further advantageously, the embodiments of the invention recognize the potential loss of precision that may be experienced during the accumulation of a large series of integer operations (e.g., overflow or loss of the less significant fractional bits) and provide a solution, primarily in the form of a sufficiently large accumulator to avoid the loss of precision.
Direct Execution of Neural Network Unit Micro-operations
Figure 34 is a block diagram illustrating the processor 100 of Fig. 1 and, in more detail, portions of the neural network unit 121. The neural network unit 121 includes pipeline stages 3401 of the NPUs 126. The pipeline stages 3401, separated by staging registers, include combinational logic that accomplishes the operation of the NPUs 126 as described herein, such as Boolean logic gates, muxes, adders, multipliers, comparators, and so forth. The pipeline stages 3401 receive a micro-operation 3418 from a mux 3402. The micro-operation 3418 flows down the pipeline stages 3401 and controls their combinational logic. The micro-operation 3418 is a collection of bits. Preferably, the micro-operation 3418 includes the bits of the data RAM 122 memory address 123, the bits of the weight RAM 124 memory address 125, the bits of the program memory 129 memory address 131, the mux-reg 208/705 control signals 213/713, and many fields of the control register 127 (e.g., of the control registers of Figures 29A through 29C). In one embodiment, the micro-operation 3418 comprises approximately 120 bits. The mux 3402 receives a micro-operation from three different sources and selects one of them as the micro-operation 3418 for provision to the pipeline stages 3401.
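The fields the text says travel in the micro-operation 3418 can be grouped for illustration as below; the field names, types and grouping are assumptions, not the actual ~120-bit encoding.

```python
from dataclasses import dataclass

@dataclass
class MicroOp:
    """Illustrative grouping of the micro-operation 3418 fields named in
    the text; names and widths are hypothetical."""
    data_ram_addr: int     # data RAM 122 memory address 123
    weight_ram_addr: int   # weight RAM 124 memory address 125
    program_mem_addr: int  # program memory 129 memory address 131
    muxreg_ctrl: int       # mux-reg 208/705 control signals 213/713
    ctrl_reg_fields: int   # control register fields (Figures 29A-29C)

uop = MicroOp(data_ram_addr=5, weight_ram_addr=7, program_mem_addr=0,
              muxreg_ctrl=1, ctrl_reg_fields=0)
print(uop.weight_ram_addr)  # 7
```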
One micro-operation source of the mux 3402 is the sequencer 128 of Fig. 1. The sequencer 128 decodes the neural network unit instructions received from the program memory 129 and in response generates a micro-operation 3416 that is provided to the first input of the mux 3402.
A second micro-operation source of the mux 3402 is a decoder 3404 that receives microinstructions 105 from the reservation stations 108 of Fig. 1 and operands from the general purpose registers 116 and the media registers 118. Preferably, as described above, the microinstructions 105 are generated by the instruction translator 104 in response to translating MTNN instructions 1400 and MFNN instructions 1500. The microinstructions 105 may include an immediate field that specifies a particular function (specified by an MTNN instruction 1400 or an MFNN instruction 1500), such as starting and stopping the execution of a program within the program memory 129, directly executing a micro-operation from the media registers 118, or reading/writing a memory of the neural network unit as described above. The decoder 3404 decodes the microinstructions 105 and in response generates a micro-operation 3412 that is provided to the second input of the mux 3402. Preferably, in response to some functions 1432/1532 of the MTNN 1400/MFNN 1500 instructions, it is not necessary for the decoder 3404 to generate a micro-operation 3412 to send down the pipeline 3401, for example: writing to the control register 127, starting the execution of a program in the program memory 129, pausing the execution of a program in the program memory 129, waiting for the completion of the execution of a program in the program memory 129, reading from the status register 127, and resetting the neural network unit 121.
A third micro-operation source of the mux 3402 is the media registers 118 themselves. Preferably, as described above with respect to Figure 14, an MTNN instruction 1400 may specify a function that instructs the neural network unit 121 to directly execute a micro-operation 3414 provided by the media registers 118 to the third input of the mux 3402. The direct execution of a micro-operation 3414 provided by the architectural media registers 118 may be useful for testing the neural network unit 121, e.g., for built-in self test (BIST), or for debugging.
Preferably, the decoder 3404 generates a mode indicator 3422 that controls the selection of the mux 3402. When an MTNN instruction 1400 specifies a function to start running a program from the program memory 129, the decoder 3404 generates a mode indicator 3422 value that causes the mux 3402 to select the micro-operation 3416 from the sequencer 128, until either an error occurs or the decoder 3404 encounters an MTNN instruction 1400 that specifies a function to stop running the program from the program memory 129. When an MTNN instruction 1400 specifies a function that instructs the neural network unit 121 to directly execute a micro-operation 3414 provided by a media register 118, the decoder 3404 generates a mode indicator 3422 value that causes the mux 3402 to select the micro-operation 3414 from the specified media register 118. Otherwise, the decoder 3404 generates a mode indicator 3422 value that causes the mux 3402 to select the micro-operation 3412 from the decoder 3404.
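The mode-indicator policy described above can be sketched as a simple selector; the priority between the run-program state and the direct-execute request is an assumption, since the text treats the two MTNN functions as mutually exclusive.

```python
from enum import Enum

class MicroOpSource(Enum):
    SEQUENCER = 0       # micro-operation 3416 from the sequencer 128
    DECODER = 1         # micro-operation 3412 from the decoder 3404
    MEDIA_REGISTER = 2  # micro-operation 3414 supplied by a media register 118

def select_source(running_program, direct_execute):
    """Sketch of the mode indicator 3422 policy: favor the sequencer while
    a program runs, the media register on a direct-execute request, and
    the decoder otherwise."""
    if running_program:
        return MicroOpSource.SEQUENCER
    if direct_execute:
        return MicroOpSource.MEDIA_REGISTER
    return MicroOpSource.DECODER
```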
Variable Rate Neural Network Unit
In many situations, after the neural network unit 121 runs a program, it enters an idle state waiting for the processor 100 to process something it needs before it can run the next program. For example, assume a situation similar to that described with respect to Figures 3 through 6A, in which the neural network unit 121 runs two or more successive instances of a multiply-accumulate-activation-function program (which may also be referred to as a feed forward neural network layer program). It may take the processor 100 significantly longer to write 512KB worth of weight values into the weight RAM 124 for use by the next run of the neural network unit program than it takes the neural network unit 121 to run its program. Stated alternatively, the neural network unit 121 may run its program in a relatively short amount of time and then sit idle until the processor 100 finishes writing the next weight values into the weight RAM 124 for the next program run. This situation is illustrated in Figure 36A, described in more detail below. In such situations, the neural network unit 121 may be run at a slower clock rate to extend the time taken to run the program, thereby spreading the energy consumption required to run the program over a longer period of time, which tends to keep the neural network unit 121, and perhaps the entire processor 100, at a lower temperature. This situation is referred to as relaxed mode and is illustrated in Figure 36B, described in more detail below.
Figure 35 is a block diagram illustrating a processor 100 that includes a variable rate neural network unit 121. The processor 100 is similar in many respects to the processor 100 of Figure 1, and like-numbered elements are similar. The processor 100 of Figure 35 also includes clock generation logic 3502 coupled to the functional units of the processor 100, namely the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106, the reservation stations 108, the neural network unit 121, the other execution units 112, the memory subsystem 114, the general purpose registers 116 and the media registers 118. The clock generation logic 3502 includes a clock generator, such as a phase-locked loop (PLL), that generates a clock signal having a primary clock rate, or clock frequency. For example, the primary clock rate may be 1GHz, 1.5GHz, 2GHz, and so forth. The clock rate indicates the number of cycles per second, e.g., oscillations of the clock signal between a high state and a low state. Preferably, the clock signal has a balanced duty cycle, i.e., it is high for half the cycle and low for the other half; alternatively, the clock signal may have an unbalanced duty cycle, in which the clock signal is in the high state longer than it is in the low state, or vice versa. Preferably, the PLL is configurable to generate the primary clock signal at multiple clock rates. Preferably, the processor 100 includes a power management module that automatically adjusts the primary clock rate based on various factors, including the dynamically sensed operating temperature of the processor 100, its utilization, and commands from system software (e.g., the operating system or basic input/output system (BIOS)) indicating desired performance and/or power-saving settings. In one embodiment, the power management module includes microcode of the processor 100.
The clock generation logic 3502 also includes a clock distribution network, or clock tree. The clock tree distributes the primary clock signal to the functional units of the processor 100, as shown in Figure 35: clock signal 3506-1 to the instruction fetch unit 101, clock signal 3506-2 to the instruction cache 102, clock signal 3506-10 to the instruction translator 104, clock signal 3506-9 to the rename unit 106, clock signal 3506-8 to the reservation stations 108, clock signal 3506-7 to the neural network unit 121, clock signal 3506-4 to the other execution units 112, clock signal 3506-3 to the memory subsystem 114, clock signal 3506-5 to the general purpose registers 116, and clock signal 3506-6 to the media registers 118; these are referred to collectively as clock signals 3506. The clock tree includes nodes, or wires, that convey the primary clock signals 3506 to their respective functional units. Additionally, the clock generation logic 3502 preferably includes clock buffers that regenerate the primary clock signal as needed to provide cleaner clock signals and/or to boost the voltage level of the primary clock signal, particularly for more distant nodes. Furthermore, each functional unit may include its own sub-clock tree that regenerates and/or boosts its respective received primary clock signal 3506 as needed.
The neural network unit 121 includes clock reduction logic 3504 that receives a relax indicator 3512 and the primary clock signal 3506-7 and, in response, generates a secondary clock signal. The secondary clock signal has a clock rate that is either the same as the primary clock rate or, when in relaxed mode, is reduced from the primary clock rate by an amount programmed into the relax indicator 3512, in order to reduce the rate of heat generation. The clock reduction logic 3504 is similar in many respects to the clock generation logic 3502 in that it includes a clock distribution network, or clock tree, that distributes the secondary clock signal to the various functional blocks of the neural network unit 121: clock signal 3508-1 to the array of neural processing units 126, clock signal 3508-2 to the sequencer 128, and clock signal 3508-3 to the interface logic 3514; these are referred to collectively as secondary clock signals 3508. Preferably, the neural processing units 126 include a plurality of pipeline stages 3401, as shown in Figure 34, which include pipeline staging registers that receive the secondary clock signal 3508-1 from the clock reduction logic 3504.
The neural network unit 121 also includes interface logic 3514 that receives the primary clock signal 3506-7 and the secondary clock signal 3508-3. The interface logic 3514 is coupled between the lower portions of the front end of the processor 100 (e.g., the reservation stations 108, media registers 118 and general purpose registers 116) and the various functional blocks of the neural network unit 121, namely the clock reduction logic 3504, the data RAM 122, the weight RAM 124, the program memory 129 and the sequencer 128. The interface logic 3514 includes a data RAM buffer 3522, a weight RAM buffer 3524, the decoder 3404 of Figure 34, and the relax indicator 3512. The relax indicator 3512 holds a value that specifies how much more slowly, if at all, the array of neural processing units 126 will execute the instructions of the neural network unit program. Preferably, the relax indicator 3512 specifies a divisor value N by which the clock reduction logic 3504 divides the primary clock signal 3506-7 to generate the secondary clock signal 3508, such that the secondary clock rate is 1/N of the primary clock rate. Preferably, the value of N is programmable to any of a plurality of different predetermined values that cause the clock reduction logic 3504 to correspondingly generate the secondary clock signal 3508 at a plurality of different rates, each of which is less than the primary clock rate.
In one embodiment, the clock reduction logic 3504 includes a clock divider circuit that divides the primary clock signal 3506-7 by the relax indicator 3512 value. In one embodiment, the clock reduction logic 3504 includes clock gates (e.g., AND gates) that gate the primary clock signal 3506-7 with an enable signal that is true only once every N cycles of the primary clock signal. For example, a circuit that includes a counter that counts up to N may be used to generate the enable signal: when accompanying logic detects that the counter output matches N, the logic generates a true pulse on the secondary clock signal 3508 and resets the counter. Preferably, the relax indicator 3512 value is programmable by an architectural instruction, such as the MTNN instruction 1400 of Figure 14. Preferably, the architectural program running on the processor 100 programs the relax value into the relax indicator 3512 before it instructs the neural network unit 121 to begin executing the neural network unit program, as described in more detail below with respect to Figure 37.
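The counter-based gating scheme above can be sketched in software as a minimal model (names hypothetical, not the patent's circuit): one enable pulse is produced per N primary clock edges, so the secondary clock runs at 1/N the primary rate.

```python
def secondary_clock_pulses(primary_edges: int, n: int):
    """Model clock gating: emit one secondary pulse per n primary edges."""
    pulses = []
    counter = 0
    for edge in range(primary_edges):
        counter += 1
        if counter == n:         # accompanying logic detects a match with N
            pulses.append(edge)  # true pulse on the secondary clock signal 3508
            counter = 0          # counter is reset
    return pulses

# With a relax divisor of 4, 16 primary edges yield 4 secondary pulses.
print(secondary_clock_pulses(16, 4))  # [3, 7, 11, 15]
```

A divide-by-1 value degenerates to passing every primary edge through, which corresponds to normal mode.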
The weight RAM buffer 3524 is coupled between the weight RAM 124 and the media registers 118 to buffer data transfers between them. Preferably, the weight RAM buffer 3524 is similar to one or more of the embodiments of the buffer 1704 of Figure 17. Preferably, the portion of the weight RAM buffer 3524 that receives data from the media registers 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion of the weight RAM buffer 3524 that receives data from the weight RAM 124 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced from the primary clock rate depending upon the value programmed into the relax indicator 3512, i.e., depending upon whether the neural network unit 121 is operating in relaxed or normal mode. In one embodiment, the weight RAM 124 is single-ported, as described above with respect to Figure 17, and is accessed in an arbitrated fashion both by the media registers 118 via the weight RAM buffer 3524 and by the neural processing units 126 or the row buffer 1104 of Figure 11. In an alternate embodiment, the weight RAM 124 is dual-ported, as described above with respect to Figure 16, and each port is accessible in a concurrent fashion both by the media registers 118 via the weight RAM buffer 3524 and by the neural processing units 126 or the row buffer 1104.
Similarly to the weight RAM buffer 3524, the data RAM buffer 3522 is coupled between the data RAM 122 and the media registers 118 to buffer data transfers between them. Preferably, the data RAM buffer 3522 is similar to one or more of the embodiments of the buffer 1704 of Figure 17. Preferably, the portion of the data RAM buffer 3522 that receives data from the media registers 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion of the data RAM buffer 3522 that receives data from the data RAM 122 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced from the primary clock rate depending upon the value programmed into the relax indicator 3512, i.e., depending upon whether the neural network unit 121 is operating in relaxed or normal mode. In one embodiment, the data RAM 122 is single-ported, as described above with respect to Figure 17, and is accessed in an arbitrated fashion both by the media registers 118 via the data RAM buffer 3522 and by the neural processing units 126 or the row buffer 1104 of Figure 11. In an alternate embodiment, the data RAM 122 is dual-ported, as described above with respect to Figure 16, and each port is accessible in a concurrent fashion both by the media registers 118 via the data RAM buffer 3522 and by the neural processing units 126 or the row buffer 1104.
Preferably, regardless of whether the data RAM 122 and/or weight RAM 124 are single-ported or dual-ported, the interface logic 3514 includes the data RAM buffer 3522 and the weight RAM buffer 3524 in order to synchronize between the primary clock domain and the secondary clock domain. Preferably, the data RAM 122, weight RAM 124 and program memory 129 each comprise a static RAM (SRAM) that includes respective read enable, write enable and memory select signals.
As described above, the neural network unit 121 is an execution unit of the processor 100. An execution unit is a functional unit of a processor that executes the microinstructions into which architectural instructions are translated, or that executes the architectural instructions themselves, e.g., the microinstructions 105 into which the architectural instructions 103 of Figure 1 are translated, or the architectural instructions 103 themselves. An execution unit receives operands from registers of the processor, e.g., from the general purpose registers 116 and the media registers 118. An execution unit generates results in response to executing microinstructions or architectural instructions, and those results may be written to the registers. Examples of the architectural instructions 103 are the MTNN instruction 1400 and the MFNN instruction 1500 described with respect to Figures 14 and 15, respectively. The microinstructions implement the architectural instructions. More precisely, the collective execution by the execution unit of the one or more microinstructions into which an architectural instruction is translated performs the operation specified by the architectural instruction on the inputs specified by the architectural instruction to produce the result defined by the architectural instruction.
Figure 36A is a timing diagram illustrating an example of operation of the processor 100 with the neural network unit 121 operating in normal mode, i.e., at the primary clock rate. In the timing diagram, time progresses from left to right. The processor 100 executes the architectural program at the primary clock rate. More specifically, the front end of the processor 100 (e.g., the instruction fetch unit 101, instruction cache 102, instruction translator 104, rename unit 106 and reservation stations 108) fetches, decodes and issues architectural instructions to the neural network unit 121 and the other execution units 112 at the primary clock rate.
Initially, the architectural program executes an architectural instruction (e.g., an MTNN instruction 1400) that the front end of the processor 100 issues to the neural network unit 121 to instruct the neural network unit 121 to begin executing the neural network unit program in its program memory 129. Beforehand, the architectural program executed an architectural instruction to write into the relax indicator 3512 a value that specifies the primary clock rate, i.e., to place the neural network unit 121 in normal mode. More specifically, the value programmed into the relax indicator 3512 causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at the primary clock rate of the primary clock signal 3506. Preferably, in this case, the clock buffers of the clock reduction logic 3504 simply boost the voltage level of the primary clock signal 3506. Also beforehand, the architectural program executed architectural instructions to write the data RAM 122 and the weight RAM 124 and to write the neural network unit program into the program memory 129. In response to the MTNN instruction 1400 that starts the neural network unit program, the neural network unit 121 begins executing the neural network unit program at the primary clock rate, since the relax indicator 3512 was programmed with the primary-rate value. After the neural network unit 121 begins executing, the architectural program continues to execute architectural instructions at the primary clock rate, including and predominantly MTNN instructions 1400 to write and/or read the data RAM 122 and weight RAM 124 in preparation for the next instance, or invocation, or run, of the neural network unit program.
In the example of Figure 36A, the neural network unit 121 completes its run of the neural network unit program in significantly less time (e.g., one-fourth the time) than the architectural program takes to complete its writes/reads of the data RAM 122 and weight RAM 124. For example, with both operating at the primary clock rate, the neural network unit 121 might take approximately 1000 clock cycles to run the neural network unit program, whereas the architectural program takes approximately 4000 clock cycles. Consequently, the neural network unit 121 sits idle for the remainder of the time, which in this example is a significantly long time, e.g., approximately 3000 primary clock cycles. As shown in the example of Figure 36A, this pattern repeats, perhaps many times, depending upon the size and configuration of the neural network. Because the neural network unit 121 may be a relatively large and transistor-dense functional unit of the processor 100, its operation may generate a significant amount of heat, particularly when running at the primary clock rate.
Figure 36B is a timing diagram illustrating an example of operation of the processor 100 with the neural network unit 121 operating in relaxed mode, i.e., at a clock rate less than the primary clock rate. The timing diagram of Figure 36B is similar in many respects to that of Figure 36A; namely, the processor 100 executes the architectural program at the primary clock rate. The example of Figure 36B assumes the same architectural program and neural network unit program as in Figure 36A. However, before starting the neural network unit program, the architectural program executes an MTNN instruction 1400 that programs the relax indicator 3512 with a value that causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at a secondary clock rate less than the primary clock rate. That is, the architectural program places the neural network unit 121 in the relaxed mode of Figure 36B rather than the normal mode of Figure 36A. Consequently, the neural processing units 126 execute the neural network unit program at the secondary clock rate, which in relaxed mode is less than the primary clock rate. In the example, assume the relax indicator 3512 is programmed with a value that specifies a secondary clock rate of one-fourth the primary clock rate. As a result, the neural network unit 121 takes approximately four times longer to run the neural network unit program in relaxed mode than it does in normal mode, and as a comparison of Figures 36A and 36B shows, the length of time the neural network unit 121 sits idle is significantly shortened. Consequently, the duration over which the neural network unit 121 consumes energy running the neural network unit program in Figure 36B is approximately four times as long as the duration in Figure 36A, in which the neural network unit 121 runs the program in normal mode. Accordingly, the neural network unit 121 generates heat per unit time running the neural network unit program in Figure 36B at approximately one-fourth the rate of Figure 36A, which may have the advantages described herein.
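The numbers in this example can be checked with simple arithmetic; the cycle counts below are the illustrative figures from the text, and the divide-by-4 relax value is the one assumed above:

```python
nnu_cycles = 1000     # NNU program run time at the primary clock rate (Fig. 36A)
arch_cycles = 4000    # architectural program time to refill/read the RAMs
relax_divisor = 4     # relax indicator 3512 value N assumed in Fig. 36B

idle_normal = arch_cycles - nnu_cycles       # NNU idle time in normal mode
relaxed_cycles = nnu_cycles * relax_divisor  # stretched run time in relaxed mode
idle_relaxed = arch_cycles - relaxed_cycles  # NNU idle time in relaxed mode
heat_rate_ratio = 1 / relax_divisor          # same energy spread over 4x the time

print(idle_normal, relaxed_cycles, idle_relaxed, heat_rate_ratio)
# 3000 4000 0 0.25
```

With a well-chosen divisor the NNU run time matches the architectural program's RAM-access time, eliminating the idle window while quartering the heat generated per unit time.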
Figure 37 is a flowchart illustrating operation of the processor 100 of Figure 35. The flowchart describes operation similar in many respects to that described above with respect to Figures 35, 36A and 36B. Flow begins at block 3702.
At block 3702, the processor 100 executes MTNN instructions 1400 to write the weights into the weight RAM 124 and to write the data into the data RAM 122. Flow proceeds to block 3704.
At block 3704, the processor 100 executes an MTNN instruction 1400 that programs the relax indicator 3512 with a value that specifies a clock rate lower than the primary clock rate, i.e., that places the neural network unit 121 in relaxed mode. Flow proceeds to block 3706.
At block 3706, the processor 100 executes an MTNN instruction 1400 that instructs the neural network unit 121 to begin executing the neural network unit program, in a manner similar to that depicted in Figure 36B. Flow proceeds to block 3708.
At block 3708, the neural network unit 121 begins executing the neural network unit program. In parallel, the processor 100 executes MTNN instructions 1400 to write new weights into the weight RAM 124 (and perhaps new data into the data RAM 122), and/or executes MFNN instructions 1500 to read results from the data RAM 122 (and perhaps from the weight RAM 124). Flow proceeds to block 3712.
At block 3712, the processor 100 executes an MFNN instruction 1500 (e.g., to read the status register 127) to detect that the neural network unit 121 has finished running its program. Assuming the architectural program selected a good relax indicator 3512 value, the time the neural network unit 121 takes to run the neural network unit program will be approximately the same as the time the portion of the architectural program that accesses the weight RAM 124 and/or data RAM 122 takes to execute, as shown in Figure 36B. Flow proceeds to block 3714.
At block 3714, the processor 100 executes an MTNN instruction 1400 that programs the relax indicator 3512 with a value that specifies the primary clock rate, i.e., that places the neural network unit 121 in normal mode. Flow proceeds to block 3716.
At block 3716, the processor 100 executes an MTNN instruction 1400 that instructs the neural network unit 121 to begin executing the neural network unit program, in a manner similar to that depicted in Figure 36A. Flow proceeds to block 3718.
At block 3718, the neural network unit 121 begins executing the neural network unit program in normal mode. Flow ends at block 3718.
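The flow of Figure 37 can be sketched as driver-style pseudocode. This is a toy simulation under stated assumptions: the `SimNNU` class and its `mtnn_*` method names are hypothetical stand-ins for the MTNN architectural instruction, and the 1000-cycle program length is the illustrative figure from the earlier example.

```python
class SimNNU:
    """Toy stand-in for the NNU's architectural interface (hypothetical API)."""
    def __init__(self):
        self.relax_n = 1   # relax indicator 3512; 1 = normal mode
        self.log = []

    def mtnn_set_relax(self, n: int):
        self.relax_n = n
        self.log.append(("relax", n))

    def mtnn_start_program(self):
        # Relaxed mode stretches the run time by the divisor N.
        self.log.append(("run", 1000 * self.relax_n))

def layer_sequence(nnu: SimNNU):
    nnu.mtnn_set_relax(4)     # block 3704: enter relaxed mode (divide by 4)
    nnu.mtnn_start_program()  # block 3706: run stretched to ~4000 NNU cycles
    nnu.mtnn_set_relax(1)     # block 3714: return to normal mode
    nnu.mtnn_start_program()  # block 3716: run takes ~1000 NNU cycles
    return nnu.log

print(layer_sequence(SimNNU()))
# [('relax', 4), ('run', 4000), ('relax', 1), ('run', 1000)]
```

The overlapped RAM writes and the status-register polling of blocks 3708 and 3712 are omitted here for brevity.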
As described above, running the neural network unit program in relaxed mode, rather than in normal mode (i.e., at the primary clock rate of the processor), spreads out the execution time, which may avoid high temperatures. More specifically, when the neural network unit runs the program in relaxed mode, it generates heat at a lower rate, and that heat may be more favorably evacuated via the package of the neural network unit (e.g., the semiconductor device substrate, metal layers and underside) and the surrounding cooling solution (e.g., heat sinks, fans); consequently, the devices in the neural network unit (e.g., transistors, capacitors, wires) may operate at lower temperatures. Generally speaking, relaxed-mode operation may also help reduce device temperatures in other parts of the processor die. Lower operating temperatures, particularly lower junction temperatures of the devices, may reduce leakage current. Furthermore, because less current flows per unit time, inductive noise and IR-drop noise may be reduced. Still further, lower temperatures have a positive effect on the negative-bias temperature instability (NBTI) and positive-bias temperature instability (PBTI) of the MOSFETs of the processor, which may increase the reliability and/or lifetime of the devices and of the processor portion. Lower temperatures may also reduce Joule heating and electromigration in the metal layers of the processor.
Communication Mechanism Between Architectural Program and Non-Architectural Program Regarding Shared Resources of the Neural Network Unit
As described above, in the examples of Figures 24 through 28 and Figures 35 through 37, the data RAM 122 and weight RAM 124 are shared resources. Both the neural processing units 126 and the front end of the processor 100 share the data RAM 122 and weight RAM 124. More specifically, both the neural processing units 126 and the front end of the processor 100, e.g., the media registers 118, read from and write to the data RAM 122 and the weight RAM 124. Stated alternatively, the architectural program running on the processor 100 shares the data RAM 122 and weight RAM 124 with the neural network unit program running on the neural network unit 121, and in some cases, as described above, this requires control of the flow between the architectural program and the neural network unit program. The program memory 129 is also a shared resource to some degree, since the architectural program writes it and the sequencer 128 reads it. The embodiments described herein provide a high-performance solution for controlling the flow of access to the shared resources between the architectural program and the neural network unit program.
In the embodiments described herein, the neural network unit programs are also referred to as non-architectural programs, the neural network unit instructions are also referred to as non-architectural instructions, and the neural network unit instruction set (also referred to above as the neural processing unit instruction set) is also referred to as the non-architectural instruction set. The non-architectural instruction set is distinct from the architectural instruction set. In embodiments in which the processor 100 includes an instruction translator 104 that translates architectural instructions into microinstructions, the non-architectural instruction set is also distinct from the microinstruction set.
Figure 38 is a block diagram illustrating the sequencer 128 of the neural network unit 121 in more detail. The sequencer 128 provides the memory address to the program memory 129 to select the non-architectural instruction that is provided to the sequencer 128, as described above. As shown in Figure 38, the memory address is held in a program counter 3802 of the sequencer 128. The sequencer 128 generally increments sequentially through the addresses of the program memory 129 unless it encounters a non-architectural control instruction, such as a loop or branch instruction, in which case the sequencer 128 updates the program counter 3802 to the target address of the control instruction, i.e., to the address of the non-architectural instruction at the target of the control instruction. Thus, the address 131 held in the program counter 3802 specifies the address in the program memory 129 of the non-architectural instruction of the non-architectural program currently being fetched for execution by the neural processing units 126. The value of the program counter 3802 may be obtained by the architectural program via the neural network unit program counter field 3912 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to make decisions, based on the progress of the non-architectural program, about where to read/write data in the data RAM 122 and/or weight RAM 124.
The sequencer 128 also includes a loop counter 3804 that operates in conjunction with a non-architectural loop instruction, such as the loop-to-1 instruction at address 10 of Figure 26A and the loop-to-1 instruction at address 11 of Figure 28. In the examples of Figures 26A and 28, the loop counter 3804 is loaded with a value specified by the non-architectural initialize instruction at address 0, e.g., the value 400. Each time the sequencer 128 encounters the loop instruction and jumps to the target instruction (e.g., the multiply-accumulate instruction at address 1 of Figure 26A or the maxwacc instruction at address 1 of Figure 28), the sequencer 128 decrements the loop counter 3804. Once the loop counter 3804 reaches zero, the sequencer 128 proceeds to the next sequential non-architectural instruction. In an alternate embodiment, when a loop instruction is first encountered, the loop counter is loaded with a loop count value specified in the loop instruction itself, obviating the need to initialize the loop counter 3804 via a non-architectural initialize instruction. Thus, the value of the loop counter 3804 indicates how many more times the loop body of the non-architectural program will be executed. The value of the loop counter 3804 may be obtained by the architectural program via the loop count field 3914 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to make decisions, based on the progress of the non-architectural program, about where to read/write data in the data RAM 122 and/or weight RAM 124. In one embodiment, the sequencer includes three additional loop counters to accommodate nested loops in the non-architectural program, and the values of these three loop counters are also readable via the status register 127. A bit in the loop instruction indicates which of the four loop counters is used by the instant loop instruction.
The sequencer 128 also includes an iteration counter 3806. The iteration counter 3806 operates in conjunction with non-architectural instructions such as the multiply-accumulate instructions at address 2 of Figures 4, 9, 20 and 26A and the maxwacc instruction at address 2 of Figure 28, which will be referred to hereafter as "execute" instructions. In the above examples, the execute instructions specify respective iteration counts of 511, 511, 1023, 2 and 3. When the sequencer 128 encounters an execute instruction that specifies a non-zero iteration count, the sequencer 128 loads the iteration counter 3806 with the specified value. Additionally, the sequencer 128 generates an appropriate micro-operation 3418 to control the logic in the pipeline stages 3401 of the neural processing units 126 of Figure 34 for execution and decrements the iteration counter 3806. If the iteration counter 3806 is greater than zero, the sequencer 128 again generates an appropriate micro-operation 3418 to control the logic in the neural processing units 126 and decrements the iteration counter 3806. The sequencer 128 continues in this fashion until the iteration counter 3806 reaches zero. Thus, the value of the iteration counter 3806 indicates how many more times the operation specified in the non-architectural execute instruction (e.g., multiply-accumulate, maximum or sum of the accumulator with a data/weight word) remains to be performed. The value of the iteration counter 3806 may be obtained by the architectural program via the iteration count field 3916 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to make decisions, based on the progress of the non-architectural program, about where to read/write data in the data RAM 122 and/or weight RAM 124.
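The decrement semantics of the two counters can be sketched as a minimal software model (an illustrative sketch under the description above, not the patent's hardware; function names are hypothetical):

```python
def execute_instruction_ops(count: int) -> int:
    """Ops issued for one non-architectural execute instruction with COUNT=count."""
    iter_ctr = count      # sequencer loads iteration counter 3806
    ops = 0
    while iter_ctr > 0:
        ops += 1          # micro-op 3418 drives the NPU pipeline stages 3401
        iter_ctr -= 1     # sequencer decrements until the counter reaches zero
    return ops

def loop_body_runs(loop_init: int) -> int:
    """Times the loop body runs, given the initialize value (e.g., 400)."""
    runs, loop_ctr = 0, loop_init    # loop counter 3804
    while True:
        runs += 1                    # body executes once per pass
        if loop_ctr <= 1:            # counter exhausted: fall through the loop
            return runs
        loop_ctr -= 1                # decrement on each backward jump

print(execute_instruction_ops(511), loop_body_runs(400))  # 511 400
```

At any moment, the remaining values of `iter_ctr` and `loop_ctr` correspond to what the architectural program would observe in fields 3916 and 3914 of the status register 127.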
Figure 39 is a block diagram illustrating certain fields of the control and status register 127 of the neural network unit 121. The fields include the address 2602 of the weight RAM row most recently written by the neural processing units 126 executing the non-architectural program, the address 2604 of the weight RAM row most recently read by the neural processing units 126 executing the non-architectural program, the address 2606 of the data RAM row most recently written by the neural processing units 126 executing the non-architectural program, and the address 2608 of the data RAM row most recently read by the neural processing units 126 executing the non-architectural program, as described above with respect to Figure 26B. Additionally, the fields include a neural network unit program counter 3912 field, a loop counter 3914 field, and an iteration counter 3916 field. As described above, the architectural program may read the status register 127 into the media registers 118 and/or general purpose registers 116, e.g., via an MFNN instruction 1500 that reads the values of the neural network unit program counter 3912, loop counter 3914 and iteration counter 3916 fields. The value of the program counter field 3912 reflects the value of the program counter 3802 of Figure 38. The value of the loop counter field 3914 reflects the value of the loop counter 3804. The value of the iteration counter field 3916 reflects the value of the iteration counter 3806. In one embodiment, the sequencer 128 updates the program counter field 3912, loop counter field 3914 and iteration counter field 3916 each time it modifies the program counter 3802, loop counter 3804 or iteration counter 3806, so that the field values are current when the architectural program reads them. In an alternate embodiment, when the neural network unit 121 executes an architectural instruction that reads the status register 127, the neural network unit 121 simply obtains the current values of the program counter 3802, loop counter 3804 and iteration counter 3806 and provides them back in response to the architectural instruction (e.g., into a media register 118 or general purpose register 116).
As may be observed from the above, the values of the fields of the status register 127 of Figure 39 may be characterized as information about the progress the non-architectural program has made during its execution by the neural network unit. Certain specific aspects of the non-architectural program's progress — the program counter 3802 value, the loop counter 3804 value, the iteration counter 3806 value, the most recently read/written weight RAM 124 address 125 in fields 2602/2604, and the most recently read/written data RAM 122 address 123 in fields 2606/2608 — have been described in previous sections. The architectural program executing on the processor 100 may read the progress values of Figure 39 from the status register 127 and use the information to make decisions, for example by means of architectural instructions such as compare and branch instructions. For example, the architectural program decides which rows of the data RAM 122 and/or weight RAM 124 to read/write data/weights from/to in order to control the flow of data into and out of the data RAM 122 or weight RAM 124, particularly for large data sets and/or for overlapped execution of different non-architectural instructions. Examples of such decision-making by the architectural program are described in the sections before and after this one.
For example, as described above with respect to Figure 26A, the architectural program configures the non-architectural program to write the results of the convolutions back to rows of the data RAM 122 above the convolution kernel 2402 (above row 8), and the architectural program reads the results from the data RAM 122 as the neural network unit 121 writes them, using the address of the most recently written data RAM 122 row 2606.
For another example, as described above with respect to Figure 26B, the architectural program uses the information from the status register 127 fields of Figure 38 to determine the progress of the non-architectural program in performing the convolution on the data array 2404 of Figure 24 in 5 chunks of 512 x 1600. The architectural program writes the first 512 x 1600 chunk of the 2560 x 1600 data array into the weight RAM 124 and starts the non-architectural program with a loop count of 1600 and an initialized weight RAM 124 output row of 0. While the neural network unit 121 executes the non-architectural program, the architectural program reads the status register 127 to determine the most recently written weight RAM 124 row 2602 so that it may read the valid convolution results written by the non-architectural program and overwrite them with the next 512 x 1600 chunk once they have been read; in this way, as soon as the neural network unit 121 has completed execution of the non-architectural program on the first 512 x 1600 chunk, the processor 100 can immediately update the non-architectural program as needed and start it again to process the next 512 x 1600 chunk.
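The overlap between the architectural program's data movement and the non-architectural program's computation described above can be sketched as a simple producer/consumer loop. The sketch below is illustrative only, using a software stand-in for the weight RAM and the most-recently-written-row field; names such as `wr_ram` and `last_written_row`, and the "double each value" stand-in for convolution, are assumptions of this sketch and not part of the patent's interface.

```python
# Illustrative model: the "non-architectural program" writes result rows
# into a simulated weight RAM and records the most recently written row
# (cf. field 2602); the "architectural program" polls that field, harvests
# the valid rows, and lets the next chunk overwrite them.
ROWS = 8  # small stand-in for the weight RAM row count

def run_chunk(wr_ram, chunk):
    """Simulated non-architectural program: transform each input row
    (here simply doubling it as a stand-in for a convolution) and track
    the most recently written output row."""
    status = {"last_written_row": -1}  # cf. status register field 2602
    for r, row in enumerate(chunk):
        wr_ram[r] = [2 * x for x in row]
        status["last_written_row"] = r
    return status

def stream_chunks(chunks):
    wr_ram = [None] * ROWS
    results = []
    for chunk in chunks:
        status = run_chunk(wr_ram, chunk)
        # Architectural program: read status, harvest every valid row,
        # freeing those rows to be overwritten by the next chunk.
        for r in range(status["last_written_row"] + 1):
            results.append(wr_ram[r])
    return results

out = stream_chunks([[[1, 2], [3, 4]], [[5, 6]]])
```

The point of the sketch is only the control flow: the reader of the results trails the writer by polling a progress field, so the same memory rows can be reused chunk after chunk.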
For another example, assume the architectural program has the neural network unit 121 perform a series of classic neural network multiply-accumulate-activation-function computations in which the weights are stored in the weight RAM 124 and the results are written back to the data RAM 122. In this case, once the non-architectural program has read a row of the weight RAM 124, it will not read it again. Thus, once the current weights have been read/used by the non-architectural program, the architectural program may begin overwriting the weights in the weight RAM 124 with new weights for the next instance of the non-architectural program (e.g., for the next neural network layer). In this case, the architectural program reads the status register 127 to obtain the address of the most recently read weight RAM row 2604 in order to decide where in the weight RAM 124 it may write the new set of weights.
For another example, assume the architectural program knows that the non-architectural program includes an execute instruction with a large iteration count, such as the non-architectural multiply-accumulate instruction at address 2 of Figure 20. In such a case, the architectural program may need to know the iteration count 3916 in order to know approximately how many more clock cycles it will take to complete the non-architectural instruction, so that the architectural program can decide which of two or more courses of action to take next. For example, if the remaining time is long, the architectural program may relinquish control to another architectural program, such as the operating system. Similarly, assume the architectural program knows that the non-architectural program includes a loop body with a considerably large loop count, such as the non-architectural program of Figure 28. In such a case, the architectural program may need to know the loop count 3914 in order to know approximately how many more clock cycles it will take to complete the non-architectural program, so that it can decide which of two or more courses of action to take next.
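The yield-or-wait decision described above can be sketched in a few lines. Everything here is an assumption for illustration — the function name, the one-multiply-accumulate-per-clock figure, and the threshold are ours, not the patent's; the point is only that the counters read from the status register feed a remaining-cycles estimate.

```python
# Illustrative decision: estimate remaining clock cycles from the counter
# values read out of the status register (fields 3916 and 3914) and yield
# the processor if completion is far off.
CYCLES_PER_ITERATION = 1   # assumed: one multiply-accumulate per clock
YIELD_THRESHOLD = 10_000   # assumed: cycles the program will busy-wait

def should_yield(iteration_count, loop_count, iters_per_loop):
    """Rough remaining-work estimate: iterations left in the current
    execute instruction, plus full loop-body passes still to run."""
    remaining = (iteration_count + loop_count * iters_per_loop)
    return remaining * CYCLES_PER_ITERATION > YIELD_THRESHOLD

# Many loop passes remain -> hand control to, e.g., the operating system.
decision_long = should_yield(iteration_count=511, loop_count=400,
                             iters_per_loop=511)
# Final pass, nearly done -> keep polling.
decision_short = should_yield(iteration_count=511, loop_count=0,
                              iters_per_loop=511)
```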
For another example, assume the architectural program has the neural network unit 121 perform a pooling operation similar to that described with respect to Figures 27 and 28, in which the data to be pooled is stored in the weight RAM 124 and the results are written back to the weight RAM 124. However, unlike the example of Figures 27 and 28, assume the results of this example are written back to the top 400 rows of the weight RAM 124, e.g., rows 1600 to 1999. In this case, once the non-architectural program has read the four rows of weight RAM 124 data that it pools, it will not read them again. Thus, once the current four rows of data have been read/used by the non-architectural program, the architectural program may begin overwriting the weight RAM 124 data with new data (e.g., with the weights for the next instance of a non-architectural program, for example a non-architectural program that performs classic multiply-accumulate-activation-function operations on the pooled data). In this case, the architectural program reads the status register 127 to obtain the address of the most recently read weight RAM row 2604 in order to decide where in the weight RAM 124 it may write the new set of weights.
Recurrent neural network acceleration
A conventional feedforward neural network includes no memory of previous inputs to the network. Feedforward neural networks are generally used to perform tasks in which the multiple inputs presented to the network over time are independent of one another, as are the multiple outputs. In contrast, recurrent neural networks are generally helpful in performing tasks in which the sequence in which the inputs are presented to the neural network over time is significant. (The sequence elements are commonly referred to as time steps.) Accordingly, a recurrent neural network includes a notional memory, or internal state, that holds information produced by the calculations the network performed in response to previous inputs in the sequence, and the output of the recurrent neural network is a function of the internal state as well as the input of the next time step. Tasks such as speech recognition, language modeling, text generation, language translation, image caption generation, and certain forms of handwriting recognition are examples of tasks that recurrent neural networks tend to perform well.
Three well-known examples of recurrent neural networks are the Elman recurrent neural network, the Jordan recurrent neural network, and the long short-term memory (LSTM) neural network. An Elman recurrent neural network includes content nodes that remember the hidden layer state of the recurrent neural network for the current time step, which state serves as an input to the hidden layer for the next time step. A Jordan recurrent neural network is similar to an Elman recurrent neural network, except that its content nodes remember the output layer state of the recurrent neural network rather than the hidden layer state. An LSTM neural network includes an LSTM layer of LSTM cells. Each LSTM cell has a current state and a current output of a current time step, and a new state and a new output of a new, or subsequent, time step. An LSTM cell includes an input gate and an output gate, as well as a forget gate that can cause the neuron to lose its remembered state. These three types of recurrent neural network are described in more detail in the sections that follow.
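The Elman/Jordan distinction just described — whether the content nodes feed back the hidden layer state or the output layer state — can be shown with a minimal scalar model. This is an illustrative sketch only, not the neural network unit's implementation: the use of tanh, the scalar weights, and the unit content weight are assumptions made for the example.

```python
import math

def step_elman(x, content, w_in=0.5, w_out=0.5):
    """One Elman time step: the content node feeds back the *hidden* state."""
    hidden = math.tanh(w_in * x + content)  # content weight assumed to be 1
    output = math.tanh(w_out * hidden)
    return output, hidden                   # next content = hidden state

def step_jordan(x, content, w_in=0.5, w_out=0.5):
    """One Jordan time step: the content node feeds back the *output* state."""
    hidden = math.tanh(w_in * x + content)
    output = math.tanh(w_out * hidden)
    return output, output                   # next content = output state

# With zero initial content, step 1 is identical for both networks; they
# diverge at step 2 because they remembered different quantities.
ye1, ce = step_elman(1.0, 0.0)
yj1, cj = step_jordan(1.0, 0.0)
ye2, _ = step_elman(1.0, ce)
yj2, _ = step_jordan(1.0, cj)
```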
As described herein, in the case of a recurrent neural network such as an Elman or Jordan recurrent neural network, each execution by the neural network unit takes a time step, takes a set of input layer node values, and performs the calculations necessary to propagate them through the recurrent neural network in order to produce the output layer node values as well as the hidden layer and content layer node values. Thus, the input layer node values are associated with the time step in which the hidden, output and content layer node values are calculated; and the hidden, output and content layer node values are associated with the time step in which they are produced. Input layer node values are sampled values of the system being modeled by the recurrent neural network, e.g., images, speech samples, snapshots of financial market data. In the case of an LSTM neural network, each execution by the neural network unit takes a time step, takes a set of cell input values, and performs the calculations necessary to produce the cell output values (as well as the cell state and the input gate, forget gate and output gate values), which may also be understood as propagating the cell input values through the LSTM layer cells. Thus, the cell input values are associated with the time step in which the cell state and the input gate, forget gate and output gate values are calculated; and the cell state and the input gate, forget gate and output gate values are associated with the time step in which they are produced.
Content layer node values, also referred to as state nodes, are state values of the neural network — state values that are based on the input layer node values associated with previous time steps, not merely the input layer node values associated with the current time step. The calculations the neural network unit performs for a time step (e.g., the hidden layer node value calculations for an Elman or Jordan recurrent neural network) are a function of the content layer node values produced in the previous time step. Therefore, the network state value at the beginning of a time step (the content node values) influences the output layer node values produced during that time step. Moreover, the network state value at the end of the time step is affected by both the input node values of the time step and the network state value at the beginning of the time step. Similarly, in the case of an LSTM cell, the cell state value is based on the cell input values associated with previous time steps, not merely the cell input value associated with the current time step. Because the calculations the neural network unit performs for a time step (e.g., the calculation of the next cell state) are a function of the cell state values produced in the previous time step, the network state value at the beginning of the time step (the cell state values) influences the cell output values produced during the time step, and the network state value at the end of the time step is affected by the cell input values of the time step and the previous network state.
Figure 40 is a block diagram illustrating an example of an Elman recurrent neural network. The Elman recurrent neural network of Figure 40 includes input layer nodes, or neurons, denoted D0, D1 through Dn, referred to collectively as the input layer nodes D and individually generically as an input layer node D; hidden layer nodes/neurons, denoted Z0, Z1 through Zn, referred to collectively as the hidden layer nodes Z and individually generically as a hidden layer node Z; output layer nodes/neurons, denoted Y0, Y1 through Yn, referred to collectively as the output layer nodes Y and individually generically as an output layer node Y; and content layer nodes/neurons, denoted C0, C1 through Cn, referred to collectively as the content layer nodes C and individually generically as a content layer node C. In the example Elman recurrent neural network of Figure 40, each hidden layer node Z has an input connected to the output of each input layer node D and an input connected to the output of each content layer node C; each output layer node Y has an input connected to the output of each hidden layer node Z; and each content layer node C has an input connected to the output of a corresponding hidden layer node Z.
In many respects, an Elman recurrent neural network operates similarly to a conventional feedforward artificial neural network. That is, for a given node, each input connection of the node has an associated weight; the value received by the node on an input connection is multiplied by the associated weight to produce a product; the node adds the products associated with all of its input connections to produce a sum (which may also include a bias term); and, generally, an activation function is performed on the sum to produce the node's output value, sometimes referred to as the node's activation. For a conventional feedforward network, the data always flows in the direction from the input layer to the output layer. That is, the input layer provides values to the hidden layer (of which there are generally multiple), the hidden layers produce their output values that are provided to the output layer, and the output layer produces the output that may be taken.
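The per-node computation just described — a weighted sum of the inputs, an optional bias term, then an activation function — can be written down in a few lines. This is a generic sketch, not specific to the neural network unit's data types or its particular activation functions.

```python
import math

def neuron(inputs, weights, bias=0.0, activation=math.tanh):
    """Weighted sum of the inputs plus a bias, passed through an
    activation function. The running sum corresponds to an accumulator
    value; the activation to the node's output (its 'activation')."""
    acc = bias
    for value, weight in zip(inputs, weights):
        acc += value * weight  # one multiply-accumulate per input connection
    return activation(acc)

# An identity "activation" mirrors a pass-through of the accumulated sum.
y = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.125], activation=lambda a: a)
```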
However, unlike a conventional feedforward network, an Elman recurrent neural network additionally includes feedback connections, namely the connections from the hidden layer nodes Z to the content layer nodes C in Figure 40. The Elman recurrent neural network operates as follows: when the input layer nodes D provide input values to the hidden layer nodes Z in a new time step, the content nodes C provide to the hidden layer Z values that were the hidden layer node Z output values produced in response to the previous input, i.e., in the previous time step. In this sense, the content nodes C of an Elman recurrent neural network are a memory based on the input values of previous time steps. Figures 41 and 42 illustrate an embodiment of the operation of the neural network unit 121 performing the calculations associated with the Elman recurrent neural network of Figure 40.
For purposes of the present invention, an Elman recurrent neural network is a recurrent neural network comprising at least an input node layer, a hidden node layer, an output node layer, and a content node layer. For a given time step, the content node layer stores the results that the hidden node layer produced in the previous time step and fed back to the content node layer. The results fed back to the content layer may be the results of the activation function, or the results of the accumulation performed by the hidden node layer without an activation function having been performed.
Figure 41 is a block diagram illustrating an example of the layout of data within the data RAM 122 and the weight RAM 124 of the neural network unit 121 as it performs the calculations associated with the Elman recurrent neural network of Figure 40. The example of Figure 41 assumes that the Elman recurrent neural network of Figure 40 has 512 input nodes D, 512 hidden nodes Z, 512 content nodes C, and 512 output nodes Y. It is also assumed that the Elman recurrent neural network is fully connected, i.e., all 512 input nodes D are connected as inputs to each hidden node Z, all 512 content nodes C are connected as inputs to each hidden node Z, and all 512 hidden nodes Z are connected as inputs to each output node Y. Additionally, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. Finally, the example assumes that the weights associated with the connections from the content nodes C to the hidden nodes Z all have a value of 1, so that there is no need to store these unity weight values.
As shown, the lower 512 rows of the weight RAM 124 (rows 0 through 511) hold the weight values associated with the connections between the input nodes D and the hidden nodes Z. More specifically, as shown, row 0 holds the weights associated with the input connections from input node D0 to the hidden nodes Z; that is, word 0 holds the weight associated with the connection between input node D0 and hidden node Z0, word 1 holds the weight associated with the connection between input node D0 and hidden node Z1, word 2 holds the weight associated with the connection between input node D0 and hidden node Z2, and so forth, through word 511, which holds the weight associated with the connection between input node D0 and hidden node Z511; row 1 holds the weights associated with the input connections from input node D1 to the hidden nodes Z; that is, word 0 holds the weight associated with the connection between input node D1 and hidden node Z0, word 1 holds the weight associated with the connection between input node D1 and hidden node Z1, word 2 holds the weight associated with the connection between input node D1 and hidden node Z2, and so forth, through word 511, which holds the weight associated with the connection between input node D1 and hidden node Z511; and so on through row 511, which holds the weights associated with the input connections from input node D511 to the hidden nodes Z; that is, word 0 holds the weight associated with the connection between input node D511 and hidden node Z0, word 1 holds the weight associated with the connection between input node D511 and hidden node Z1, word 2 holds the weight associated with the connection between input node D511 and hidden node Z2, and so forth, through word 511, which holds the weight associated with the connection between input node D511 and hidden node Z511. This layout and use is similar to that of the embodiments described above with respect to Figures 4 through 6A.
As shown, the subsequent 512 rows of the weight RAM 124 (rows 512 through 1023) similarly hold the weights associated with the connections between the hidden nodes Z and the output nodes Y.
The data RAM 122 holds the Elman recurrent neural network node values for a series of time steps. More specifically, the data RAM 122 holds the node values for a given time step in a group of three rows. As shown, in the case of a data RAM 122 having 64 rows, the data RAM 122 can hold the node values for 20 different time steps. In the example of Figure 41, rows 0 through 2 hold the node values for time step 0, rows 3 through 5 hold the node values for time step 1, and so forth, through rows 57 through 59, which hold the node values for time step 19. The first row of each group holds the input node D values of the time step. The second row of each group holds the hidden node Z values of the time step. The third row of each group holds the output node Y values of the time step. As shown, each column of the data RAM 122 holds the node values for its corresponding neuron, or neural processing unit 126. That is, column 0 holds the node values associated with nodes D0, Z0 and Y0, whose calculations are performed by neural processing unit 0; column 1 holds the node values associated with nodes D1, Z1 and Y1, whose calculations are performed by neural processing unit 1; and so forth, through column 511, which holds the node values associated with nodes D511, Z511 and Y511, whose calculations are performed by neural processing unit 511, as described in more detail below with respect to Figure 42.
As indicated in Figure 41, for a given time step, the hidden node Z values in the second row of each group of three rows serve as the content node C values for the next time step. That is, the node Z values that a neural processing unit 126 computes and writes during a time step become the node C values that the neural processing unit 126 uses (along with the input node D values of the next time step) to compute the node Z values during the next time step. The initial values of the content nodes C (the node C values used to compute the node Z values of row 1 in time step 0) are assumed to be zero. This is described in more detail in the subsequent sections related to the non-architectural program of Figure 42.
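The row-triplet layout just described can be captured by a few index helpers: time step t occupies data RAM rows 3t through 3t+2 (D, Z, Y), and the Z row of step t is read back as the content-node C values of step t+1. The function names below are ours, introduced only for illustration.

```python
# Illustrative helpers for the Figure 41 data RAM layout.
def input_row(t):   return 3 * t        # node D values of time step t
def hidden_row(t):  return 3 * t + 1    # node Z values of time step t
def output_row(t):  return 3 * t + 2    # node Y values of time step t

def content_source_row(t):
    """Row holding the content node C values consumed at time step t,
    i.e., the Z row of step t-1; step 0 has no source row because its
    C values are taken to be zero."""
    return hidden_row(t - 1) if t > 0 else None
```

With 64 rows, time steps 0 through 19 fit (rows 0 through 59), matching the figure: rows 0 through 2 for step 0, rows 57 through 59 for step 19.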
Preferably, the input node D values (the values of rows 0, 3, and so forth through row 57 in the example of Figure 41) are written/populated into the data RAM 122 by the architectural program executing on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program executing on the neural network unit 121, such as the non-architectural program of Figure 42. Conversely, the hidden/output node Z/Y values (the values of rows 1 and 2, 4 and 5, and so forth through rows 58 and 59 in the example of Figure 41) are written/populated into the data RAM 122 by the non-architectural program executing on the neural network unit 121, and are read/used by the architectural program executing on the processor 100 via MFNN instructions 1500. The example of Figure 41 assumes the architectural program performs the following steps: (1) for 20 different time steps, populates the data RAM 122 with the input node D values (rows 0, 3, and so forth through row 57); (2) starts the non-architectural program of Figure 42; (3) detects whether the non-architectural program has completed; (4) reads the output node Y values from the data RAM 122 (rows 2, 5, and so forth through row 59); and (5) repeats steps (1) through (4) as many times as needed to complete the task, e.g., the calculations needed to recognize the speech of a mobile phone user.
In an alternative approach, the architectural program performs the following steps: (1) for a single time step, populates the data RAM 122 with the input node D values (e.g., row 0); (2) starts the non-architectural program (a modified version of the Figure 42 non-architectural program that does not loop and accesses only a single group of three data RAM 122 rows); (3) detects whether the non-architectural program has completed; (4) reads the output node Y values from the data RAM 122 (e.g., row 2); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is preferable may depend on the manner in which the input values of the recurrent neural network are sampled. For example, if the task allows the input to be sampled over multiple time steps (e.g., on the order of 20 time steps) before the calculations are performed, the first approach may be preferable, since it likely yields more computational resource efficiency and/or better performance; however, if the task only allows sampling in a single time step, the second approach is required.
A third embodiment is similar to the second approach but, rather than using a single group of three data RAM 122 rows, the non-architectural program of this approach uses multiple groups of three rows, i.e., a different group of three rows for each time step, in this respect similar to the first approach. In this third embodiment, preferably the architectural program includes a step before step (2) in which it updates the non-architectural program before starting it, e.g., by updating the data RAM 122 row in the instruction at address 1 to point to the next group of three rows.
Figure 42 is a table illustrating a program stored in the program memory 129 of the neural network unit 121, which program is executed by the neural network unit 121, using the data and weights according to the layout of Figure 41, to accomplish an Elman recurrent neural network. Some of the instructions in the non-architectural program of Figure 42 (and of Figures 45, 48, 51, 54 and 57) are described in detail above (e.g., the multiply-accumulate (MULT-ACCUM), loop (LOOP) and initialize (INITIALIZE) instructions), and the following paragraphs assume these instructions are consistent with the preceding descriptions, unless otherwise noted.
The example program of Figure 42 contains 13 non-architectural instructions, at addresses 0 through 12, respectively. The instruction at address 0 (INITIALIZE NPU, LOOPCNT=20) clears the accumulator 202 and initializes the loop counter 3804 to a value of 20 in order to execute the loop body (the instructions at addresses 4 through 11) 20 times. Preferably, the initialize instruction also puts the neural network unit 121 into a wide configuration, such that the neural network unit 121 is configured as 512 neural processing units 126. As described in the sections that follow, the 512 neural processing units 126 operate as the 512 corresponding hidden layer nodes Z during the execution of the instructions at addresses 1 through 3 and 7 through 11, and operate as the 512 corresponding output layer nodes Y during the execution of the instructions at addresses 4 through 6.
The instructions at addresses 1 through 3 are outside the program loop and execute only once. They compute the initial values of the hidden layer nodes Z and write them to row 1 of the data RAM 122 for use by the first execution of the instructions at addresses 4 through 6, which compute the output layer node Y values for the first time step (time step 0). Additionally, the hidden layer node Z values computed by the instructions at addresses 1 through 3 and written to row 1 of the data RAM 122 become the content layer node C values used by the first execution of the instructions at addresses 7 and 8 to compute the hidden layer node Z values for the second time step (time step 1).
During the execution of the instructions at addresses 1 and 2, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in row 0 of the data RAM 122 by the weights in that neural processing unit's 126 corresponding column of rows 0 through 511 of the weight RAM 124, to generate 512 products that are accumulated into the accumulator 202 of the respective neural processing unit 126. During the execution of the instruction at address 3, the values of the 512 accumulators 202 of the 512 neural processing units are passed through and written to row 1 of the data RAM 122. That is, the output instruction at address 3 writes the accumulator 202 value of each of the 512 neural processing units to row 1 of the data RAM 122, these values being the initial hidden layer Z values; the instruction then clears the accumulator 202.
The operations performed by the instructions at addresses 1 through 2 of the non-architectural program of Figure 42 are similar to those performed by the instructions at addresses 1 through 2 of the non-architectural program of Figure 4. More specifically, the instruction at address 1 (MULT_ACCUM DR ROW 0) instructs each of the 512 neural processing units 126 to read its corresponding word of row 0 of the data RAM 122 into its mux-register 208, to read its corresponding word of row 0 of the weight RAM 124 into its mux-register 705, to multiply the data word by the weight word to generate a product, and to add the product to the accumulator 202. The instruction at address 2 (MULT-ACCUM ROTATE, WR ROW+1, COUNT=511) instructs each of the 512 neural processing units 126 to rotate into its mux-register 208 the word from the adjacent neural processing unit 126 (using the 512-word rotator collectively formed by the 512 mux-registers 208 of the neural network unit 121, which are the registers into which the instruction at address 1 read the row of the data RAM 122), to read its corresponding word of the next row of the weight RAM 124 into its mux-register 705, to multiply the data word by the weight word to generate a product and add the product to the accumulator 202, and to perform this operation 511 times.
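The rotate-and-accumulate pattern of the address 1 and 2 instructions can be modeled in software: N mux-registers each start with one word of the data row, and on each of N-1 further steps every unit takes its neighbor's word while stepping to the next weight row. The sketch below uses N = 4 instead of 512 and checks the result against an ordinary matrix-vector product; the way the stored weight rows are pre-arranged to match the rotation direction is an assumption of this sketch, not a description of the patent's exact weight layout.

```python
# Software model of the rotator formed by the mux-registers, scaled down
# to N = 4 units. Each unit owns an accumulator; on every step each unit
# takes its neighbor's data word and a word from the next weight row.
N = 4
data = [1.0, 2.0, 3.0, 4.0]                    # one data RAM row
W = [[(i + 1) * (j + 1) for j in range(N)]     # logical weight matrix:
     for i in range(N)]                        # W[i][j] = input i -> unit j

# Assumed pre-arrangement: at step s, unit j must see the weight for the
# data word it currently holds, which is data[(j - s) % N].
stored = [[W[(j - s) % N][j] for j in range(N)] for s in range(N)]

acc = [0.0] * N                                # per-unit accumulators
mux = list(data)                               # mux-registers, loaded once
for s in range(N):                             # 1 load step + N-1 rotations
    for j in range(N):
        acc[j] += mux[j] * stored[s][j]        # multiply-accumulate
    mux = [mux[(j - 1) % N] for j in range(N)] # rotate: take neighbor's word

# Every unit ends up with the full dot product for its column.
reference = [sum(data[i] * W[i][j] for i in range(N)) for j in range(N)]
```

The design point the model illustrates: each data word is loaded from memory once and then circulated, so N units share one row read instead of issuing N reads each.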
Furthermore, the single non-architectural output instruction at address 3 of Figure 42 (OUTPUT PASSTHRU, DR OUT ROW 1, CLR ACC) merges the operations of the activation function instruction and of the write output instruction at addresses 3 and 4 of Figure 4 (although the program of Figure 42 passes through the accumulator 202 value, whereas the program of Figure 4 performs an activation function on the accumulator 202 value). That is, in the program of Figure 42, the activation function, if any, performed on the accumulator 202 value is specified in the output instruction (as it also is in the output instructions at addresses 6 and 11), rather than being specified in a distinct non-architectural activation function instruction as in the program of Figure 4. Another embodiment of the non-architectural programs of Figure 4 (and of Figures 20, 26A and 28), in which the operations of the activation function instruction and the write output instruction (e.g., addresses 3 and 4 of Figure 4) are merged into a single non-architectural output instruction as shown in Figure 42, is also within the scope of the present invention. The example of Figure 42 assumes that the nodes of the hidden layer (Z) perform no activation function on the accumulator values. However, embodiments in which the hidden layer (Z) performs an activation function on the accumulator values are also within the scope of the present invention, in which case the instructions at addresses 3 and 11 would do so, e.g., a sigmoid, tanh or rectify function.
Whereas the instructions at addresses 1 through 3 are executed only once, the instructions at addresses 4 through 11 lie within the program loop and are executed a number of times specified by the loop count (e.g., 20). The first 19 executions of the instructions at addresses 7 through 11 compute the values of the hidden layer nodes Z and write them to the data RAM 122 for use by the second through twentieth executions of the instructions at addresses 4 through 6 to compute the output layer nodes Y of the remaining time steps (time steps 1 through 19). (The last/twentieth execution of the instructions at addresses 7 through 11 computes the hidden layer node Z values and writes them to row 61 of the data RAM 122; however, these values are not used.)
In the first execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW+1, WR ROW 512 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), corresponding to time step 0, each of the 512 NPUs 126 performs 512 multiply operations, multiplying the 512 hidden node Z values of row 1 of the data RAM 122 (which were produced and written by the single execution of the instructions at addresses 1 through 3) by the corresponding weight of that NPU 126 from rows 512 through 1023 of the weight RAM 124, to produce 512 products that are accumulated into the accumulator 202 of the corresponding NPU 126. In the first execution of the instruction at address 6 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW+1, CLR ACC), an activation function (e.g., sigmoid, tanh, rectify) is applied to the 512 accumulated values to compute the output layer node Y values, and the results are written to row 2 of the data RAM 122.
In the second execution of the instructions at addresses 4 and 5 (corresponding to time step 1), each of the 512 NPUs 126 performs 512 multiply operations, multiplying the 512 hidden node Z values of row 4 of the data RAM 122 (which were produced and written by the first execution of the instructions at addresses 7 through 11) by the corresponding weight of that NPU 126 from rows 512 through 1023 of the weight RAM 124, to produce 512 products that are accumulated into the accumulator 202 of the corresponding NPU 126; and in the second execution of the instruction at address 6, an activation function is applied to the 512 accumulated values to compute the output layer node Y values, and the results are written to row 5 of the data RAM 122. In the third execution of the instructions at addresses 4 and 5 (corresponding to time step 2), each of the 512 NPUs 126 performs 512 multiply operations, multiplying the 512 hidden node Z values of row 7 of the data RAM 122 (produced and written by the second execution of the instructions at addresses 7 through 11) by the corresponding weight of that NPU 126 from rows 512 through 1023 of the weight RAM 124, to produce 512 products that are accumulated into the accumulator 202 of the corresponding NPU 126; and in the third execution of the instruction at address 6, an activation function is applied to the 512 accumulated values to compute the output layer node Y values, and the results are written to row 8 of the data RAM 122. And so forth until, in the twentieth execution of the instructions at addresses 4 and 5 (corresponding to time step 19), each of the 512 NPUs 126 performs 512 multiply operations, multiplying the 512 hidden node Z values of row 58 of the data RAM 122 (produced and written by the nineteenth execution of the instructions at addresses 7 through 11) by the corresponding weight of that NPU 126 from rows 512 through 1023 of the weight RAM 124, to produce 512 products that are accumulated into the accumulator 202 of the corresponding NPU 126; and in the twentieth execution of the instruction at address 6, an activation function is applied to the 512 accumulated values to compute the output layer node Y values, and the results are written to row 59 of the data RAM 122.
In the first execution of the instructions at addresses 7 and 8, each of the 512 NPUs 126 adds the 512 context node C values of row 1 of the data RAM 122 to its accumulator 202; these values were produced by the single execution of the instructions at addresses 1 through 3. More specifically, the instruction at address 7 (ADD_D_ACC DR ROW+0) instructs each of the 512 NPUs 126 to read the corresponding word of the current row of the data RAM 122 (row 1 during the first execution) into its mux-reg 208 and to add that word to the accumulator 202. The instruction at address 8 (ADD_D_ACC ROTATE, COUNT=511) instructs each of the 512 NPUs 126 to rotate the word from the adjacent NPU 126 into its mux-reg 208 (using the 512-word rotator collectively formed by the 512 mux-regs 208 of the NNU 121, i.e., the registers into which the instruction at address 7 read the row of the data RAM 122), to add that word to the accumulator 202, and to perform this operation 511 times.
In the second execution of the instructions at addresses 7 and 8, each of the 512 NPUs 126 adds the 512 context node C values of row 4 of the data RAM 122 to its accumulator 202; these values were produced and written by the first execution of the instructions at addresses 9 through 11. In the third execution of the instructions at addresses 7 and 8, each of the 512 NPUs 126 adds the 512 context node C values of row 7 of the data RAM 122 to its accumulator 202; these values were produced and written by the second execution of the instructions at addresses 9 through 11. And so forth until, in the twentieth execution of the instructions at addresses 7 and 8, each of the 512 NPUs 126 adds the 512 context node C values of row 58 of the data RAM 122 to its accumulator 202; these values were produced and written by the nineteenth execution of the instructions at addresses 9 through 11.
As noted above, the example of Figure 42 assumes that the weights associated with the connections from the context nodes C to the hidden layer nodes Z have a value of one. However, in another embodiment, these connections of the Elman RNN have non-zero weight values that are placed in the weight RAM 124 (e.g., rows 1024 through 1535) before the program of Figure 42 runs; in that case, the program instruction at address 7 is MULT-ACCUM DR ROW+0, WR ROW 1024, and the program instruction at address 8 is MULT-ACCUM ROTATE, WR ROW+1, COUNT=511. Preferably, the instruction at address 8 does not access the weight RAM 124, but instead rotates the values that the instruction at address 7 read from the weight RAM 124 into the mux-regs 705. Not accessing the weight RAM 124 during the 511 clock cycles in which the instruction at address 8 executes leaves more bandwidth for the architectural program to access the weight RAM 124.
In the first execution of the instructions at addresses 9 and 10 (MULT-ACCUM DR ROW+2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), corresponding to time step 1, each of the 512 NPUs 126 performs 512 multiply operations, multiplying the 512 input node D values of row 3 of the data RAM 122 by the corresponding weight of that NPU 126 from rows 0 through 511 of the weight RAM 124 to produce 512 products that, together with the accumulation performed by the instructions at addresses 7 and 8 on the 512 context node C values, are accumulated into the accumulator 202 of the corresponding NPU 126 to compute the hidden layer node Z values; in the first execution of the instruction at address 11 (OUTPUT PASSTHRU, DR OUT ROW+2, CLR ACC), the 512 accumulator 202 values of the 512 NPUs 126 are passed through and written to row 4 of the data RAM 122, and the accumulators 202 are cleared. In the second execution of the instructions at addresses 9 and 10 (corresponding to time step 2), each of the 512 NPUs 126 performs 512 multiply operations, multiplying the 512 input node D values of row 6 of the data RAM 122 by the corresponding weight of that NPU 126 from rows 0 through 511 of the weight RAM 124 to produce 512 products that, together with the accumulation performed by the instructions at addresses 7 and 8 on the 512 context node C values, are accumulated into the accumulator 202 of the corresponding NPU 126 to compute the hidden layer node Z values; in the second execution of the instruction at address 11, the 512 accumulator 202 values of the 512 NPUs 126 are passed through and written to row 7 of the data RAM 122, and the accumulators 202 are then cleared. And so forth until, in the nineteenth execution of the instructions at addresses 9 and 10 (corresponding to time step 19), each of the 512 NPUs 126 performs 512 multiply operations, multiplying the 512 input node D values of row 57 of the data RAM 122 by the corresponding weight of that NPU 126 from rows 0 through 511 of the weight RAM 124 to produce 512 products that, together with the accumulation performed by the instructions at addresses 7 and 8 on the 512 context node C values, are accumulated into the accumulator 202 of the corresponding NPU 126 to compute the hidden layer node Z values; and in the nineteenth execution of the instruction at address 11, the 512 accumulator 202 values of the 512 NPUs 126 are passed through and written to row 58 of the data RAM 122, and the accumulators 202 are then cleared. As noted above, the hidden layer node Z values produced and written by the twentieth execution of the instructions at addresses 9 and 10 are not used.
The instruction at address 12 (LOOP 4) decrements the loop counter 3804 and, if the new loop counter 3804 value is greater than zero, loops back to the instruction at address 4.
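Under the Figure 42 example's assumptions (context-to-hidden weights of one, no activation on the hidden layer Z, an activation function on the output layer Y), the overall per-time-step computation that the loop implements can be sketched as follows; `elman_run`, the sigmoid choice, and the plain-list matrix representation are illustrative, not the patent's notation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):  # W given as a list of rows
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def elman_run(D_seq, W_in, W_hid, n_steps):
    """Mirror of the Figure 42 program structure: addresses 1-3 form the
    initial Z; each loop pass computes Y_t = act(W_hid . Z_t) (addresses
    4-6), then Z_{t+1} = C + W_in . D_{t+1} with C = Z_t (addresses 7-11)."""
    Z = matvec(W_in, D_seq[0])  # initial hidden values (context assumed 0)
    outputs = []
    for t in range(n_steps):
        outputs.append([sigmoid(a) for a in matvec(W_hid, Z)])
        if t + 1 < n_steps:
            Z = [c + a for c, a in zip(Z, matvec(W_in, D_seq[t + 1]))]
    return outputs
```

The point of the sketch is the data dependence: each output Y uses the Z values written in the previous loop pass, which is why the program interleaves the Y computation (addresses 4 through 6) with the next-step Z computation (addresses 7 through 11) inside one loop body.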
Figure 43 is a block diagram illustrating an example of a Jordan RNN. The Jordan RNN of Figure 43 is similar to the Elman RNN of Figure 40, having input layer nodes/neurons D, hidden layer nodes/neurons Z, output layer nodes/neurons Y, and context layer nodes/neurons C. However, in the Jordan RNN of Figure 43, the context layer nodes C have their input connections fed back from the outputs of their corresponding output layer nodes Y, rather than from the outputs of the hidden layer nodes Z as in the Elman RNN of Figure 40.
For purposes of the present invention, a Jordan RNN is a recurrent neural network comprising at least an input node layer, a hidden node layer, an output node layer and a context node layer. At the beginning of a given time step, the context node layer holds the results that the output node layer produced in the previous time step and fed back to the context node layer. The results fed back to the context layer may be the results of an activation function or the results of the accumulations performed by the output node layer without an activation function applied.
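The distinction just drawn, feedback taken from the output layer (optionally before the activation function) rather than from the hidden layer, can be summarized in a small sketch; the function and its list-based arguments are illustrative assumptions, with context-to-hidden weights of one as in the Figure 44 example below:

```python
def jordan_step(D, C, W_in, W_out, act):
    """One Jordan time step: the hidden nodes Z sum the weighted inputs D
    and the context values C; Y = act(W_out . Z); here the pre-activation
    accumulation is what feeds back as the next step's context C."""
    matvec = lambda W, x: [sum(w * v for w, v in zip(row, x)) for row in W]
    Z = [c + s for c, s in zip(C, matvec(W_in, D))]
    acc = matvec(W_out, Z)   # output-layer accumulations
    Y = [act(a) for a in acc]
    return Y, acc            # acc becomes the next time step's C
```

Returning the accumulation rather than Y reflects the first variant described above; an embodiment that feeds back the activated outputs would return Y in its place.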
Figure 44 is a block diagram illustrating an example of the layout of data within the data RAM 122 and the weight RAM 124 of the NNU 121 as it performs the calculations associated with the Jordan RNN of Figure 43. The example of Figure 44 assumes that the Jordan RNN of Figure 43 has 512 input nodes D, 512 hidden nodes Z, 512 context nodes C and 512 output nodes Y. Further, it is assumed that the Jordan RNN is fully connected, i.e., all 512 input nodes D connect to each hidden node Z as inputs, all 512 context nodes C connect to each hidden node Z as inputs, and all 512 hidden nodes Z connect to each output node Y as inputs. Although the example Jordan RNN of Figure 44 applies an activation function to the accumulator 202 values to produce the output layer node Y values, the example assumes that the pre-activation accumulator 202 values, rather than the actual output node Y values, are passed to the context layer nodes C. Furthermore, the NNU 121 is configured with 512 NPUs 126, or neurons, e.g., in a wide configuration. Finally, the example assumes that the weights associated with the connections from the context nodes C to the hidden nodes Z all have a value of one; hence, there is no need to store these unity weight values.
As in the example of Figure 41, as shown, the lower 512 rows (rows 0 through 511) of the weight RAM 124 hold the weight values associated with the connections between the input nodes D and the hidden nodes Z, and the next 512 rows (rows 512 through 1023) of the weight RAM 124 hold the weight values associated with the connections between the hidden nodes Z and the output nodes Y.
The data RAM 122 holds the Jordan RNN node values for a series of time steps, similar to the example of Figure 41; however, in the example of Figure 44, the node values for a given time step are held in a group of four rows. As shown, in an embodiment in which the data RAM 122 has 64 rows, the data RAM 122 can hold the node values needed for 16 different time steps. In the example of Figure 44, rows 0 through 3 hold the node values for time step 0, rows 4 through 7 hold the node values for time step 1, and so forth through rows 60 through 63, which hold the node values for time step 15. The first row of a quadruple-row group holds the input node D values of the time step; the second row holds the hidden node Z values of the time step; the third row holds the context node C values of the time step; and the fourth row holds the output node Y values of the time step. As shown, each column of the data RAM 122 holds the node values for its corresponding neuron, or NPU 126. That is, column 0 holds the node values associated with nodes D0, Z0, C0 and Y0, whose calculations are performed by NPU 0; column 1 holds the node values associated with nodes D1, Z1, C1 and Y1, whose calculations are performed by NPU 1; and so forth, such that column 511 holds the node values associated with nodes D511, Z511, C511 and Y511, whose calculations are performed by NPU 511. This is described in more detail below.
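The quadruple-row addressing just described is simple enough to capture in a hypothetical helper; the dictionary keys are illustrative, not the patent's nomenclature:

```python
def jordan_dram_rows(t):
    """Data RAM rows holding the D, Z, C and Y node values of time step t
    in the Figure 44 layout (four rows per time step)."""
    base = 4 * t
    return {'D': base, 'Z': base + 1, 'C': base + 2, 'Y': base + 3}
```

For example, time step 0 maps to rows 0 through 3 and time step 15 maps to rows 60 through 63, matching the layout above.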
In Figure 44, the context node C values for a given time step are produced in that time step and serve as inputs in the next time step. That is, the C node values that the NPUs 126 compute and write during a given time step become the C node values that the NPUs 126 use (along with the input node D values of that next time step) to compute the Z node values during the next time step. The initial value of the context nodes C (i.e., the C values used in time step 0 to compute the Z values of row 1) is assumed to be zero. This is described in more detail below with respect to the non-architectural program of Figure 45.
As described above with respect to Figure 41, preferably the input node D values (the values of rows 0, 4 and so forth through row 60 in the example of Figure 44) are written/populated into the data RAM 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the NNU 121, such as the non-architectural program of Figure 45. Conversely, the hidden node Z/context node C/output node Y values (the values of rows 1/2/3, 5/6/7 and so forth through rows 61/62/63, respectively, in the example of Figure 44) are written/populated into the data RAM 122 by the non-architectural program running on the NNU 121, and are read/used by the architectural program running on the processor 100 via MFNN instructions 1500. The example of Figure 44 assumes the architectural program performs the following steps: (1) populates the data RAM 122 with the input node D values for 16 different time steps (rows 0, 4 and so forth through row 60); (2) starts the non-architectural program of Figure 45; (3) detects that the non-architectural program has finished; (4) reads the output node Y values from the data RAM 122 (rows 3, 7 and so forth through row 63); and (5) repeats steps (1) through (4) as many times as needed to complete a task, such as the computations needed to recognize an utterance of a mobile phone user.
In an alternative approach, the architectural program performs the following steps: (1) populates the data RAM 122 with the input node D values for a single time step (e.g., row 0); (2) starts the non-architectural program (a modified version of the Figure 45 non-architectural program that does not loop and accesses only a single quadruple-row group of the data RAM 122); (3) detects that the non-architectural program has finished; (4) reads the output node Y values from the data RAM 122 (e.g., row 3); and (5) repeats steps (1) through (4) as many times as needed to complete a task. Which of the two approaches is preferable may depend upon the manner in which the input values of the RNN are sampled. For example, if the task tolerates sampling the inputs over multiple time steps (e.g., on the order of 16 time steps) and performing the computations afterward, the first approach may be preferable, since it is likely more computational-resource-efficient and/or higher performance; whereas, if the task only tolerates sampling within a single time step, the second approach is required.
A third embodiment is similar to the second approach but, rather than using a single quadruple-row group of the data RAM 122, the non-architectural program uses multiple quadruple-row groups, i.e., a different quadruple-row group for each time step, in which respect it resembles the first approach. In this third embodiment, preferably the architectural program includes a step before step (2) in which it updates the non-architectural program before starting it, e.g., by updating the data RAM 122 row in the instruction at address 1 to point to the next quadruple-row group.
Figure 45 is a table illustrating a program stored in the program memory 129 of the NNU 121 that is executed by the NNU 121 and uses data and weights according to the layout of Figure 44 to accomplish a Jordan RNN. The non-architectural program of Figure 45 is similar in many respects to the non-architectural program of Figure 42; the differences between the two are described in the relevant passages herein.
The example program of Figure 45 includes 14 non-architectural instructions, at addresses 0 through 13, respectively. The instruction at address 0 is an initialize instruction that clears the accumulator 202 and initializes the loop counter 3804 to a value of 15, to cause the loop body (the instructions at addresses 4 through 12) to be performed 15 times. Preferably, the initialize instruction also puts the NNU 121 in a wide configuration, such that the NNU 121 is configured as 512 NPUs 126. As described herein, during the execution of the instructions at addresses 1 through 3 and 8 through 12, the 512 NPUs 126 correspond to and operate as the 512 hidden layer nodes Z, and during the execution of the instructions at addresses 4, 5 and 7, the 512 NPUs 126 correspond to and operate as the 512 output layer nodes Y.
The instructions at addresses 1 through 5 and 7 are the same as the instructions at addresses 1 through 6 of Figure 42 and perform the same functions. The instructions at addresses 1 through 3 compute the initial values of the hidden layer nodes Z and write them to row 1 of the data RAM 122 for use by the first execution of the instructions at addresses 4, 5 and 7 to compute the output layer nodes Y of the first time step (time step 0).
During the first execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the output layer node Y values) are passed through and written to row 2 of the data RAM 122; these are the context layer node C values produced in the first time step (time step 0) and used in the second time step (time step 1). During the second execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the output layer node Y values) are passed through and written to row 6 of the data RAM 122; these are the context layer node C values produced in the second time step (time step 1) and used in the third time step (time step 2). And so forth until, during the fifteenth execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 are passed through and written to row 58 of the data RAM 122; these are the context layer node C values produced in the fifteenth time step (time step 14) (which are read by the instruction at address 8, but are not used).
The instructions at addresses 8 through 12 are largely the same as the instructions at addresses 7 through 11 of Figure 42 and perform the same functions, with one difference: the instruction at address 8 of Figure 45 (ADD_D_ACC DR ROW+1) increments the data RAM 122 row by one, whereas the instruction at address 7 of Figure 42 (ADD_D_ACC DR ROW+0) increments the data RAM 122 row by zero. The difference arises from the different data layouts within the data RAM 122, in particular, the fact that the quadruple-row group layout of Figure 44 includes a separate row for the context layer node C values (e.g., rows 2, 6, 10 and so forth), whereas the triple-row group layout of Figure 41 has no such separate row, but instead the context layer node C values share a row with the hidden layer node Z values (e.g., rows 1, 4, 7 and so forth). The first 14 executions of the instructions at addresses 8 through 12 compute the values of the hidden layer nodes Z and write them to the data RAM 122 (rows 5, 9, 13 and so forth through row 57) for use by the second through fifteenth executions of the instructions at addresses 4, 5 and 7 to compute the output layer nodes Y of the subsequent time steps (time steps 1 through 14). (The last/fifteenth execution of the instructions at addresses 8 through 12 computes the hidden layer node Z values and writes them to row 61 of the data RAM 122, but these values are not used.)
The loop instruction at address 13 decrements the loop counter 3804 and, if the new loop counter 3804 value is greater than zero, loops back to the instruction at address 4.
In an alternative embodiment, the Jordan RNN is designed such that the context nodes C hold the activation function values of the output nodes Y, i.e., the accumulated values after the activation function has been applied. In this embodiment, the non-architectural instruction at address 6 is not included in the non-architectural program, since the output node Y values are the same as the context node C values. Fewer rows of the data RAM 122 are therefore consumed. More precisely, none of the rows of Figure 44 that hold the context node C values (e.g., rows 2, 6 and so forth) is present in this embodiment. Additionally, each time step of this embodiment requires only three rows of the data RAM 122, so that 20 time steps may be accommodated, and the addresses of the instructions of the non-architectural program of Figure 45 are adjusted appropriately.
Long Short-Term Memory Cells
The use of long short-term memory (LSTM) cells in recurrent neural networks is well known in the art. See, e.g., Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, Neural Computation, November 15, 1997, Vol. 9, No. 8, Pages 1735-1780; Learning to Forget: Continual Prediction with LSTM, Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins, Neural Computation, October 2000, Vol. 12, No. 10, Pages 2451-2471; both available from MIT Press Journals. LSTM cells may be constructed in various different forms. The LSTM cell 4600 of Figure 46, described below, is modeled after the LSTM cell described in the tutorial entitled LSTM Networks for Sentiment Analysis at http://deeplearning.net/tutorial/lstm.html, a copy of which was downloaded on October 19, 2015 (hereinafter the "LSTM tutorial") and provided in an Information Disclosure Statement filed with this application. The LSTM cell 4600 is used generically to describe the ability of the embodiments of the NNU 121 described herein to efficiently perform calculations associated with LSTMs. It should be noted that these embodiments of the NNU 121, including the embodiment of Figure 49, may efficiently perform calculations associated with LSTM cells other than the LSTM cell described in Figure 46.
Preferably, the NNU 121 may be used to perform calculations for a recurrent neural network that includes an LSTM cell layer connected to other layers. For example, in the LSTM tutorial, the network includes a mean pooling layer that receives the outputs (H) of the LSTM cells of the LSTM layer, and a logistic regression layer that receives the output of the mean pooling layer.
Figure 46 is a block diagram illustrating an embodiment of an LSTM cell 4600.
As shown, the LSTM cell 4600 includes a cell input (X), a cell output (H), an input gate (I), an output gate (O), a forget gate (F), a cell state (C) and a candidate cell state (C'). The input gate (I) gates the passage of the cell input (X) to the cell state (C), and the output gate (O) gates the passage of the cell state (C) to the cell output (H). The cell state (C) of a time step is fed back: the forget gate (F) gates the fed-back cell state, which, together with the candidate cell state (C') gated by the input gate (I), becomes the cell state (C) of the next time step.
The embodiment of Figure 46 computes the various values above using the following equations:
(1) I = SIGMOID(Wi*X + Ui*H + Bi)
(2) F = SIGMOID(Wf*X + Uf*H + Bf)
(3) C' = TANH(Wc*X + Uc*H + Bc)
(4) C = I*C' + F*C
(5) O = SIGMOID(Wo*X + Uo*H + Bo)
(6) H = O*TANH(C)
Wi and Ui are the weight values associated with the input gate (I), and Bi is the bias value associated with the input gate (I). Wf and Uf are the weight values associated with the forget gate (F), and Bf is the bias value associated with the forget gate (F). Wo and Uo are the weight values associated with the output gate (O), and Bo is the bias value associated with the output gate (O). As shown, equations (1), (2) and (5) compute the input gate (I), the forget gate (F) and the output gate (O), respectively. Equation (3) computes the candidate cell state (C'), and equation (4) computes the cell state (C) using the candidate cell state (C') and the current cell state (C) as inputs, the current cell state (C) being the cell state (C) of the current time step. Equation (6) computes the cell output (H). However, the present invention is not limited in this respect; embodiments of an LSTM cell that compute the input gate, forget gate, output gate, candidate cell state, cell state and cell output in other ways are also contemplated by the present invention.
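Equations (1) through (6) translate directly into code. The sketch below is a single-cell, scalar-weight rendering for illustration only; the parameter-dictionary form is an assumption, not the NNU's storage format:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, p):
    """One LSTM cell update per equations (1)-(6); p holds the scalar
    weights and biases Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo, Bo."""
    i = sigmoid(p['Wi'] * x + p['Ui'] * h + p['Bi'])         # (1) input gate
    f = sigmoid(p['Wf'] * x + p['Uf'] * h + p['Bf'])         # (2) forget gate
    c_cand = math.tanh(p['Wc'] * x + p['Uc'] * h + p['Bc'])  # (3) candidate state C'
    c_new = i * c_cand + f * c                               # (4) new cell state C
    o = sigmoid(p['Wo'] * x + p['Uo'] * h + p['Bo'])         # (5) output gate
    h_new = o * math.tanh(c_new)                             # (6) cell output H
    return h_new, c_new
```

Note how the returned pair (H, C) is exactly what must persist from one time step to the next, matching the feedback paths of Figure 46.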
For purposes of the present disclosure, an LSTM cell includes a cell input, a cell output, a cell state, a candidate cell state, an input gate, an output gate and a forget gate. For each time step, the input gate, output gate, forget gate and candidate cell state are functions of the cell input of the current time step, the cell output of the previous time step, and the associated weights. The cell state of the current time step is a function of the cell state of the previous time step, the candidate cell state, the input gate and the forget gate. In this sense, the cell state is fed back and used to compute the cell state of the next time step. The cell output of the current time step is a function of the cell state computed for the current time step and the output gate. An LSTM neural network is a neural network that includes a layer of LSTM cells.
Figure 47 is a block diagram illustrating an example of the layout of data within the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 as it performs the computations associated with the layer of 128 LSTM cells 4600 of the LSTM neural network of Figure 46. In the example of Figure 47, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration; however, the values produced by only 128 of the neural processing units 126 (e.g., neural processing units 0 through 127) are used, because the LSTM layer of this example has only 128 LSTM cells 4600.
As shown, the weight random access memory 124 holds the weight values, bias values and intermediate values for the corresponding neural processing units 0 through 127 of the neural network unit 121. The words in columns 0 through 127 of the weight random access memory 124 hold the values for the corresponding neural processing units 0 through 127. Each of rows 0 through 14 holds 128 of the following values, corresponding to equations (1) through (6) above, for provision to neural processing units 0 through 127: Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, C', TANH(C), C, Wo, Uo, Bo. Preferably, the weight and bias values, namely Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo and Bo (in rows 0 through 8 and rows 12 through 14), are written to/populated in the weight random access memory 124 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 48. Preferably, the intermediate values, namely C', TANH(C) and C (in rows 9 through 11), are written to/populated in the weight random access memory 124 and read/used by the non-architectural program running on the neural network unit 121, as described in more detail below.
As shown, the data random access memory 122 holds the cell input (X), cell output (H), input gate (I), forget gate (F) and output gate (O) values for a sequence of time steps. More specifically, a group of five rows holds the X, H, I, F and O values for a given time step. In an embodiment in which the data random access memory 122 has 64 rows, as shown, the data random access memory 122 can hold the cell values for 12 different time steps. In the example of Figure 47, rows 0 through 4 hold the cell values for time step 0, rows 5 through 9 hold the cell values for time step 1, and so forth, and rows 55 through 59 hold the cell values for time step 11. The first row of a five-row group holds the X values of the time step. The second row of the group holds the H values of the time step. The third row of the group holds the I values of the time step. The fourth row of the group holds the F values of the time step. The fifth row of the group holds the O values of the time step. As shown, each column of the data random access memory 122 holds the values for use by the corresponding neuron, or neural processing unit 126. That is, column 0 holds the values associated with LSTM cell 0, whose computations are performed by neural processing unit 0; column 1 holds the values associated with LSTM cell 1, whose computations are performed by neural processing unit 1; and so forth, and column 127 holds the values associated with LSTM cell 127, whose computations are performed by neural processing unit 127, as described in more detail below with respect to Figure 48.
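The five-rows-per-time-step layout just described can be sketched as a small index helper (an illustrative model of the Figure 47 arrangement only; the function and key names are not patent nomenclature):

```python
def quintuplet_rows(t):
    """Data RAM 122 rows holding the X, H, I, F and O values for
    time step t under the Figure 47 layout (five rows per step)."""
    base = 5 * t
    return {"X": base, "H": base + 1, "I": base + 2,
            "F": base + 3, "O": base + 4}

# A 64-row data RAM accommodates floor(64 / 5) = 12 time steps.
MAX_STEPS = 64 // 5
```

For example, time step 11 occupies rows 55 through 59, matching the text.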
Preferably, the X values (in rows 0, 5, 10 and so forth to row 55) are written to/populated in the data random access memory 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 48. Preferably, the I, F and O values (in rows 2/3/4, 7/8/9, 12/13/14 and so forth to rows 57/58/59) are written to/populated in the data random access memory 122 by the non-architectural program running on the neural network unit 121, as described below. Preferably, the H values (in rows 1, 6, 11 and so forth to row 56) are written to/populated in and read/used by the non-architectural program running on the neural network unit 121, and are read by the architectural program running on the processor 100 via MFNN instructions 1500.
The example of Figure 47 assumes that the architectural program performs the following steps: (1) populates the data random access memory 122 with the input X values for 12 different time steps (rows 0, 5, and so forth to row 55); (2) starts the non-architectural program of Figure 48; (3) detects that the non-architectural program has completed; (4) reads the output H values from the data random access memory 122 (rows 1, 6, and so forth to row 59); and (5) repeats steps (1) through (4) as many times as needed to complete a task, e.g., the computations used to perform the recognition of an utterance of a mobile phone user.
In an alternative approach, the architectural program performs the following steps: (1) populates the data random access memory 122 with the input X values for a single time step (e.g., row 0); (2) starts the non-architectural program (a modified version of the non-architectural program of Figure 48 that does not require the loop and accesses only a single group of five data random access memory 122 rows); (3) detects that the non-architectural program has completed; (4) reads the output H values from the data random access memory 122 (e.g., row 1); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is preferable depends on the manner in which the input X values of the LSTM layer are sampled. For example, if the task tolerates sampling the input over multiple time steps (e.g., on the order of 12 time steps) before performing the computations, the first approach may be preferable, since it is likely more computational-resource efficient and/or higher performance; whereas, if the task only tolerates sampling at a single time step, the second approach is required.
A third embodiment is similar to the second approach but, rather than using a single group of five data random access memory 122 rows, uses multiple five-row groups, i.e., a different five-row group for each time step, in this respect similar to the first approach. In this third embodiment, preferably the architectural program includes a step prior to step (2) in which it updates the non-architectural program before starting it, e.g., updates the data random access memory 122 row in the instruction at address 0 to point to the next five-row group.
Figure 48 is a table illustrating a program stored in the program memory 129 of the neural network unit 121 and executed by the neural network unit 121, which uses data and weights according to the arrangement of Figure 47 to accomplish the computations associated with the LSTM cell layer. The example program of Figure 48 includes 24 non-architectural instructions at addresses 0 through 23, respectively. The instruction at address 0 (INITIALIZE NPU, CLRACC, LOOPCNT=12, DR IN ROW=-1, DR OUT ROW=2) clears the accumulator 202 and initializes the loop counter 3804 to the value 12, to cause the loop body (the instructions at addresses 1 through 22) to be performed 12 times. The initialize instruction also initializes the data random access memory 122 row to be read to the value -1, which is incremented to zero by the first execution of the instruction at address 1. The initialize instruction also initializes the data random access memory 122 row to be written (e.g., the buffer 2606 of Figures 26 and 39) to row 2. Preferably, the initialize instruction also puts the neural network unit 121 in a wide configuration, such that the neural network unit 121 is configured as 512 neural processing units 126. As described below, during the execution of the instructions at addresses 0 through 23, 128 of the 512 neural processing units 126 correspond to, and operate as, the 128 LSTM cells 4600.
On the first execution of the instructions at addresses 1 through 4, each of the 128 neural processing units 126 (i.e., neural processing units 0 through 127) computes the input gate (I) value for the first time step (time step 0) of its corresponding LSTM cell 4600 and writes the I value to the corresponding word of row 2 of the data random access memory 122; on the second execution of the instructions at addresses 1 through 4, each of the 128 neural processing units 126 computes the I value for the second time step (time step 1) of its corresponding LSTM cell 4600 and writes the I value to the corresponding word of row 7 of the data random access memory 122; and so forth, until on the twelfth execution of the instructions at addresses 1 through 4, each of the 128 neural processing units 126 computes the I value for the twelfth time step (time step 11) of its corresponding LSTM cell 4600 and writes the I value to the corresponding word of row 57 of the data random access memory 122, as shown in Figure 47.
More specifically, the multiply-accumulate instruction at address 1 reads the next row after the current data random access memory 122 row (row 0 on the first execution, row 5 on the second execution, and so forth, row 55 on the twelfth execution), which contains the cell input (X) values associated with the current time step, reads row 0 of the weight random access memory 124, which contains the Wi values, and multiplies the values read to generate a first product accumulated into the accumulator 202, which was just cleared by the initialize instruction at address 0 or by the instruction at address 22. Then, the multiply-accumulate instruction at address 2 reads the next data random access memory 122 row (row 1 on the first execution, row 6 on the second execution, and so forth, row 56 on the twelfth execution), which contains the cell output (H) values associated with the current time step, reads row 1 of the weight random access memory 124, which contains the Ui values, and multiplies the values to generate a second product accumulated into the accumulator 202. The H values associated with the current time step, which are read from the data random access memory 122 by the instruction at address 2 (and by the instructions at addresses 6, 10 and 18), were produced during the previous time step and written to the data random access memory 122 by the output instruction at address 22; however, on the first execution, the instruction at address 2 reads row 1 of the data random access memory 122 with initial values written as the H values. Preferably, the architectural program writes the initial H values to row 1 of the data random access memory 122 (e.g., using MTNN instructions 1400) before starting the non-architectural program of Figure 48; however, the present invention is not limited in this respect, and other embodiments in which the non-architectural program includes an initialization instruction that writes the initial H values to row 1 of the data random access memory 122 are also contemplated. In one embodiment, the initial H values are zero. Next, the add-weight-word-to-accumulator instruction at address 3 (ADD_W_ACC WR ROW2) reads row 2 of the weight random access memory 124, which contains the Bi values, and adds them to the accumulator 202. Finally, the output instruction at address 4 (OUTPUT SIGMOID, DR OUT ROW+0, CLR ACC) performs a sigmoid activation function on the accumulator 202 value, writes the result to the current output row of the data random access memory 122 (row 2 on the first execution, row 7 on the second execution, and so forth, row 57 on the twelfth execution), and clears the accumulator 202.
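The accumulator dataflow of the instructions at addresses 1 through 4 can be sketched as follows (a software model of one neural processing unit's accumulator only; the scalar arguments stand in for the words that the instructions read from the data RAM and weight RAM rows, and the function name is illustrative):

```python
import math

def compute_gate(x, h, w_val, u_val, b_val):
    """Model of addresses 1-4: two multiply-accumulates, an
    add-weight-word, then OUTPUT SIGMOID with accumulator clear."""
    acc = 0.0                            # accumulator 202, cleared beforehand
    acc += w_val * x                     # address 1: Wi (weight RAM row 0) * X
    acc += u_val * h                     # address 2: Ui (weight RAM row 1) * H
    acc += b_val                         # address 3: add Bi (weight RAM row 2)
    out = 1.0 / (1.0 + math.exp(-acc))   # address 4: sigmoid, written to data RAM
    return out                           # accumulator cleared after the output
```

The same pattern, with different weight RAM rows, models the F computation at addresses 5 through 8 and the O computation at addresses 17 through 20.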
On the first execution of the instructions at addresses 5 through 8, each of the 128 neural processing units 126 computes the forget gate (F) value for the first time step (time step 0) of its corresponding LSTM cell 4600 and writes the F value to the corresponding word of row 3 of the data random access memory 122; on the second execution of the instructions at addresses 5 through 8, each of the 128 neural processing units 126 computes the forget gate (F) value for the second time step (time step 1) of its corresponding LSTM cell 4600 and writes the F value to the corresponding word of row 8 of the data random access memory 122; and so forth, until on the twelfth execution of the instructions at addresses 5 through 8, each of the 128 neural processing units 126 computes the forget gate (F) value for the twelfth time step (time step 11) of its corresponding LSTM cell 4600 and writes the F value to the corresponding word of row 58 of the data random access memory 122, as shown in Figure 47. The instructions at addresses 5 through 8 compute the F values in a manner similar to the instructions at addresses 1 through 4 described above; however, the instructions at addresses 5 through 7 read the Wf, Uf and Bf values from rows 3, 4 and 5, respectively, of the weight random access memory 124 to perform the multiply and/or add operations.
During the 12 executions of the instructions at addresses 9 through 12, each of the 128 neural processing units 126 computes the candidate cell state (C') value for the corresponding time step of its corresponding LSTM cell 4600 and writes the C' value to the corresponding word of row 9 of the weight random access memory 124. The instructions at addresses 9 through 12 compute the C' values in a manner similar to the instructions at addresses 1 through 4 described above; however, the instructions at addresses 9 through 11 read the Wc, Uc and Bc values from rows 6, 7 and 8, respectively, of the weight random access memory 124 to perform the multiply and/or add operations. Additionally, the output instruction at address 12 performs a tanh activation function rather than a sigmoid activation function (as the output instruction at address 4 does).
More specifically, the multiply-accumulate instruction at address 9 reads the current data random access memory 122 row (row 0 on the first execution, row 5 on the second execution, and so forth, row 55 on the twelfth execution), which contains the cell input (X) values associated with the current time step, reads row 6 of the weight random access memory 124, which contains the Wc values, and multiplies the values to generate a first product accumulated into the accumulator 202, which was just cleared by the instruction at address 8. Next, the multiply-accumulate instruction at address 10 reads the next data random access memory 122 row (row 1 on the first execution, row 6 on the second execution, and so forth, row 56 on the twelfth execution), which contains the cell output (H) values associated with the current time step, reads row 7 of the weight random access memory 124, which contains the Uc values, and multiplies the values to generate a second product accumulated into the accumulator 202. Next, the add-weight-word-to-accumulator instruction at address 11 reads row 8 of the weight random access memory 124, which contains the Bc values, and adds them to the accumulator 202. Finally, the output instruction at address 12 (OUTPUT TANH, WR OUT ROW 9, CLR ACC) performs a tanh activation function on the accumulator 202 value, writes the result to row 9 of the weight random access memory 124, and clears the accumulator 202.
During the 12 executions of the instructions at addresses 13 through 16, each of the 128 neural processing units 126 computes the new cell state (C) value for the corresponding time step of its corresponding LSTM cell 4600 and writes the new C value to the corresponding word of row 11 of the weight random access memory 124, and each neural processing unit 126 also computes tanh(C) and writes it to the corresponding word of row 10 of the weight random access memory 124. More specifically, the multiply-accumulate instruction at address 13 reads the next row after the current data random access memory 122 row (row 2 on the first execution, row 7 on the second execution, and so forth, row 57 on the twelfth execution), which contains the input gate (I) values associated with the current time step, reads row 9 of the weight random access memory 124, which contains the candidate cell state (C') values (just written by the instruction at address 12), and multiplies the values to generate a first product added to the accumulator 202, which was just cleared by the instruction at address 12. Next, the multiply-accumulate instruction at address 14 reads the next data random access memory 122 row (row 3 on the first execution, row 8 on the second execution, and so forth, row 58 on the twelfth execution), which contains the forget gate (F) values associated with the current time step, reads the row of the weight random access memory 124 that contains the current cell state (C) values computed during the previous time step (written by the most recent execution of the instruction at address 15), namely row 11, and multiplies the values to generate a second product added to the accumulator 202. Next, the output instruction at address 15 (OUTPUT PASSTHRU, WR OUT ROW 11) passes through the accumulator 202 value and writes it to row 11 of the weight random access memory 124. It should be understood that the C value read from row 11 of the weight random access memory 124 by the instruction at address 14 is the C value generated and written by the most recent execution of the instructions at addresses 13 through 15. The output instruction at address 15 does not clear the accumulator 202; thus, its value can be used by the instruction at address 16. Finally, the output instruction at address 16 (OUTPUT TANH, WR OUT ROW 10, CLR ACC) performs a tanh activation function on the accumulator 202 value and writes the result to row 10 of the weight random access memory 124 for use by the instruction at address 21 to compute the cell output (H) value. The instruction at address 16 clears the accumulator 202.
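The non-clearing pass-through at address 15 followed by the tanh at address 16 can be modeled as follows (an illustrative software sketch of one neural processing unit's accumulator; the function name and arguments are not patent nomenclature):

```python
import math

def update_cell_state(i_val, c_cand, f_val, c_prev):
    """Model of addresses 13-16: C = I*C' + F*C(prev). The PASSTHRU
    output at address 15 does not clear the accumulator, so the same
    accumulated value then feeds the TANH output at address 16."""
    acc = 0.0
    acc += i_val * c_cand    # address 13: I * C' (C' from weight RAM row 9)
    acc += f_val * c_prev    # address 14: F * previous-step C (weight RAM row 11)
    c_new = acc              # address 15: PASSTHRU to weight RAM row 11, no clear
    tanh_c = math.tanh(acc)  # address 16: TANH to weight RAM row 10, then clear
    return c_new, tanh_c
```

Both outputs are written to the weight RAM: the new C for the next time step's address 14 read, and tanh(C) for the address 21 multiply.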
On the first execution of the instructions at addresses 17 through 20, each of the 128 neural processing units 126 computes the output gate (O) value for the first time step (time step 0) of its corresponding LSTM cell 4600 and writes the O value to the corresponding word of row 4 of the data random access memory 122; on the second execution of the instructions at addresses 17 through 20, each of the 128 neural processing units 126 computes the output gate (O) value for the second time step (time step 1) of its corresponding LSTM cell 4600 and writes the O value to the corresponding word of row 9 of the data random access memory 122; and so forth, until on the twelfth execution of the instructions at addresses 17 through 20, each of the 128 neural processing units 126 computes the output gate (O) value for the twelfth time step (time step 11) of its corresponding LSTM cell 4600 and writes the O value to the corresponding word of row 59 of the data random access memory 122, as shown in Figure 47. The instructions at addresses 17 through 20 compute the O values in a manner similar to the instructions at addresses 1 through 4 described above; however, the instructions at addresses 17 through 19 read the Wo, Uo and Bo values from rows 12, 13 and 14, respectively, of the weight random access memory 124 to perform the multiply and/or add operations.
On the first execution of the instructions at addresses 21 through 22, each of the 128 neural processing units 126 computes the cell output (H) value for the first time step (time step 0) of its corresponding LSTM cell 4600 and writes the H value to the corresponding word of row 6 of the data random access memory 122; on the second execution of the instructions at addresses 21 through 22, each of the 128 neural processing units 126 computes the cell output (H) value for the second time step (time step 1) of its corresponding LSTM cell 4600 and writes the H value to the corresponding word of row 11 of the data random access memory 122; and so forth, until on the twelfth execution of the instructions at addresses 21 through 22, each of the 128 neural processing units 126 computes the cell output (H) value for the twelfth time step (time step 11) of its corresponding LSTM cell 4600 and writes the H value to the corresponding word of row 61 of the data random access memory 122, as shown in Figure 47.
More specifically, the multiply-accumulate instruction at address 21 reads the third row after the current data random access memory 122 row (row 4 on the first execution, row 9 on the second execution, and so forth, row 59 on the twelfth execution), which contains the output gate (O) values associated with the current time step, reads row 10 of the weight random access memory 124, which contains the tanh(C) values (written by the instruction at address 16), and multiplies the values to generate a product added to the accumulator 202, which was just cleared by the instruction at address 20. Then, the output instruction at address 22 passes through the accumulator 202 value, writes it to the second next output row of the data random access memory 122 (row 6 on the first execution, row 11 on the second execution, and so forth, row 61 on the twelfth execution), and clears the accumulator 202. It should be understood that the H values written to a row of the data random access memory 122 by the instruction at address 22 (row 6 on the first execution, row 11 on the second execution, and so forth, row 61 on the twelfth execution) are the H values consumed/read by the subsequent executions of the instructions at addresses 2, 6, 10 and 18. However, the H values written to row 61 on the twelfth execution are not consumed/read by executions of the instructions at addresses 2, 6, 10 and 18; rather, in a preferred embodiment, they are consumed/read by the architectural program.
The loop instruction at address 23 (LOOP 1) decrements the loop counter 3804 and loops back to the instruction at address 1 if the new loop counter 3804 value is greater than zero.
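Under the row sequencing described above, the data RAM rows touched by each of the 12 loop iterations can be sketched as a dataflow model (illustration only; the dictionary keys are descriptive names, not patent nomenclature):

```python
def row_schedule(steps=12):
    """Data RAM 122 rows touched by each loop iteration of the
    Figure 48 program: the X and H reads, the I, F and O writes,
    and the H output write (which lands in the next group's H row)."""
    sched = []
    for t in range(steps):
        base = 5 * t
        sched.append({
            "x_read": base,        # address 1 (rows 0, 5, ..., 55)
            "h_read": base + 1,    # address 2 (rows 1, 6, ..., 56)
            "i_write": base + 2,   # address 4 (rows 2, 7, ..., 57)
            "f_write": base + 3,   # address 8 (rows 3, 8, ..., 58)
            "o_write": base + 4,   # address 20 (rows 4, 9, ..., 59)
            "h_write": base + 6,   # address 22 (rows 6, 11, ..., 61)
        })
    return sched
```

Note that the H value written at iteration t is exactly the H value read at iteration t+1, which models the feedback of the cell output across time steps described above.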
Figure 49 is a block diagram illustrating an embodiment of the neural network unit 121 with output buffer masking and feedback capability within neural processing unit groups. Figure 49 illustrates a single neural processing unit group 4901 of four neural processing units 126. Although Figure 49 illustrates a single neural processing unit group 4901, it should be understood that each neural processing unit 126 of the neural network unit 121 is included in a neural processing unit group 4901, such that there are N/J neural processing unit groups 4901, where N is the number of neural processing units 126 (e.g., 512 in a wide configuration, or 1024 in a narrow configuration) and J is the number of neural processing units 126 in a single group 4901 (e.g., four in the embodiment of Figure 49). The four neural processing units 126 in the neural processing unit group 4901 of Figure 49 are referred to as neural processing unit 0, neural processing unit 1, neural processing unit 2 and neural processing unit 3.
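The group count follows directly from the stated parameters (a trivial illustrative check; the function name is not patent nomenclature):

```python
def npu_group_count(n_npus, group_size):
    """Number of neural processing unit groups 4901, i.e., N/J."""
    assert n_npus % group_size == 0
    return n_npus // group_size
```

For example, a wide configuration of 512 neural processing units with groups of four yields 128 groups, one per LSTM cell in the Figure 50 example.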
Each neural processing unit 126 in the embodiment of Figure 49 is similar to the neural processing unit 126 of Figure 7 described above, and like-numbered elements are similar. However, the multiplexed register 208 is modified to include four additional inputs 4905, the multiplexed register 705 is modified to include four additional inputs 4907, the selection input 213 is modified to select among the original inputs 211 and 207 as well as the additional inputs 4905 for provision on the output 209, and the selection input 713 is modified to select among the original inputs 711 and 206 as well as the additional inputs 4907 for provision on the output 203.
As shown, the row buffer 1104 of Figure 11 is shown as the output buffer 1104 in Figure 49. More specifically, words 0, 1, 2 and 3 of the output buffer 1104 receive the corresponding outputs of the four activation function units 212 associated with neural processing units 0, 1, 2 and 3. The portion of the output buffer 1104 comprising the N words corresponding to a neural processing unit group 4901 is referred to as an output buffer word group. In the embodiment of Figure 49, N is four. The four words of the output buffer 1104 are fed back to the multiplexed registers 208 and 705 and are received as the four additional inputs 4905 by the multiplexed registers 208 and as the four additional inputs 4907 by the multiplexed registers 705. The feedback of an output buffer word group to its corresponding neural processing unit group 4901 enables an arithmetic instruction of the non-architectural program to select as its inputs one or two of the words of the output buffer 1104 associated with the neural processing unit group 4901 (i.e., of the output buffer word group); examples are found in the non-architectural program of Figure 51 below, e.g., the instructions at addresses 4, 8, 11, 12 and 15. That is, the output buffer 1104 word specified in the non-architectural instruction determines the value generated on the selection input 213/713. This capability effectively enables the output buffer 1104 to serve as a scratch pad memory, allowing the non-architectural program to reduce the number of writes to the data random access memory 122 and/or the weight random access memory 124 and of subsequent reads from them, e.g., of intermediate values produced and used along the way. Preferably, the output buffer 1104, or row buffer 1104, comprises a one-dimensional register array for storing 1024 narrow words or 512 wide words. Preferably, a read of the output buffer 1104 can be performed in a single clock cycle, and a write of the output buffer 1104 can also be performed in a single clock cycle. Unlike the data random access memory 122 and the weight random access memory 124, which are accessible by both the architectural program and the non-architectural program, the output buffer 1104 is not accessible by the architectural program and is accessible only by the non-architectural program.
The output buffer 1104 is modified to receive a mask input 4903. Preferably, the mask input 4903 includes four bits corresponding to the four words of the output buffer 1104 associated with the four neural processing units 126 of a neural processing unit group 4901. Preferably, if the mask input 4903 bit corresponding to a word of the output buffer 1104 is true, the word of the output buffer 1104 retains its current value; otherwise, the word of the output buffer 1104 is updated with the activation function unit 212 output. That is, if the mask input 4903 bit corresponding to the word of the output buffer 1104 is false, the activation function unit 212 output is written to the word of the output buffer 1104. Thus, an output instruction of the non-architectural program may selectively write the activation function unit 212 output to some words of the output buffer 1104 and leave the current values of the other words of the output buffer 1104 unchanged; examples are found in the instructions of the non-architectural program of Figure 51 below, e.g., the instructions at addresses 6, 10, 13 and 14. That is, the output buffer 1104 words specified in the non-architectural instruction determine the values generated on the mask input 4903.
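The masked-update rule just described can be sketched as follows (an illustrative software model of one output buffer word group; a true mask bit preserves the word, a false bit lets the activation function unit output through):

```python
def masked_update(out_buf_words, afu_outputs, mask_bits):
    """Model of mask input 4903 over the four words of an output
    buffer word group: keep the current word where the mask bit is
    true, otherwise write the activation function unit 212 output."""
    return [cur if keep else new
            for cur, new, keep in zip(out_buf_words, afu_outputs, mask_bits)]
```

This is how an output instruction can update, e.g., only word 0 of a group while words 1 through 3 retain the intermediate values they are holding.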
For simplicity of illustration, Figure 49 does not show the inputs 1811 of the multiplexed registers 208/705 (shown in Figures 18, 19 and 23). However, embodiments that simultaneously support both dynamically configurable neural processing units 126 and the feedback/masking of the output buffer 1104 are also contemplated. Preferably, in such embodiments, the output buffer word groups are correspondingly dynamically configurable.
It should be understood that although in this embodiment the number of neural processing units 126 in a neural processing unit group 4901 is four, the present invention is not limited in this respect; embodiments with more or fewer neural processing units 126 per group also fall within the scope of the invention. Furthermore, for an embodiment with shared activation function units 1112, as shown in Figure 52, the number of neural processing units 126 per neural processing unit group 4901 interacts with the number of neural processing units 126 per activation function unit 212 group. The masking and feedback capability of the output buffer 1104 within a neural processing unit group is particularly helpful for improving the efficiency of the computations associated with long short-term memory (LSTM) cells 4600, as described below with respect to Figures 50 and 51.
Figure 50 is a block diagram illustrating an example of the data layout in the data random access memory 122, the weight random access memory 124 and the output buffer 1104 of the neural network unit 121 of Figure 49 as the neural network unit performs the computations of a layer of 128 LSTM cells 4600 of Figure 46. In the example of Figure 50, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. As in the examples of Figures 47 and 48, the LSTM layer in the examples of Figures 50 and 51 has only 128 LSTM cells 4600. However, in the example of Figure 50, values produced by all 512 neural processing units 126 are used. When the non-architectural program of Figure 51 is executed, each neural processing unit group 4901 collectively operates as an LSTM cell 4600.
As shown, the data random access memory 122 holds cell input (X) and cell output (H) values for a series of time steps. More specifically, for a given time step, a pair of two rows holds the X values and the H values, respectively. In an embodiment in which the data random access memory 122 has 64 rows, as shown, the cell values it holds can serve 31 different time steps. In the example of Figure 50, rows 2 and 3 hold the values for time step 0, rows 4 and 5 hold the values for time step 1, and so on through rows 62 and 63, which hold the values for time step 30. The first row of a pair holds the X values of the time step, and the second row holds the H values of the time step. As shown, each group of four columns of the data random access memory 122 corresponding to a neural processing unit group 4901 holds the values for its corresponding LSTM cell 4600. That is, columns 0 through 3 hold the values associated with LSTM cell 0, whose computations are performed by neural processing units 0-3, i.e., by neural processing unit group 0; columns 4 through 7 hold the values associated with LSTM cell 1, whose computations are performed by neural processing units 4-7, i.e., by neural processing unit group 1; and so on through columns 508 through 511, which hold the values associated with LSTM cell 127, whose computations are performed by neural processing units 508-511, i.e., by neural processing unit group 127, as described below with respect to Figure 51. As shown, row 1 is unused, and row 0 holds the initial cell output (H) values, which in a preferred embodiment are populated with zero values by the architectural program; however, the present invention is not limited in this respect, and populating the initial cell output (H) values of row 0 by non-architectural program instructions also falls within the scope of the invention.
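Under the layout just described, the row pair for a time step and the column group for a cell follow directly from the indices. The sketch below is an interpretation of the figure with hypothetical helper names:

```python
def data_ram_rows(time_step):
    """Rows of the data RAM 122 holding X and H for a time step:
    (2, 3) for step 0, (4, 5) for step 1, ..., (62, 63) for step 30.
    Row 0 holds the initial H values; row 1 is unused."""
    assert 0 <= time_step <= 30
    return 2 * time_step + 2, 2 * time_step + 3

def cell_columns(cell_index):
    """Four data RAM columns serving LSTM cell j, i.e., NPU group j."""
    return list(range(4 * cell_index, 4 * cell_index + 4))
```

For example, time step 30 maps to rows (62, 63) and cell 127 to columns 508 through 511, the last usable rows and columns of the 64x512 layout.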
Preferably, the X values (in rows 2, 4, 6 and so on through 62) are written/populated into the data random access memory 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 51. Preferably, the H values (in rows 3, 5, 7 and so on through 63) are written/populated into the data random access memory 122, and read/used, by the non-architectural program running on the neural network unit 121, as described below. Preferably, the H values are read by the architectural program running on the processor 100 via MFNN instructions 1500. It should be noted that the non-architectural program of Figure 51 assumes that, within each group of four columns corresponding to a neural processing unit group 4901 (e.g., columns 0-3, columns 4-7, columns 8-11, and so on through columns 508-511), the four X values in a given row are populated with the same value (e.g., by the architectural program). Similarly, the non-architectural program of Figure 51 computes and writes the same value to the four H values in a given row within each group of four columns corresponding to a neural processing unit group 4901.
As shown, the weight random access memory 124 holds the weight, bias and cell state (C) values needed by the neural processing units of the neural network unit 121. Within each group of four columns corresponding to a neural processing unit group 4901 (e.g., columns 0-3, columns 4-7, columns 8-11, and so on through columns 508-511): (1) columns whose index mod 4 equals 3 hold the Wc, Uc, Bc and C values in rows 0, 1, 2 and 6, respectively; (2) columns whose index mod 4 equals 2 hold the Wo, Uo and Bo values in rows 3, 4 and 5, respectively; (3) columns whose index mod 4 equals 1 hold the Wf, Uf and Bf values in rows 3, 4 and 5, respectively; and (4) columns whose index mod 4 equals 0 hold the Wi, Ui and Bi values in rows 3, 4 and 5, respectively. Preferably, the weight and bias values - Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo, Bo (in rows 0 through 5) - are written/populated into the weight random access memory 124 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 51. Preferably, the intermediate C values are written/populated into the weight random access memory 124, and read/used, by the non-architectural program running on the neural network unit 121, as described below.
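The four-way layout above can be expressed as a lookup keyed by the column's position within its group (column index mod 4). This is a reading of the text into a hypothetical helper, useful as a cross-check of the (row, column) placements:

```python
def weight_ram_value(row, col):
    """Name of the quantity the Figure 50 layout stores at (row, col) of
    the weight RAM 124, based on col % 4 (the NPU's lane within its group).
    Returns None for slots the text leaves unused. Interpretation only."""
    lane = col % 4
    table = {
        3: {0: "Wc", 1: "Uc", 2: "Bc", 6: "C"},
        2: {3: "Wo", 4: "Uo", 5: "Bo"},
        1: {3: "Wf", 4: "Uf", 5: "Bf"},
        0: {3: "Wi", 4: "Ui", 5: "Bi"},
    }
    return table[lane].get(row)
```

This makes visible why the address-3 instruction of Figure 51 reads row 0 (only lane 3's Wc product matters) while the address-7 instruction reads row 3 (lanes 0, 1 and 2 pick up Wi, Wf and Wo at once).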
The example of Figure 50 assumes that the architectural program performs the following steps: (1) populates the data random access memory 122 with the input X values for 31 different time steps (rows 2, 4, and so on through 62); (2) starts the non-architectural program of Figure 51; (3) detects that the non-architectural program has completed; (4) reads the output H values from the data random access memory 122 (rows 3, 5, and so on through 63); and (5) repeats steps (1) through (4) as many times as needed to complete a task, e.g., the computations needed to recognize an utterance of a mobile phone user.
In an alternative approach, the architectural program performs the following steps: (1) populates the data random access memory 122 with the input X values for a single time step (e.g., row 2); (2) starts the non-architectural program (a modified version of the Figure 51 non-architectural program that does not loop and accesses only a single pair of rows of the data random access memory 122); (3) detects that the non-architectural program has completed; (4) reads the output H values from the data random access memory 122 (e.g., row 3); and (5) repeats steps (1) through (4) as many times as needed to complete a task. Which of the two approaches is preferable depends upon the manner in which the input X values to the LSTM layer are sampled. For example, if the task tolerates sampling the input and performing the computations over multiple time steps (e.g., on the order of 31 time steps), the first approach may be preferable since it is likely more computational-resource-efficient and/or higher-performance; whereas, if the task tolerates sampling at only a single time step, the second approach may be required.
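The first, batched approach can be outlined in step form, with the MTNN/MFNN interactions abstracted into callbacks. All names here are hypothetical stand-ins for architectural-program instruction sequences:

```python
def run_lstm_batch(n_steps, write_x_row, start_program, wait_done, read_h_row):
    """Sketch of the batched architectural-program approach: populate the
    X rows for every time step, run the Figure 51 non-architectural
    program once, then collect the H rows it produced."""
    for t in range(n_steps):
        write_x_row(2 * t + 2)          # X values go to rows 2, 4, ..., 62
    start_program()
    wait_done()
    return [read_h_row(2 * t + 3)       # H values come from rows 3, 5, ..., 63
            for t in range(n_steps)]

# Drive the sketch with recording stubs to see the row traffic:
x_rows = []
h_rows = run_lstm_batch(31, x_rows.append, lambda: None, lambda: None,
                        lambda r: r)
```

The single-time-step approach would collapse the two loops to one iteration each and re-enter the whole function once per step.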
A third embodiment is similar to the second approach but, rather than using a single pair of rows of the data random access memory 122, the non-architectural program of this approach uses multiple pairs of rows, i.e., a different pair of rows for each time step, in this respect similar to the first approach. Preferably, the architectural program of this third embodiment includes a step before step (2) in which it updates the non-architectural program before starting it, e.g., by updating the data random access memory 122 row in the instruction at address 1 to point to the next pair of rows.
As shown, for the neural processing units 0 through 511 of the neural network unit 121, the output buffer 1104 holds intermediate values of the cell output (H), candidate cell state (C'), input gate (I), forget gate (F), output gate (O), cell state (C) and tanh(C) after the execution of the instructions at different addresses of the non-architectural program of Figure 51. Within each output buffer word group (e.g., the group of four words of the output buffer 1104 corresponding to a neural processing unit group 4901, such as words 0-3, 4-7, 8-11, and so on through 508-511), the word whose index mod 4 equals 3 is denoted OUTBUF[3], the word whose index mod 4 equals 2 is denoted OUTBUF[2], the word whose index mod 4 equals 1 is denoted OUTBUF[1], and the word whose index mod 4 equals 0 is denoted OUTBUF[0].
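The naming convention just given is a simple function of the word index; a one-line sketch (hypothetical helper name) makes the mapping explicit:

```python
def outbuf_word_name(w):
    """Group membership and per-group name of word w of the output buffer
    1104: word w belongs to NPU group w // 4 and is OUTBUF[w % 4] there."""
    return w // 4, "OUTBUF[%d]" % (w % 4)
```

So word 511, for example, is OUTBUF[3] of neural processing unit group 127, the word that carries C', C and tanh(C) for LSTM cell 127 at various points in the program.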
As shown, after the instruction at address 2 of the non-architectural program of Figure 51 is executed, for each neural processing unit group 4901, all four words of the output buffer 1104 are written with the initial cell output (H) value of the corresponding LSTM cell 4600. After the instruction at address 6 is executed, for each neural processing unit group 4901, the OUTBUF[3] word of the output buffer 1104 is written with the candidate cell state (C') value of the corresponding LSTM cell 4600, while the other three words of the output buffer 1104 retain their previous values. After the instruction at address 10 is executed, for each neural processing unit group 4901, the OUTBUF[0] word of the output buffer 1104 is written with the input gate (I) value of the corresponding LSTM cell 4600, the OUTBUF[1] word is written with the forget gate (F) value, and the OUTBUF[2] word is written with the output gate (O) value, while the OUTBUF[3] word retains its previous value. After the instruction at address 13 is executed, for each neural processing unit group 4901, the OUTBUF[3] word of the output buffer 1104 is written with the new cell state (C) value of the corresponding LSTM cell 4600 (the C value in slot 3 of the output buffer 1104 is also written to row 6 of the weight random access memory 124, as described below with respect to Figure 51), while the other three words of the output buffer 1104 retain their previous values. After the instruction at address 14 is executed, for each neural processing unit group 4901, the OUTBUF[3] word of the output buffer 1104 is written with the tanh(C) value of the corresponding LSTM cell 4600, while the other three words of the output buffer 1104 retain their previous values. After the instruction at address 16 is executed, for each neural processing unit group 4901, all four words of the output buffer 1104 are written with the new cell output (H) value of the corresponding LSTM cell 4600. The flow from address 6 through address 16 (i.e., excluding the execution of address 2, which is not part of the program loop) is repeated thirty more times as the loop instruction at address 17 loops back to address 3.
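The per-address effects above can be tabulated. The mapping below, from instruction address to the OUTBUF word indices it updates, is a reading of the description (all other words retain their values at each step):

```python
# Which OUTBUF word indices each Figure 51 instruction address updates,
# per the description above. Interpretation of the figure, not hardware.
OUTBUF_UPDATES = {
    2:  (0, 1, 2, 3),   # initial cell output H -> all four words
    6:  (3,),           # candidate cell state C'
    10: (0, 1, 2),      # gates I, F, O
    13: (3,),           # new cell state C (also written to weight RAM row 6)
    14: (3,),           # tanh(C)
    16: (0, 1, 2, 3),   # new cell output H -> all four words
}
```

Note that every word index not listed for an address is exactly what the corresponding output instruction's MASK operand covers.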
Figure 54... Figure 51 is a table illustrating a program stored in the program memory 129 of the neural network unit 121; the program is executed by the neural network unit 121 of Figure 49 and uses data and weights according to the layout of Figure 50 to accomplish the computations associated with an LSTM cell layer. The example program of Figure 51 includes 18 non-architectural instructions at addresses 0 through 17, respectively. The instruction at address 0 is an initialize instruction that clears the accumulator 202 and initializes the loop counter 3804 to a value of 31 to cause the loop body (the instructions at addresses 3 through 17) to be performed 31 times. The initialize instruction also initializes the to-be-written row of the data random access memory 122 (e.g., register 2606 of Figures 26/39) to a value of 1, which increases to 3 after the first execution of the instruction at address 16. Preferably, the initialize instruction also puts the neural network unit 121 into a wide configuration such that the neural network unit 121 is configured as 512 neural processing units 126. As described in the following paragraphs, during the execution of the instructions at addresses 0 through 17, the 128 neural processing unit groups 4901 formed from the 512 neural processing units 126 operate as 128 corresponding LSTM cells 4600.
The instructions at addresses 1 and 2 are outside the program loop and execute only once. They generate the initial cell output (H) value (e.g., a zero value) and write it to all words of the output buffer 1104. The instruction at address 1 reads the initial H values from row 0 of the data random access memory 122 and puts them into the accumulator 202, which was cleared by the instruction at address 0. The instruction at address 2 (OUTPUT PASSTHRU, NOP, CLR ACC) passes the accumulator 202 value through to the output buffer 1104, as shown in Figure 50. The "NOP" designation in the output instruction at address 2 (and in the other output instructions of Figure 51) indicates that the output value is written only to the output buffer 1104 and not to memory, i.e., not to the data random access memory 122 or the weight random access memory 124. The instruction at address 2 also clears the accumulator 202.
The instructions at addresses 3 through 17 are within the program loop, which executes a number of times equal to the loop count (e.g., 31).
Each execution of the instructions at addresses 3 through 6 computes the candidate cell state (C') value of the current time step and writes it to the OUTBUF[3] word, which is used by the instruction at address 11. More specifically, the multiply-accumulate instruction at address 3 reads the cell input (X) value associated with the current time step from the current read row of the data random access memory 122 (e.g., row 2, 4, 6, and so on through 62), reads the Wc value from row 0 of the weight random access memory 124, and multiplies them to generate a product added to the accumulator 202, which was cleared by the instruction at address 2.
The multiply-accumulate instruction at address 4 (MULT-ACCUM OUTBUF[0], WR ROW 1) reads the H value from the OUTBUF[0] word (all four neural processing units 126 of the neural processing unit group 4901 do so), reads the Uc value from row 1 of the weight random access memory 124, and multiplies them to generate a second product added to the accumulator 202.
The add-weight-word-to-accumulator instruction at address 5 (ADD_W_ACC WR ROW 2) reads the Bc value from row 2 of the weight random access memory 124 and adds it to the accumulator 202.
The output instruction at address 6 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) performs a tanh activation function on the accumulator 202 value, and the result is written only to the OUTBUF[3] word (i.e., only the neural processing unit 126 of the neural processing unit group 4901 whose index mod 4 equals 3 writes its result); the accumulator 202 is also cleared. That is, the output instruction at address 6 masks the OUTBUF[0], OUTBUF[1] and OUTBUF[2] words (as indicated by the MASK[0:2] nomenclature), leaving their current values intact, as shown in Figure 50. Additionally, the output instruction at address 6 does not write to memory (as indicated by the NOP nomenclature).
Each execution of the instructions at addresses 7 through 10 computes the input gate (I), forget gate (F) and output gate (O) values of the current time step and writes them to the OUTBUF[0], OUTBUF[1] and OUTBUF[2] words, respectively, which are used by the instructions at addresses 11, 12 and 15. More specifically, the multiply-accumulate instruction at address 7 reads the cell input (X) value associated with the current time step from the current read row of the data random access memory 122 (e.g., row 2, 4, 6, and so on through 62), reads the Wi, Wf and Wo values from row 3 of the weight random access memory 124, and multiplies them to generate a product added to the accumulator 202, which was cleared by the instruction at address 6. More specifically, within a neural processing unit group 4901, the neural processing unit 126 whose index mod 4 equals 0 computes the product of X and Wi, the neural processing unit 126 whose index mod 4 equals 1 computes the product of X and Wf, and the neural processing unit 126 whose index mod 4 equals 2 computes the product of X and Wo.
The multiply-accumulate instruction at address 8 reads the H value from the OUTBUF[0] word (all four neural processing units 126 of the neural processing unit group 4901 do so), reads the Ui, Uf and Uo values from row 4 of the weight random access memory 124, and multiplies them to generate a second product added to the accumulator 202. More specifically, within a neural processing unit group 4901, the neural processing unit 126 whose index mod 4 equals 0 computes the product of H and Ui, the neural processing unit 126 whose index mod 4 equals 1 computes the product of H and Uf, and the neural processing unit 126 whose index mod 4 equals 2 computes the product of H and Uo.
The add-weight-word-to-accumulator instruction at address 9 (ADD_W_ACC WR ROW 5) reads the Bi, Bf and Bo values from row 5 of the weight random access memory 124 and adds them to the accumulator 202. More specifically, within a neural processing unit group 4901, the neural processing unit 126 whose index mod 4 equals 0 performs the addition of the Bi value, the neural processing unit 126 whose index mod 4 equals 1 performs the addition of the Bf value, and the neural processing unit 126 whose index mod 4 equals 2 performs the addition of the Bo value.
The output instruction at address 10 (OUTPUT SIGMOID, NOP, MASK[3], CLR ACC) performs a sigmoid activation function on the accumulator 202 value, writes the computed I, F and O values to the OUTBUF[0], OUTBUF[1] and OUTBUF[2] words, respectively, clears the accumulator 202, and does not write to memory. That is, the output instruction at address 10 masks the OUTBUF[3] word (as indicated by the MASK[3] nomenclature), leaving its current value (which is C') intact, as shown in Figure 50.
Each execution of the instructions at addresses 11 through 13 computes the new cell state (C) value generated by the current time step and writes it to row 6 of the weight random access memory 124 for use by the next time step (i.e., by the instruction at address 12 during the next iteration of the loop); more specifically, the C value is written to the word of row 6 whose index mod 4 equals 3 among the four columns corresponding to the neural processing unit group 4901. Additionally, each execution of the instruction at address 14 writes the tanh(C) value to OUTBUF[3] for use by the instruction at address 15.
More specifically, the multiply-accumulate instruction at address 11 (MULT-ACCUM OUTBUF[0], OUTBUF[3]) reads the input gate (I) value from the OUTBUF[0] word, reads the candidate cell state (C') value from the OUTBUF[3] word, and multiplies them to generate a first product added to the accumulator 202, which was cleared by the instruction at address 10. More specifically, each of the four neural processing units 126 of a neural processing unit group 4901 computes the first product of the I value and the C' value.
The multiply-accumulate instruction at address 12 (MULT-ACCUM OUTBUF[1], WR ROW 6) instructs the neural processing units 126 to read the forget gate (F) value from the OUTBUF[1] word, read their respective words from row 6 of the weight random access memory 124, and multiply them to generate a second product added to the first product produced by the instruction at address 11 in the accumulator 202. More specifically, for the neural processing unit 126 of the neural processing unit group 4901 whose index mod 4 equals 3, the word read from row 6 is the current cell state (C) value computed in the previous time step, so that the sum of the first and second products is the new cell state (C). However, for the other three neural processing units 126 of the neural processing unit group 4901, the words read from row 6 are don't-care values, since the accumulated values they generate will not be used, i.e., will not be put into the output buffer 1104 by the instructions at addresses 13 and 14, and will be cleared by the instruction at address 14. That is, only the new cell state (C) value generated by the neural processing unit 126 of the neural processing unit group 4901 whose index mod 4 equals 3 is used, namely by the instructions at addresses 13 and 14. For the second through thirty-first executions of the instruction at address 12, the C value read from row 6 of the weight random access memory 124 is the value written by the instruction at address 13 during the previous iteration of the loop body. However, for the first execution of the instruction at address 12, the C value in row 6 is the initial value written either by the architectural program before starting the non-architectural program of Figure 51 or by a modified version of the non-architectural program.
The output instruction at address 13 (OUTPUT PASSTHRU, WR ROW 6, MASK[0:2]) passes the accumulator 202 value, i.e., the computed C value, through only to the OUTBUF[3] word (i.e., only the neural processing unit 126 of the neural processing unit group 4901 whose index mod 4 equals 3 writes its computed C value to the output buffer 1104), and row 6 of the weight random access memory 124 is written with the updated output buffer 1104, as shown in Figure 50. That is, the output instruction at address 13 masks the OUTBUF[0], OUTBUF[1] and OUTBUF[2] words, leaving their current values (the I, F and O values) intact. As described above, only the C value in the word of row 6 whose index mod 4 equals 3 among the four columns corresponding to a neural processing unit group 4901 is used, namely by the instruction at address 12; hence, the non-architectural program does not care about the values in columns 0-2, columns 4-6, and so on through columns 508-510 of row 6 of the weight random access memory 124 (which are the I, F and O values), as shown in Figure 50.
The output instruction at address 14 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) performs a tanh activation function on the accumulator 202 value, writes the computed tanh(C) value to the OUTBUF[3] word, clears the accumulator 202, and does not write to memory. Like the output instruction at address 13, the output instruction at address 14 masks the OUTBUF[0], OUTBUF[1] and OUTBUF[2] words, leaving their original values intact, as shown in Figure 50.
Each execution of the instructions at addresses 15 and 16 computes the cell output (H) value generated by the current time step and writes it two rows past the current output row of the data random access memory 122, where it will be read by the architectural program and used for the next time step (i.e., by the instructions at addresses 4 and 8 during the next iteration of the loop). More specifically, the multiply-accumulate instruction at address 15 reads the output gate (O) value from the OUTBUF[2] word, reads the tanh(C) value from the OUTBUF[3] word, and multiplies them to generate a product added to the accumulator 202, which was cleared by the instruction at address 14. More specifically, each of the four neural processing units 126 of a neural processing unit group 4901 computes the product of the O value and tanh(C).
The output instruction at address 16 passes the accumulator 202 value through and writes the computed H values to row 3 on its first execution, to row 5 on its second execution, and so on through its thirty-first execution, which writes the computed H values to row 63, as shown in Figure 50; these values are subsequently used by the instructions at addresses 4 and 8. Additionally, as shown in Figure 50, the computed H values are put into the output buffer 1104 for subsequent use by the instructions at addresses 4 and 8. The output instruction at address 16 also clears the accumulator 202. In one embodiment, the LSTM cell 4600 is designed such that the output instruction at address 16 (and/or the output instruction at address 22 of Figure 48) has an activation function, e.g., sigmoid or tanh, rather than passing the accumulator 202 value through.
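Tying the loop body together, one iteration for a single LSTM cell amounts to the following scalar math, in the order the non-architectural program performs it. This is a plain-Python sketch of the arithmetic, not a model of the hardware or of its fixed-point formats:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(X, H_prev, C_prev, p):
    """One pass of the Figure 51 loop body for one LSTM cell 4600.
    p holds the twelve weights/biases (Wc, Uc, Bc, Wi, Ui, Bi,
    Wf, Uf, Bf, Wo, Uo, Bo). Sketch of the math only."""
    C_cand = math.tanh(p["Wc"] * X + p["Uc"] * H_prev + p["Bc"])    # addrs 3-6
    I = sigmoid(p["Wi"] * X + p["Ui"] * H_prev + p["Bi"])           # addrs 7-10
    F = sigmoid(p["Wf"] * X + p["Uf"] * H_prev + p["Bf"])
    O = sigmoid(p["Wo"] * X + p["Uo"] * H_prev + p["Bo"])
    C = I * C_cand + F * C_prev                                     # addrs 11-13
    H = O * math.tanh(C)                                            # addrs 14-16
    return H, C
```

In the hardware, the three sigmoid lines run concurrently in lanes 0 through 2 of a neural processing unit group, while the C_cand, C and tanh(C) lines occupy lane 3.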
The loop instruction at address 17 decrements the loop counter 3804 and, if the new loop counter 3804 value is greater than zero, loops back to the instruction at address 3.
It may thus be observed that, because of the feedback and masking capability of the output buffer 1104 in the neural network unit 121 embodiment of Figure 49, the number of instructions in the loop body of the non-architectural program of Figure 51 is reduced by approximately 34% relative to the non-architectural instructions of Figure 48. Additionally, because of the feedback and masking capability of the output buffer 1104 in the neural network unit 121 embodiment of Figure 49, the number of time steps accommodated by the memory layout of the data random access memory 122 used by the Figure 51 non-architectural program is approximately three times that of Figure 48. These improvements may be helpful to certain architectural program applications that use the neural network unit 121 to perform LSTM cell layer computations, particularly applications in which the number of LSTM cells 4600 in the LSTM layer is less than or equal to 128.
The embodiments of Figures 47 through 51 assume that the weight and bias values remain the same across the time steps. However, the present invention is not limited in this respect; other embodiments in which the weight and bias values vary with the time steps also fall within the scope of the invention. In such embodiments, the weight random access memory 124 is not populated with a single set of weight and bias values as in Figures 47 through 50, but is instead populated with a different set of weight and bias values for each time step, and the weight random access memory 124 addresses of the non-architectural programs of Figures 48 through 51 are adjusted accordingly.
Generally speaking, in the embodiments of Figures 47 through 51 described above, the weight, bias and intermediate values (e.g., the C and C' values) are stored in the weight random access memory 124, while the input and output values (e.g., the X and H values) are stored in the data random access memory 122. This is advantageous for embodiments in which the data random access memory 122 is dual-ported and the weight random access memory 124 is single-ported, since there is more traffic from the non-architectural and architectural programs to the data random access memory 122. However, since the weight random access memory 124 is larger, in another embodiment of the invention the memories to which the non-architectural and architectural programs write the values are swapped (i.e., the data random access memory 122 and the weight random access memory 124 are interchanged). That is, the W, U, B, C', tanh(C) and C values are stored in the data random access memory 122 and the X, H, I, F and O values are stored in the weight random access memory 124 (a modified embodiment of Figure 47); and the W, U, B and C values are stored in the data random access memory 122 and the X and H values are stored in the weight random access memory 124 (a modified embodiment of Figure 50). Because the weight random access memory 124 is larger, these embodiments can process more time steps in a batch. For applications that use the neural network unit 121 to perform computations via architectural programs, this may be advantageous for certain applications that benefit from more time steps and for which the single-ported memory (e.g., the weight random access memory 124) provides sufficient bandwidth.
Figure 52 is a block diagram illustrating an embodiment of the neural network unit 121 with masking and feedback capability of the output buffer within the neural processing unit groups, and with a shared activation function unit 1112. The neural network unit 121 of Figure 52 is similar to the neural network unit 121 of Figure 49, and like-numbered elements are similar. However, the four activation function units 212 of Figure 49 are replaced in the present embodiment by a single shared activation function unit 1112, which receives the four outputs 217 of the four accumulators 202 and generates four outputs to the words OUTBUF[0], OUTBUF[1], OUTBUF[2] and OUTBUF[3]. The neural network unit 121 of Figure 52 operates in a manner similar to the embodiments described above with respect to Figures 49 through 51, and operates the shared activation function unit 1112 in a manner similar to the embodiments described above with respect to Figures 11 through 13.
Figure 53 is a block diagram showing another embodiment of the layout of data within the data random access memory 122, weight random access memory 124 and output buffer 1104 of the neural network unit 121 of Figure 49 as it performs the computations associated with a layer of 128 long short-term memory (LSTM) cells 4600 of Figure 46. The example of Figure 53 is similar to the example of Figure 50. However, in Figure 53, the Wi, Wf and Wo values are in row 0 (rather than in row 3 as in Figure 50); the Ui, Uf and Uo values are in row 1 (rather than in row 4); the Bi, Bf and Bo values are in row 2 (rather than in row 5); and the C values are in row 3 (rather than in row 6). Additionally, the contents of the output buffer 1104 of Figure 53 are similar to those of Figure 50; however, because of the differences between the non-architectural programs of Figures 54 and 51, the contents of the third column (i.e., the I, F, O and C' values) appear in the output buffer 1104 after execution of the instruction at address 7 (rather than at address 10 as in Figure 50); the contents of the fourth column (i.e., the I, F, O and C values) appear in the output buffer 1104 after execution of the instruction at address 10 (rather than at address 13 as in Figure 50); the contents of the fifth column (i.e., the I, F, O and tanh(C) values) appear in the output buffer 1104 after execution of the instruction at address 11 (rather than at address 14 as in Figure 50); and the contents of the sixth column (i.e., the H values) appear in the output buffer 1104 after execution of the instruction at address 13 (rather than at address 16 as in Figure 50), as described in more detail below.
Figure 54 is a table showing a program stored in the program memory 129 of the neural network unit 121, which is executed by the neural network unit 121 of Figure 49 and uses the data and weights according to the layout of Figure 53 to perform the computations associated with a long short-term memory cell layer. The example program of Figure 54 is similar to the program of Figure 51. More specifically, the instructions at addresses 0 through 5 are identical in Figures 54 and 51; the instructions at addresses 7 and 8 of Figure 54 are the same as the instructions at addresses 10 and 11 of Figure 51; and the instructions at addresses 10 through 14 of Figure 54 are the same as the instructions at addresses 13 through 17 of Figure 51.
However, the instruction at address 6 of Figure 54 does not clear the accumulator 202 (whereas the instruction at address 6 of Figure 51 does). Furthermore, the instructions at addresses 7 through 9 of Figure 51 are not present in the non-architectural program of Figure 54. Finally, the instruction at address 9 of Figure 54 is the same as the instruction at address 12 of Figure 51, except that the instruction at address 9 of Figure 54 reads row 3 of the weight random access memory 124, whereas the instruction at address 12 of Figure 51 reads row 6 of the weight random access memory.
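The I, F, O, C', C, tanh(C) and H values manipulated by the programs of Figures 51 and 54 correspond to the standard LSTM cell equations. The following is a minimal scalar sketch of those equations for orientation only — it is not the non-architectural program itself, and the candidate-gate parameter names Wc, Uc and Bc are introduced here purely for illustration:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(X, H, C,
                   Wi, Ui, Bi,   # input-gate parameters
                   Wf, Uf, Bf,   # forget-gate parameters
                   Wc, Uc, Bc,   # candidate parameters (names assumed for illustration)
                   Wo, Uo, Bo):  # output-gate parameters
    """One LSTM cell time step yielding the I, F, O, C', C and H values."""
    I = sigmoid(Wi * X + Ui * H + Bi)           # input gate
    F = sigmoid(Wf * X + Uf * H + Bf)           # forget gate
    O = sigmoid(Wo * X + Uo * H + Bo)           # output gate
    C_prime = math.tanh(Wc * X + Uc * H + Bc)   # candidate cell state C'
    C_new = F * C + I * C_prime                 # new cell state C
    H_new = O * math.tanh(C_new)                # cell output H
    return I, F, O, C_prime, C_new, H_new
```

In the hardware described herein, each of these scalar products is instead a wide multiply-accumulate performed in parallel by the 512 neural processing units 126.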
Because of the differences between the non-architectural programs of Figures 54 and 51, the layout of Figure 53 uses three fewer rows of the weight random access memory 124, and the program loop contains three fewer instructions. Indeed, the loop body of the non-architectural program of Figure 54 is essentially only half the size of the loop body of the non-architectural program of Figure 48, and approximately 80% the size of the loop body of the non-architectural program of Figure 51.
Figure 55 is a block diagram showing portions of a neural processing unit 126 according to another embodiment of the present invention. More specifically, for a single one of the neural processing units 126 of Figure 49, the figure shows the multiplexed register 208 and its associated inputs 207, 211 and 4905, and the multiplexed register 705 and its associated inputs 206, 711 and 4907. In addition to the inputs of Figure 49, the multiplexed register 208 and the multiplexed register 705 of the neural processing unit 126 each receive an index_within_group input 5599. The index_within_group input 5599 indicates the index of the particular neural processing unit 126 within its neural processing unit group 4901. Thus, for example, in an embodiment in which each neural processing unit group 4901 has four neural processing units 126, within each neural processing unit group 4901, one of the neural processing units 126 receives a value of zero on its index_within_group input 5599, one of the neural processing units 126 receives a value of one on its index_within_group input 5599, one of the neural processing units 126 receives a value of two on its index_within_group input 5599, and one of the neural processing units 126 receives a value of three on its index_within_group input 5599.
In other words, the value received by a neural processing unit 126 on its index_within_group input 5599 is the remainder of its index within the neural network unit 121 divided by J, where J is the number of neural processing units 126 in a neural processing unit group 4901. Thus, for example, neural processing unit 73 receives a value of one on its index_within_group input 5599, neural processing unit 353 receives a value of three on its index_within_group input 5599, and neural processing unit 6 receives a value of two on its index_within_group input 5599.
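The index_within_group computation described above is a simple modulo operation, sketched here in Python purely for illustration (J = 4 follows the four-unit-per-group example in the text):

```python
J = 4  # neural processing units per group in this example

def index_within_group(npu_index: int, group_size: int = J) -> int:
    """Group-relative index of an NPU: its NNU-wide index modulo the group size."""
    return npu_index % group_size

print(index_within_group(73))  # -> 1
print(index_within_group(6))   # -> 2
```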
Additionally, when the control input 213 specifies a predetermined value, denoted here as "SELF", the multiplexed register 208 selects the output buffer 1104 output 4905 corresponding to the value of the index_within_group input 5599. Thus, when a non-architectural instruction specifies with the SELF value that data be received from the output buffer 1104 (denoted OUTBUF[SELF] in the instructions at addresses 2 and 7 of Figure 57), the multiplexed register 208 of each neural processing unit 126 receives its corresponding word from the output buffer 1104. Thus, for example, when the neural network unit 121 executes the non-architectural instructions at addresses 2 and 7 of Figure 57, the multiplexed register 208 of neural processing unit 73 selects the second (index 1) of the four inputs 4905 to receive word 73 from the output buffer 1104, the multiplexed register 208 of neural processing unit 353 selects the fourth (index 3) of the four inputs 4905 to receive word 353 from the output buffer 1104, and the multiplexed register 208 of neural processing unit 6 selects the third (index 2) of the four inputs 4905 to receive word 6 from the output buffer 1104. Although not used in the non-architectural program of Figure 57, a non-architectural instruction may also specify with the SELF value (OUTBUF[SELF]) that data be received from the output buffer 1104 by causing the control input 713 to specify the predetermined value, which causes the multiplexed register 705 of each neural processing unit 126 to receive its corresponding word from the output buffer 1104.
Figure 56 is a block diagram showing an example of the layout of data within the data random access memory 122 and weight random access memory 124 of the neural network unit 121 as it performs the computations associated with the Jordan recurrent neural network of Figure 43 while employing the embodiment of Figure 55. The layout of the weights within the weight random access memory 124 is the same as in the example of Figure 44. The layout of the values within the data random access memory 122 is similar to that of the example of Figure 44, except that in this example each time step has a corresponding pair of two memory rows holding the input layer node D values and the output layer node Y values, rather than a group of four rows as in the example of Figure 44. That is, in this example the hidden layer Z values and the content layer C values are not written to the data random access memory 122. Instead, the output buffer 1104 serves as a scratchpad for the hidden layer Z values and the content layer C values, as described in detail with respect to the non-architectural program of Figure 57. The OUTBUF[SELF] feedback feature of the output buffer 1104 described above enables the non-architectural program to run faster (by replacing two writes to and two reads from the data random access memory 122 with two writes to and two reads from the output buffer 1104) and reduces the amount of data random access memory 122 space used by each time step, which enables the data random access memory 122 of this embodiment to hold approximately twice as many time steps as the embodiments of Figures 44 and 45, namely 32 time steps, as shown in the figure.
Figure 57 is a table showing a program stored in the program memory 129 of the neural network unit 121, which is executed by the neural network unit 121 and uses the data and weights according to the layout of Figure 56 to implement a Jordan recurrent neural network. The non-architectural program of Figure 57 is similar to the non-architectural program of Figure 45, with the differences described below.
The example program of Figure 57 has 12 non-architectural instructions at addresses 0 through 11, respectively. The initialize instruction at address 0 clears the accumulator 202 and initializes the loop counter 3804 to a value of 32, causing the loop body (the instructions at addresses 2 through 11) to be executed 32 times. The output instruction at address 1 places the zero values of the accumulators 202 (cleared by the instruction at address 0) into the output buffer 1104. It may be observed that the 512 neural processing units 126 correspond to, and operate as, the 512 hidden layer nodes Z during execution of the instructions at addresses 2 through 6, and correspond to, and operate as, the 512 output layer nodes Y during execution of the instructions at addresses 7 through 10. That is, the 32 executions of the instructions at addresses 2 through 6 compute the hidden layer node Z values for the 32 corresponding time steps and place them into the output buffer 1104 for use by the corresponding 32 executions of the instructions at addresses 7 through 9, which compute the output layer node Y values of the 32 corresponding time steps and write them to the data random access memory 122, and for use by the corresponding 32 executions of the instruction at address 10, which places the content layer node C values of the 32 corresponding time steps into the output buffer 1104. (The content layer node C values of the 32nd time step placed into the output buffer 1104 are not used.)
On the first execution of the instructions at addresses 2 and 3 (ADD_D_ACC OUTBUF[SELF] and ADD_D_ACC ROTATE, COUNT=511), each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 content node C values of the output buffer 1104, which were generated and written by the execution of the instructions at addresses 0 through 1. On the second execution of the instructions at addresses 2 and 3, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 content node C values of the output buffer 1104, which were generated and written by the execution of the instructions at addresses 7 through 8 and 10. More specifically, the instruction at address 2 instructs the multiplexed register 208 of each neural processing unit 126 to select its corresponding output buffer 1104 word, as described above, and to add it to the accumulator 202; the instruction at address 3 instructs the neural processing units 126 to rotate the content node C values within the 512-word rotator formed by the collective operation of the connected multiplexed registers 208 of the 512 neural processing units, which enables each neural processing unit 126 to accumulate the 512 content node C values into its accumulator 202. The instruction at address 3 does not clear the accumulator 202, which enables the instructions at addresses 4 and 5 to add the input layer node D values (multiplied by their corresponding weights) to the content layer node C values accumulated by the instructions at addresses 2 and 3.
On each execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW+2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values of the data random access memory 122 row associated with the current time step (e.g., row 0 for time step 0, row 2 for time step 1, and so forth to row 62 for time step 31) by the respective column of weights of rows 0 through 511 of the weight random access memory 124 corresponding to the neural processing unit 126, to generate 512 products that, together with the accumulation of the 512 content node C values performed by the instructions at addresses 2 and 3, are accumulated into the accumulator 202 of the respective neural processing unit 126 to compute the hidden node Z layer values.
On each execution of the instruction at address 6 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to their corresponding words of the output buffer 1104, and the accumulators 202 are cleared.
During execution of the instructions at addresses 7 and 8 (MULT-ACCUM OUTBUF[SELF], WR ROW 512 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values of the output buffer 1104 (which were generated and written by the corresponding execution of the instructions at addresses 2 through 6) by the respective column of weights of rows 512 through 1023 of the weight random access memory 124 corresponding to the neural processing unit 126, to generate 512 products that are accumulated into the accumulator 202 of the respective neural processing unit 126.
On each execution of the instruction at address 9 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW+2), an activation function (e.g., hyperbolic tangent, sigmoid, rectify) is performed on the 512 accumulated values to compute the output node Y values, which are written to the row of the data random access memory 122 associated with the current time step (e.g., row 1 for time step 0, row 3 for time step 1, and so forth to row 63 for time step 31). The instruction at address 9 does not clear the accumulator 202.
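For reference, the activation functions named by the instruction at address 9 can be written as follows. This is a software illustration of the mathematics only, not the hardware activation function unit, and the rectify function is shown in its common ReLU form:

```python
import math

def tanh_act(x: float) -> float:
    """Hyperbolic tangent activation."""
    return math.tanh(x)

def sigmoid(x: float) -> float:
    """Sigmoid (S-shaped) activation."""
    return 1.0 / (1.0 + math.exp(-x))

def rectify(x: float) -> float:
    """Rectify activation (ReLU): clamp negative values to zero."""
    return max(0.0, x)

acc = 0.0  # an example accumulator 202 value
print(tanh_act(acc), sigmoid(acc), rectify(acc))  # -> 0.0 0.5 0.0
```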
On each execution of the instruction at address 10 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 values accumulated by the instructions at addresses 7 and 8 are placed into the output buffer 1104 for use by the next execution of the instructions at addresses 2 and 3, and the accumulators 202 are cleared.
The loop instruction at address 11 decrements the loop counter 3804 and, if the new loop counter 3804 value is greater than zero, loops back to the instruction at address 2.
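The loop at addresses 2 through 11 can be summarized by the following behavioral model — a toy-sized Python sketch under the conventions of the text (unweighted accumulation of the content values at addresses 2 and 3, no activation on the hidden values at address 6, and pre-activation accumulator values fed back as the content layer at address 10). It is a mathematical reference only, not the non-architectural program:

```python
import math

def jordan_steps(D_rows, W_in, W_out, n):
    """Toy model of the Figure 57 loop over time steps for n nodes."""
    C = [0.0] * n                 # content layer, zeroed like the address-1 output
    Y_rows = []
    for D in D_rows:              # one loop iteration per time step
        c_sum = sum(C)            # addresses 2-3: every node accumulates all C values
        Z = [c_sum + sum(W_in[j][i] * D[j] for j in range(n))
             for i in range(n)]   # addresses 4-6: add W.D, pass through (no activation)
        acc = [sum(W_out[j][i] * Z[j] for j in range(n))
               for i in range(n)]                       # addresses 7-8
        Y_rows.append([math.tanh(a) for a in acc])      # address 9: activation -> Y
        C = acc                   # address 10: pre-activation values become C
    return Y_rows
```

With n = 1, unit weights, and inputs [1.0] then [0.0], both time steps yield tanh(1.0), because the first step's pre-activation value feeds back as content into the second step.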
As described in the section corresponding to Figure 44, in the example in which the Jordan recurrent neural network is executed using the non-architectural program of Figure 57, although an activation function is applied to the accumulator 202 values to generate the output layer node Y values, this example assumes that the accumulator 202 values, rather than the actual output layer node Y values, are passed through to the content layer node C before the activation function is applied. However, for a Jordan recurrent neural network in which the activation function is applied to the accumulator 202 values to generate the content layer node C values, the instruction at address 10 would be removed from the non-architectural program of Figure 57. In the embodiments described herein, the Elman or Jordan recurrent neural networks have a single hidden node layer (e.g., Figures 40 and 42); however, it should be understood that embodiments of the processors 100 and neural network units 121 may be used to efficiently perform the computations associated with recurrent neural networks having multiple hidden layers, in a manner similar to that described herein.
As described above in the section corresponding to Figure 2, each neural processing unit 126 operates as a neuron in an artificial neural network, and all of the neural processing units 126 of the neural network unit 121 effectively compute the neuron output values of a layer of the network in a massively parallel fashion. The parallelism of the neural network unit — in particular, the rotator formed by the collective of the neural processing unit multiplexed registers — is not what would intuitively be expected from traditional approaches to computing the outputs of a neuron layer. More specifically, traditional approaches typically involve the computations associated with a single neuron, or a very small subset of neurons (e.g., performing the multiplies and adds with parallel arithmetic units), then continuing with the computations associated with the next neuron of the same layer, and so forth in serial fashion until the computations have been completed for all of the neurons of the layer. In contrast, during each clock cycle, all of the neural processing units 126 (neurons) of the neural network unit 121 of the present invention perform in parallel a small subset of the computations (e.g., a single multiply and accumulate) required to generate all of the neuron outputs. After approximately M clock cycles — where M is the number of nodes connected into the current layer — the neural network unit 121 will have computed the outputs of all of the neurons. For many artificial neural network configurations, because of the large number of neural processing units 126, the neural network unit 121 may be able to compute the neuron output values of all of the neurons of an entire layer at the end of the M clock cycles. As described herein, this computation is efficient for all kinds of artificial neural networks, including but not limited to feedforward and recurrent neural networks, such as Elman, Jordan and long short-term memory networks. Finally, although in the embodiments herein the neural network unit 121 is configured with 512 neural processing units 126 (e.g., taking a wide-word configuration) to perform the recurrent neural network computations, the present invention is not limited thereto; embodiments in which the neural network unit 121 is configured with 1024 neural processing units 126 (e.g., taking a narrow-word configuration) to perform the recurrent neural network computations, and neural network units 121 having numbers of neural processing units 126 other than the aforementioned 512 and 1024, also fall within the scope of the present invention.
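The rotator-based parallelism described in the preceding paragraph can be modeled in a few lines of software: each of N lanes multiply-accumulates its own weight against the data word currently in its multiplexed register, and the data words rotate by one lane per clock, so that after N clocks every accumulator holds a complete dot product. The sketch below uses toy sizes and assumes the weight rows have been pre-arranged (as the weight memory layouts described herein arrange them) so that the rotation lines up:

```python
def nnu_layer(data_row, weight_rows):
    """Behavioral model of N NPUs using a mux-reg rotator: after N clocks,
    each accumulator holds the dot product for one neuron.
    weight_rows[clk][npu] is assumed pre-arranged diagonally so that the
    weight read at clock clk matches the data word rotated into lane npu."""
    n = len(data_row)
    acc = [0.0] * n                       # one accumulator 202 per NPU
    words = list(data_row)                # words held in the mux-reg rotator
    for clk in range(n):                  # one multiply-accumulate per clock
        for npu in range(n):              # ...performed by all NPUs in parallel
            acc[npu] += weight_rows[clk][npu] * words[npu]
        words = words[-1:] + words[:-1]   # rotate the data words by one NPU
    return acc
```

For example, with a 3x3 weight matrix Wmat (input j to neuron i) stored diagonally as weight_rows[clk][i] = Wmat[(i - clk) % 3][i], the result equals the ordinary matrix-vector product.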
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the scope within which the invention may be practiced; simple equivalent changes and modifications made according to the claims and the description of the invention all remain within the scope covered by the present patent. For example, software can perform the functions, fabrication, modeling, simulation, description and/or testing of the apparatus and methods described herein. This can be accomplished through general programming languages (e.g., C, C++), hardware description languages (HDL) including Verilog HDL, VHDL, and so forth, or other available programs. Such software can be disposed in any known computer usable medium, such as magnetic tape, semiconductor, magnetic disk, optical disc (e.g., CD-ROM, DVD-ROM, and so forth), or a network, wireless or other communications medium. Embodiments of the apparatus and methods described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (e.g., embodied in a hardware description language), and transformed to hardware through the fabrication of integrated circuits. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software. Thus, none of the embodiments described herein is intended to limit the scope of the present invention. Additionally, the present invention may be applied to microprocessor devices of general-purpose computers. Finally, those of ordinary skill in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the scope of the invention.

Claims (22)

1. A neural network unit, characterized in that it comprises:
an array of N neural processing units (NPU), each neural processing unit comprising:
an arithmetic unit and an accumulator; and
first and second multiplexed registers, having respective first and second outputs that are received by the arithmetic unit and by the respective first and second multiplexed registers of an adjacent neural processing unit, wherein the first multiplexed registers of the N neural processing units are selectively collectively operable as a first N-word rotator and the second multiplexed registers of the N neural processing units are selectively collectively operable as a second N-word rotator;
a first memory that holds rows of N weight words and provides a row of the N weight words to the corresponding neural processing units of the N neural processing units; and
a second memory that holds rows of N data words and provides a row of the N data words to the corresponding neural processing units of the N neural processing units;
wherein the neural network unit is programmable to cause the array of neural processing units selectively to:
perform multiply-accumulate operations on multiple rows of N weight words received from the first memory and a row of N data words received from the second memory using the second N-word rotator;
perform a convolution of multiple rows of N weight words received from the first memory, using the first N-word rotator, with multiple rows of N data words received from the second memory, wherein the multiple rows of weight words hold a data matrix and the multiple rows of data words hold the elements of a convolution kernel; and
perform pooling operations on multiple rows of N weight words received from the first memory using the first N-word rotator.
2. The neural network unit according to claim 1, characterized in that, when the neural network unit is programmed to cause the array of neural processing units to perform the multiply-accumulate operations, the neural network unit is further programmable to perform an activation function on the results of the multiply-accumulate operations to generate N second results, the results of the multiply-accumulate operations being held in the accumulator of each respective neural processing unit of the N neural processing units.
3. The neural network unit according to claim 2, characterized in that the neural network unit is further programmable to write the N second results to a row of the second memory.
4. The neural network unit according to claim 1, characterized in that, when the neural network unit is programmed to cause the array of neural processing units to perform the convolution, each row of the multiple rows of data words is populated with the corresponding element of the elements of the convolution kernel.
5. The neural network unit according to claim 1, characterized in that, when the neural network unit is programmed to cause the array of neural processing units to perform the convolution, the neural network unit is further programmable to perform a divide operation on the results of the convolution to generate second results, the results of the convolution being held in the accumulator of each respective neural processing unit of the N neural processing units.
6. The neural network unit according to claim 5, characterized in that the neural network unit is further programmable to write the second results to rows of the first memory.
7. The neural network unit according to claim 1, characterized in that, when the neural network unit is programmed to cause the array of neural processing units to perform the pooling operations, the neural network unit is further programmable to cause the arithmetic unit of each respective neural processing unit of a subset of the N neural processing units to select a maximum of the data words of a corresponding sub-region of the first memory for storage to the accumulators of the subset of the N neural processing units to generate second results.
8. The neural network unit according to claim 7, characterized in that the neural network unit is further programmable to write the second results to rows of the first memory.
9. The neural network unit according to claim 8, characterized in that the rows of the first memory are selected based on a ratio of the number of rows of the first memory that hold the data words to the number of rows of the sub-region.
10. The neural network unit according to claim 7, characterized in that the subset of the N neural processing units is selected based on a ratio of N to the number of columns of the sub-region.
11. The neural network unit according to claim 1, characterized in that, when the neural network unit is programmed to cause the array of neural processing units to perform the pooling operations, the neural network unit is further programmable to cause the arithmetic unit of each respective neural processing unit of a subset of the N neural processing units to accumulate a sum of the data words of a corresponding sub-region of the first memory for storage to the accumulator and to divide the sum by the number of data words of the corresponding sub-region to generate an average.
12. The neural network unit according to claim 1, characterized in that it further comprises:
a program memory, to which a processor comprising the neural network unit writes a program executable by the neural network unit to instruct the neural network unit to perform the multiply-accumulate operations, the convolution and the pooling operations.
13. The neural network unit according to claim 1, characterized in that the arithmetic unit is operable to:
perform a multiply operation on a weight word received from the first memory and a data word received from the second memory to generate a product, perform an accumulate operation on the product for storage to the accumulator, and repeat the multiply operation and the accumulate operation on multiple weight words and data words; and
perform a select operation on a weight word received from the first memory and a current value of the accumulator to store the maximum of the two to the accumulator, and repeat the select operation on multiple weight words.
14. A method for operating a neural network unit (NNU), characterized in that the neural network unit comprises an array of N neural processing units (NPU), each neural processing unit comprising an arithmetic unit, an accumulator, and first and second multiplexed registers, the first and second multiplexed registers having respective first and second outputs that are received by the arithmetic unit and by the respective first and second multiplexed registers of an adjacent neural processing unit, the first multiplexed registers of the N neural processing units being selectively collectively operable as a first N-word rotator and the second multiplexed registers being selectively collectively operable as a second N-word rotator, and the method comprises:
providing, by a first memory, a row of N weight words to the corresponding neural processing units of the N neural processing units;
providing, by a second memory, a row of N data words to the corresponding neural processing units of the N neural processing units; and
programming the neural network unit to cause the array of neural processing units selectively to:
in a first instance, perform multiply-accumulate operations on multiple rows of N weight words received from the first memory and a row of N data words received from the second memory using the second N-word rotator;
in a second instance, perform a convolution of multiple rows of N weight words received from the first memory, using the first N-word rotator, with multiple rows of N data words received from the second memory, wherein the multiple rows of weight words hold a data matrix and the multiple rows of data words hold the elements of a convolution kernel; and
in a third instance, perform pooling operations on multiple rows of N weight words received from the first memory using the first N-word rotator.
15. The method according to claim 14, characterized in that it further comprises:
programming the neural network unit to perform an activation function on the results of the multiply-accumulate operations to generate N second results, the results of the multiply-accumulate operations being held in the accumulator of each respective neural processing unit of the N neural processing units.
16. The method according to claim 14, characterized in that it further comprises:
populating each row of the multiple rows of data words with the corresponding element of the elements of the convolution kernel.
17. The method according to claim 14, characterized in that it further comprises:
programming the neural network unit to perform a divide operation on the results of the convolution to generate second results, the results of the convolution being held in the accumulator of each respective neural processing unit of the N neural processing units.
18. The method according to claim 14, characterized in that it further comprises:
programming the neural network unit to cause the arithmetic unit of each respective neural processing unit of a subset of the N neural processing units to select a maximum of the data words of a corresponding sub-region of the first memory for storage to the accumulators of the subset of the N neural processing units to generate second results.
19. The method according to claim 14, characterized in that it further comprises:
programming the neural network unit to cause the arithmetic unit of each respective neural processing unit of a subset of the N neural processing units to accumulate a sum of the data words of a corresponding sub-region of the first memory for storage to the accumulator and to divide the sum by the number of data words of the corresponding sub-region to generate an average.
20. The method according to claim 14, characterized in that it further comprises:
writing, by a processor comprising the neural network unit, to a program memory a program executable by the neural network unit to instruct the neural network unit to perform the multiply-accumulate operations, the convolution and the pooling operations.
21. methods according to claim 14, it is characterised in that also include:
One is executed to the weight word for being received from the first memory and the data literal for being received from the second memory to take advantage of Method computing is producing a product, and stores to the accumulator for the product executes an accumulating operation, and to multiple weights Word repeats the multiplying and the accumulating operation with data literal;And
Is executed by a Selecting operation to store for one currency of the weight word and the accumulator that are received from the first memory Maximum is in the accumulator, and repeats to execute the Selecting operation to multiple weight words.
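The two per-NPU operation sequences of claim 21 — multiply-accumulate over weight/data pairs, and a repeated max-selection between incoming weight words and the accumulator — can be sketched as follows. This is a software model with invented names, not the hardware design:

```python
def multiply_accumulate(weights, data):
    """Multiply each weight word by its data word and accumulate the products."""
    acc = 0
    for w, d in zip(weights, data):
        acc += w * d                  # multiply operation, then accumulate
    return acc

def max_select(weights):
    """Repeatedly store the max of the weight word and the accumulator's value."""
    acc = float("-inf")               # accumulator's initial current value
    for w in weights:
        acc = max(w, acc)             # selection operation into the accumulator
    return acc

print(multiply_accumulate([1, 2, 3], [4, 5, 6]))  # 32
print(max_select([5, 12, 7]))                     # 12
```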
22. A computer program product encoded in at least one non-transitory computer-usable medium for use with a computing device, comprising:
computer-usable program code embodied in the medium for specifying a neural network unit, the computer-usable program code comprising:
first program code for specifying an array of N neural processing units (NPUs), each neural processing unit comprising:
an arithmetic unit and an accumulator; and
first and second multiplexed registers, having respective first and second outputs that are received by the arithmetic unit and by the corresponding first and second multiplexed registers of an adjacent neural processing unit, wherein the first multiplexed registers of the N neural processing units selectively operate collectively as a first N-word rotator and the second multiplexed registers selectively operate collectively as a second N-word rotator;
second program code for specifying a first memory that holds rows of N weight words and provides a row of N weight words to the corresponding ones of the N neural processing units; and
third program code for specifying a second memory that holds rows of N data words and provides a row of N data words to the corresponding ones of the N neural processing units;
wherein, when programmed, the neural network unit causes the array of neural processing units selectively to:
perform multiply-accumulate operations, using the second N-word rotator, on multiple rows of N weight words received from the first memory and on a row of N data words received from the second memory;
perform a convolution operation, using the first N-word rotator, on multiple rows of N weight words received from the first memory and on multiple rows of N data words received from the second memory, wherein the multiple rows of weight words are a data matrix and the multiple rows of data words are the elements of a convolution kernel; and
perform a pooling operation, using the first N-word rotator, on multiple rows of N weight words received from the first memory.
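The collective N-word rotator of claim 22 can be sketched in software: each multiplexed register can take its neighbor's output, so a row read once from memory is rotated one position per step while every NPU multiply-accumulates into its own accumulator. This is an illustrative model with invented names, not the hardware design:

```python
def rotate(row):
    # each mux-reg selects the adjacent NPU's output: a one-position rotation
    return row[1:] + row[:1]

def mac_with_rotator(weight_rows, data_row):
    """Multiply-accumulate multiple weight rows against one rotating data row."""
    n = len(data_row)
    accs = [0] * n                    # one accumulator per neural processing unit
    for weights in weight_rows:       # a new row of N weight words per step
        for i in range(n):
            accs[i] += weights[i] * data_row[i]
        data_row = rotate(data_row)   # collectively shift the N data words
    return accs

print(mac_with_rotator([[1, 1, 1, 1], [1, 1, 1, 1]], [1, 2, 3, 4]))  # [3, 5, 7, 5]
```

Because the data row is reloaded only once and then circulated among the NPUs, every unit sees every data word without further memory reads — the design choice the rotator enables.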
CN201610864609.2A 2015-10-08 2016-09-29 Multioperation neural network unit Active CN106503796B (en)

Applications Claiming Priority (48)

Application Number Priority Date Filing Date Title
US201562239254P 2015-10-08 2015-10-08
US62/239,254 2015-10-08
US201562262104P 2015-12-02 2015-12-02
US62/262,104 2015-12-02
US201662299191P 2016-02-24 2016-02-24
US62/299,191 2016-02-24
US15/090,712 2016-04-05
US15/090,794 2016-04-05
US15/090,807 US10380481B2 (en) 2015-10-08 2016-04-05 Neural network unit that performs concurrent LSTM cell calculations
US15/090,796 US10228911B2 (en) 2015-10-08 2016-04-05 Apparatus employing user-specified binary point fixed point arithmetic
US15/090,696 2016-04-05
US15/090,672 2016-04-05
US15/090,823 US10409767B2 (en) 2015-10-08 2016-04-05 Neural network unit with neural memory and array of neural processing units and sequencer that collectively shift row of data received from neural memory
US15/090,701 2016-04-05
US15/090,669 US10275394B2 (en) 2015-10-08 2016-04-05 Processor with architectural neural network execution unit
US15/090,801 2016-04-05
US15/090,807 2016-04-05
US15/090,701 US10474628B2 (en) 2015-10-08 2016-04-05 Processor with variable rate execution unit
US15/090,665 US10474627B2 (en) 2015-10-08 2016-04-05 Neural network unit with neural memory and array of neural processing units that collectively shift row of data received from neural memory
US15/090,796 2016-04-05
US15/090,798 2016-04-05
US15/090,696 US10380064B2 (en) 2015-10-08 2016-04-05 Neural network unit employing user-supplied reciprocal for normalizing an accumulated value
US15/090,691 US10387366B2 (en) 2015-10-08 2016-04-05 Neural network unit with shared activation function units
US15/090,727 US10776690B2 (en) 2015-10-08 2016-04-05 Neural network unit with plurality of selectable output functions
US15/090,823 2016-04-05
US15/090,727 2016-04-05
US15/090,665 2016-04-05
US15/090,801 US10282348B2 (en) 2015-10-08 2016-04-05 Neural network unit with output buffer feedback and masking capability
US15/090,708 2016-04-05
US15/090,798 US10585848B2 (en) 2015-10-08 2016-04-05 Processor with hybrid coprocessor/execution unit neural network unit
US15/090,829 2016-04-05
US15/090,678 2016-04-05
US15/090,666 2016-04-05
US15/090,814 2016-04-05
US15/090,712 US10366050B2 (en) 2015-10-08 2016-04-05 Multi-operation neural network unit
US15/090,691 2016-04-05
US15/090,672 US10353860B2 (en) 2015-10-08 2016-04-05 Neural network unit with neural processing units dynamically configurable to process multiple data sizes
US15/090,705 US10353861B2 (en) 2015-10-08 2016-04-05 Mechanism for communication between architectural program running on processor and non-architectural program running on execution unit of the processor regarding shared resource
US15/090,666 US10275393B2 (en) 2015-10-08 2016-04-05 Tri-configuration neural network unit
US15/090,708 US10346350B2 (en) 2015-10-08 2016-04-05 Direct execution by an execution unit of a micro-operation loaded into an architectural register file by an architectural instruction of a processor
US15/090,829 US10346351B2 (en) 2015-10-08 2016-04-05 Neural network unit with output buffer feedback and masking capability with processing unit groups that operate as recurrent neural network LSTM cells
US15/090,669 2016-04-05
US15/090,814 US10552370B2 (en) 2015-10-08 2016-04-05 Neural network unit with output buffer feedback for performing recurrent neural network computations
US15/090,722 US10671564B2 (en) 2015-10-08 2016-04-05 Neural network unit that performs convolutions using collective shift register among array of neural processing units
US15/090,722 2016-04-05
US15/090,794 US10353862B2 (en) 2015-10-08 2016-04-05 Neural network unit that performs stochastic rounding
US15/090,705 2016-04-05
US15/090,678 US10509765B2 (en) 2015-10-08 2016-04-05 Neural processing unit that selectively writes back to neural memory either activation function output or accumulator value

Publications (2)

Publication Number Publication Date
CN106503796A true CN106503796A (en) 2017-03-15
CN106503796B CN106503796B (en) 2019-02-12

Family

ID=57866772

Family Applications (15)

Application Number Title Priority Date Filing Date
CN201610864450.4A Active CN106355246B (en) 2015-10-08 2016-09-29 Tri-configuration neural network unit
CN201610864607.3A Active CN106445468B (en) 2015-10-08 2016-09-29 Direct execution by an execution unit of a micro-operation loaded into an architectural register file by an architectural instruction of a processor
CN201610866454.6A Active CN106447037B (en) 2015-10-08 2016-09-29 Neural network unit with multiple optional outputs
CN201610863911.6A Active CN106485321B (en) 2015-10-08 2016-09-29 Processor with architectural neural network execution unit
CN201610866453.1A Active CN106484362B (en) 2015-10-08 2016-09-29 Apparatus employing user-specified binary point fixed point arithmetic
CN201610864272.5A Active CN106447035B (en) 2015-10-08 2016-09-29 Processor with variable rate execution unit
CN201610864054.1A Active CN106528047B (en) 2015-10-08 2016-09-29 Processor, neural network unit, and method of operating the same
CN201610866451.2A Active CN106447036B (en) 2015-10-08 2016-09-29 Neural network unit that performs stochastic rounding
CN201610864609.2A Active CN106503796B (en) 2015-10-08 2016-09-29 Multi-operation neural network unit
CN201610863682.8A Active CN106485318B (en) 2015-10-08 2016-09-29 Processor with hybrid coprocessor/execution unit neural network unit
CN201610864055.6A Active CN106485315B (en) 2015-10-08 2016-09-29 Neural network unit with output buffer feedback and masking capability
CN201610864610.5A Active CN106503797B (en) 2015-10-08 2016-09-29 Neural network unit with neural memory and array of neural processing units that collectively shift rows of data received from the neural memory
CN201610864608.8A Active CN106485323B (en) 2015-10-08 2016-09-29 Neural network unit with output buffer feedback for performing recurrent neural network computations
CN201610866027.8A Active CN106485319B (en) 2015-10-08 2016-09-29 Neural network unit with neural processing units dynamically configurable to process multiple data sizes
CN201610864446.8A Active CN106485322B (en) 2015-10-08 2016-09-29 Neural network unit that performs concurrent LSTM cell calculations

Family Applications Before (8)

Application Number Title Priority Date Filing Date
CN201610864450.4A Active CN106355246B (en) 2015-10-08 2016-09-29 Tri-configuration neural network unit
CN201610864607.3A Active CN106445468B (en) 2015-10-08 2016-09-29 Direct execution by an execution unit of a micro-operation loaded into an architectural register file by an architectural instruction of a processor
CN201610866454.6A Active CN106447037B (en) 2015-10-08 2016-09-29 Neural network unit with multiple optional outputs
CN201610863911.6A Active CN106485321B (en) 2015-10-08 2016-09-29 Processor with architectural neural network execution unit
CN201610866453.1A Active CN106484362B (en) 2015-10-08 2016-09-29 Apparatus employing user-specified binary point fixed point arithmetic
CN201610864272.5A Active CN106447035B (en) 2015-10-08 2016-09-29 Processor with variable rate execution unit
CN201610864054.1A Active CN106528047B (en) 2015-10-08 2016-09-29 Processor, neural network unit, and method of operating the same
CN201610866451.2A Active CN106447036B (en) 2015-10-08 2016-09-29 Neural network unit that performs stochastic rounding

Family Applications After (6)

Application Number Title Priority Date Filing Date
CN201610863682.8A Active CN106485318B (en) 2015-10-08 2016-09-29 Processor with hybrid coprocessor/execution unit neural network unit
CN201610864055.6A Active CN106485315B (en) 2015-10-08 2016-09-29 Neural network unit with output buffer feedback and masking capability
CN201610864610.5A Active CN106503797B (en) 2015-10-08 2016-09-29 Neural network unit with neural memory and array of neural processing units that collectively shift rows of data received from the neural memory
CN201610864608.8A Active CN106485323B (en) 2015-10-08 2016-09-29 Neural network unit with output buffer feedback for performing recurrent neural network computations
CN201610866027.8A Active CN106485319B (en) 2015-10-08 2016-09-29 Neural network unit with neural processing units dynamically configurable to process multiple data sizes
CN201610864446.8A Active CN106485322B (en) 2015-10-08 2016-09-29 Neural network unit that performs concurrent LSTM cell calculations

Country Status (1)

Country Link
CN (15) CN106355246B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018185765A1 (en) * 2017-04-04 2018-10-11 Hailo Technologies Ltd. Neural network processor incorporating inter-device connectivity
CN110231958A (en) * 2017-08-31 2019-09-13 北京中科寒武纪科技有限公司 Matrix-vector multiplication operation method and device
CN110462637A (en) * 2017-03-24 2019-11-15 华为技术有限公司 Neural Network Data processing unit and method
CN111610963A (en) * 2020-06-24 2020-09-01 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
TWI751931B (en) * 2020-05-04 2022-01-01 神盾股份有限公司 Processing device and processing method for executing convolution neural network computation
US11221929B1 (en) 2020-09-29 2022-01-11 Hailo Technologies Ltd. Data stream fault detection mechanism in an artificial neural network processor
US11237894B1 (en) 2020-09-29 2022-02-01 Hailo Technologies Ltd. Layer control unit instruction addressing safety mechanism in an artificial neural network processor
US11238334B2 (en) 2017-04-04 2022-02-01 Hailo Technologies Ltd. System and method of input alignment for efficient vector operations in an artificial neural network
US11263077B1 (en) 2020-09-29 2022-03-01 Hailo Technologies Ltd. Neural network intermediate results safety mechanism in an artificial neural network processor
US11544545B2 (en) 2017-04-04 2023-01-03 Hailo Technologies Ltd. Structured activation based sparsity in an artificial neural network
US11551028B2 (en) 2017-04-04 2023-01-10 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network
US11615297B2 (en) 2017-04-04 2023-03-28 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network compiler
US11811421B2 (en) 2020-09-29 2023-11-07 Hailo Technologies Ltd. Weights safety mechanism in an artificial neural network processor
US11874900B2 (en) 2020-09-29 2024-01-16 Hailo Technologies Ltd. Cluster interlayer safety mechanism in an artificial neural network processor

Families Citing this family (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11226840B2 (en) 2015-10-08 2022-01-18 Shanghai Zhaoxin Semiconductor Co., Ltd. Neural network unit that interrupts processing core upon condition
US11221872B2 (en) 2015-10-08 2022-01-11 Shanghai Zhaoxin Semiconductor Co., Ltd. Neural network unit that interrupts processing core upon condition
US10474628B2 (en) * 2015-10-08 2019-11-12 Via Alliance Semiconductor Co., Ltd. Processor with variable rate execution unit
JP6556768B2 (en) * 2017-01-25 2019-08-07 株式会社東芝 Multiply-accumulator, network unit and network device
IT201700008949A1 (en) * 2017-01-27 2018-07-27 St Microelectronics Srl OPERATING PROCEDURE FOR NEURAL NETWORKS, NETWORK, EQUIPMENT AND CORRESPONDENT COMPUTER PRODUCT
US11663450B2 (en) * 2017-02-28 2023-05-30 Microsoft Technology Licensing, Llc Neural network processing with chained instructions
US10896367B2 (en) * 2017-03-07 2021-01-19 Google Llc Depth concatenation using a matrix computation unit
CN107633298B (en) * 2017-03-10 2021-02-05 南京风兴科技有限公司 Hardware architecture of recurrent neural network accelerator based on model compression
CN108629405B (en) * 2017-03-22 2020-09-18 杭州海康威视数字技术股份有限公司 Method and device for improving calculation efficiency of convolutional neural network
CN107423816B (en) * 2017-03-24 2021-10-12 中国科学院计算技术研究所 Multi-calculation-precision neural network processing method and system
US11537851B2 (en) 2017-04-07 2022-12-27 Intel Corporation Methods and systems using improved training and learning for deep neural networks
CN108564169B (en) * 2017-04-11 2020-07-14 上海兆芯集成电路有限公司 Hardware processing unit, neural network unit, and computer usable medium
US10795836B2 (en) * 2017-04-17 2020-10-06 Microsoft Technology Licensing, Llc Data processing performance enhancement for neural networks using a virtualized data iterator
CN107679620B (en) * 2017-04-19 2020-05-26 赛灵思公司 Artificial neural network processing device
JP6865847B2 (en) 2017-04-19 2021-04-28 シャンハイ カンブリコン インフォメーション テクノロジー カンパニー リミテッドShanghai Cambricon Information Technology Co.,Ltd. Processing equipment, chips, electronic equipment and methods
CN107679621B (en) 2017-04-19 2020-12-08 赛灵思公司 Artificial neural network processing device
CN108733408A (en) * 2017-04-21 2018-11-02 上海寒武纪信息科技有限公司 Counting device and method of counting
CN108734281A (en) * 2017-04-21 2018-11-02 上海寒武纪信息科技有限公司 Processing unit, processing method, chip and electronic device
CN108734288B (en) * 2017-04-21 2021-01-29 上海寒武纪信息科技有限公司 Operation method and device
CN107704922B (en) * 2017-04-19 2020-12-08 赛灵思公司 Artificial neural network processing device
EP3699826A1 (en) * 2017-04-20 2020-08-26 Shanghai Cambricon Information Technology Co., Ltd Operation device and related products
CN108805275B (en) * 2017-06-16 2021-01-22 上海兆芯集成电路有限公司 Programmable device, method of operation thereof, and computer usable medium
CN108804139B (en) * 2017-06-16 2020-10-20 上海兆芯集成电路有限公司 Programmable device, method of operation thereof, and computer usable medium
CN110443360B (en) * 2017-06-16 2021-08-06 上海兆芯集成电路有限公司 Method for operating a processor
CN107608715B (en) * 2017-07-20 2020-07-03 上海寒武纪信息科技有限公司 Apparatus and method for performing artificial neural network forward operations
US10167800B1 (en) * 2017-08-18 2019-01-01 Microsoft Technology Licensing, Llc Hardware node having a matrix vector unit with block-floating point processing
US20190102197A1 (en) * 2017-10-02 2019-04-04 Samsung Electronics Co., Ltd. System and method for merging divide and multiply-subtract operations
US11222256B2 (en) * 2017-10-17 2022-01-11 Xilinx, Inc. Neural network processing system having multiple processors and a neural network accelerator
GB2568230B (en) * 2017-10-20 2020-06-03 Graphcore Ltd Processing in neural networks
CN109726809B (en) * 2017-10-30 2020-12-08 赛灵思公司 Hardware implementation circuit of deep learning softmax classifier and control method thereof
GB2568081B (en) * 2017-11-03 2022-01-19 Imagination Tech Ltd End-to-end data format selection for hardware implementation of deep neural network
CN109961137B (en) * 2017-12-14 2020-10-09 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN111160542B (en) * 2017-12-14 2023-08-29 中科寒武纪科技股份有限公司 Integrated circuit chip device and related products
CN108021537B (en) * 2018-01-05 2022-09-16 南京大学 Softmax function calculation method based on hardware platform
CN108304925B (en) * 2018-01-08 2020-11-03 中国科学院计算技术研究所 Pooling computing device and method
KR102637735B1 (en) * 2018-01-09 2024-02-19 삼성전자주식회사 Neural network processing unit including approximate multiplier and system on chip including the same
CN110045960B (en) * 2018-01-16 2022-02-18 腾讯科技(深圳)有限公司 Chip-based instruction set processing method and device and storage medium
CN108288091B (en) * 2018-01-19 2020-09-11 上海兆芯集成电路有限公司 Microprocessor for booth multiplication
CN108416431B (en) * 2018-01-19 2021-06-01 上海兆芯集成电路有限公司 Neural network microprocessor and macroinstruction processing method
CN108304265B (en) * 2018-01-23 2022-02-01 腾讯科技(深圳)有限公司 Memory management method, device and storage medium
CN108416434B (en) * 2018-02-07 2021-06-04 复旦大学 Circuit structure for accelerating convolutional layer and full-connection layer of neural network
CN110163363B (en) * 2018-02-13 2021-05-11 上海寒武纪信息科技有限公司 Computing device and method
CN110222833B (en) * 2018-03-01 2023-12-19 华为技术有限公司 Data processing circuit for neural network
CN108171328B (en) * 2018-03-02 2020-12-29 中国科学院计算技术研究所 Neural network processor and convolution operation method executed by same
CN108510065A (en) * 2018-03-30 2018-09-07 中国科学院计算技术研究所 Computing device and computational methods applied to long Memory Neural Networks in short-term
US10621489B2 (en) 2018-03-30 2020-04-14 International Business Machines Corporation Massively parallel neural inference computing elements
CN108829610B (en) * 2018-04-02 2020-08-04 浙江大华技术股份有限公司 Memory management method and device in neural network forward computing process
TWI695386B (en) * 2018-07-17 2020-06-01 旺宏電子股份有限公司 Semiconductor circuit and operating method for the same
US20200065676A1 (en) * 2018-08-22 2020-02-27 National Tsing Hua University Neural network method, system, and computer program product with inference-time bitwidth flexibility
US10956814B2 (en) * 2018-08-27 2021-03-23 Silicon Storage Technology, Inc. Configurable analog neural memory system for deep learning neural network
CN110865792B (en) * 2018-08-28 2021-03-19 中科寒武纪科技股份有限公司 Data preprocessing method and device, computer equipment and storage medium
CN109376853B (en) * 2018-10-26 2021-09-24 电子科技大学 Echo state neural network output axon circuit
CN109272109B (en) * 2018-10-30 2020-07-17 北京地平线机器人技术研发有限公司 Instruction scheduling method and device of neural network model
JP6528893B1 (en) * 2018-11-07 2019-06-12 富士通株式会社 Learning program, learning method, information processing apparatus
CN109739556B (en) * 2018-12-13 2021-03-26 北京空间飞行器总体设计部 General deep learning processor based on multi-parallel cache interaction and calculation
CN109670158B (en) * 2018-12-27 2023-09-29 北京及客科技有限公司 Method and device for generating text content according to information data
CN109711367B (en) * 2018-12-29 2020-03-06 中科寒武纪科技股份有限公司 Operation method, device and related product
CN112013506B (en) * 2019-05-31 2022-02-25 青岛海尔空调电子有限公司 Communication detection method and device and air conditioner
CN110489077B (en) * 2019-07-23 2021-12-31 瑞芯微电子股份有限公司 Floating point multiplication circuit and method of neural network accelerator
CN110717588B (en) * 2019-10-15 2022-05-03 阿波罗智能技术(北京)有限公司 Apparatus and method for convolution operation
CN111124991A (en) * 2019-12-27 2020-05-08 中国电子科技集团公司第四十七研究所 Reconfigurable microprocessor system and method based on interconnection of processing units
US11663455B2 (en) * 2020-02-12 2023-05-30 Ememory Technology Inc. Resistive random-access memory cell and associated cell array structure
CN111666077B (en) * 2020-04-13 2022-02-25 北京百度网讯科技有限公司 Operator processing method and device, electronic equipment and storage medium
CN112966729B (en) * 2021-02-26 2023-01-31 成都商汤科技有限公司 Data processing method and device, computer equipment and storage medium
CN115600062B (en) * 2022-12-14 2023-04-07 深圳思谋信息科技有限公司 Convolution processing method, circuit, electronic device and computer readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1716230A (en) * 2004-06-30 2006-01-04 富士通株式会社 Arithmetic unit and operation apparatus control method
CN102708665A (en) * 2012-06-04 2012-10-03 深圳市励创微电子有限公司 Broadband code signal detection circuit and wireless remote signal decoding circuit thereof

Family Cites Families (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4876660A (en) * 1987-03-20 1989-10-24 Bipolar Integrated Technology, Inc. Fixed-point multiplier-accumulator architecture
US5047973A (en) * 1989-04-26 1991-09-10 Texas Instruments Incorporated High speed numerical processor for performing a plurality of numeric functions
GB9206126D0 (en) * 1992-03-20 1992-05-06 Maxys Circuit Technology Limit Parallel vector processor architecture
US5517667A (en) * 1993-06-14 1996-05-14 Motorola, Inc. Neural network that does not require repetitive training
US5583964A (en) * 1994-05-02 1996-12-10 Motorola, Inc. Computer utilizing neural network and method of using same
CN1128375A (en) * 1995-06-27 1996-08-07 电子科技大学 Parallel computer programmable through communication system and its method
US5956703A (en) * 1995-07-28 1999-09-21 Delco Electronics Corporation Configurable neural network integrated circuit
RU2131145C1 (en) * 1998-06-16 1999-05-27 Закрытое акционерное общество Научно-технический центр "Модуль" Neural processor, device for calculation of saturation functions, calculating unit and adder
GB9902115D0 (en) * 1999-02-01 1999-03-24 Axeon Limited Neural networks
US6651204B1 (en) * 2000-06-01 2003-11-18 Advantest Corp. Modular architecture for memory testing on event based test system
CA2419063A1 (en) * 2000-08-09 2002-02-14 Skybitz, Inc. Frequency translator using a cordic phase rotator
JP2005532756A (en) * 2002-07-10 2005-10-27 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Electronic circuit having an array of programmable logic cells
TWI220042B (en) * 2002-08-22 2004-08-01 Ip First Llc Non-temporal memory reference control mechanism
US7139785B2 (en) * 2003-02-11 2006-11-21 Ip-First, Llc Apparatus and method for reducing sequential bit correlation in a random number generator
GB2402764B (en) * 2003-06-13 2006-02-22 Advanced Risc Mach Ltd Instruction encoding within a data processing apparatus having multiple instruction sets
MY138544A (en) * 2003-06-26 2009-06-30 Neuramatix Sdn Bhd Neural networks with learning and expression capability
AT413895B (en) * 2003-09-08 2006-07-15 On Demand Informationstechnolo DIGITAL SIGNAL PROCESSING DEVICE
US7401179B2 (en) * 2005-01-21 2008-07-15 Infineon Technologies Ag Integrated circuit including a memory having low initial latency
US8571340B2 (en) * 2006-06-26 2013-10-29 Qualcomm Incorporated Efficient fixed-point approximations of forward and inverse discrete cosine transforms
US7543013B2 (en) * 2006-08-18 2009-06-02 Qualcomm Incorporated Multi-stage floating-point accumulator
US9223751B2 (en) * 2006-09-22 2015-12-29 Intel Corporation Performing rounding operations responsive to an instruction
CN101178644B (en) * 2006-11-10 2012-01-25 上海海尔集成电路有限公司 Microprocessor structure based on complex instruction set computer architecture
US20080140753A1 (en) * 2006-12-08 2008-06-12 Vinodh Gopal Multiplier
JP2009042898A (en) * 2007-08-07 2009-02-26 Seiko Epson Corp Parallel arithmetic unit and parallel operation method
US20090160863A1 (en) * 2007-12-21 2009-06-25 Michael Frank Unified Processor Architecture For Processing General and Graphics Workload
CN101482924B (en) * 2008-01-08 2012-01-04 华晶科技股份有限公司 Automatic identifying and correcting method for business card display angle
JP4513865B2 (en) * 2008-01-25 2010-07-28 セイコーエプソン株式会社 Parallel computing device and parallel computing method
CN101246200B (en) * 2008-03-10 2010-08-04 湖南大学 Analog PCB intelligent test system based on neural network
JP5481793B2 (en) * 2008-03-21 2014-04-23 富士通株式会社 Arithmetic processing device and method of controlling the same
US8131984B2 (en) * 2009-02-12 2012-03-06 Via Technologies, Inc. Pipelined microprocessor with fast conditional branch instructions based on static serializing instruction state
US8533437B2 (en) * 2009-06-01 2013-09-10 Via Technologies, Inc. Guaranteed prefetch instruction
CN101944012B (en) * 2009-08-07 2014-04-23 威盛电子股份有限公司 Instruction processing method and superscalar pipeline microprocessor
US8879632B2 (en) * 2010-02-18 2014-11-04 Qualcomm Incorporated Fixed point implementation for geometric motion partitioning
CN101795344B (en) * 2010-03-02 2013-03-27 北京大学 Digital hologram compression method and system, decoding method and system, and transmission method and system
CN102163139B (en) * 2010-04-27 2014-04-02 威盛电子股份有限公司 Microprocessor fusing loading arithmetic/logic operation and skip macroinstructions
US8726130B2 (en) * 2010-06-01 2014-05-13 Greenliant Llc Dynamic buffer management in a NAND memory controller to minimize age related performance degradation due to error correction
CN101882238B (en) * 2010-07-15 2012-02-22 长安大学 Wavelet neural network processor based on SOPC (System On a Programmable Chip)
CN101916177B (en) * 2010-07-26 2012-06-27 清华大学 Configurable multi-precision fixed point multiplying and adding device
CN201927073U (en) * 2010-11-25 2011-08-10 福建师范大学 Programmable hardware BP (back propagation) neuron processor
US8880851B2 (en) * 2011-04-07 2014-11-04 Via Technologies, Inc. Microprocessor that performs X86 ISA and arm ISA machine language program instructions by hardware translation into microinstructions executed by common execution pipeline
US9092729B2 (en) * 2011-08-11 2015-07-28 Greenray Industries, Inc. Trim effect compensation using an artificial neural network
DE102011081197A1 (en) * 2011-08-18 2013-02-21 Siemens Aktiengesellschaft Method for the computer-aided modeling of a technical system
KR20130090147A (en) * 2012-02-03 2013-08-13 안병익 Neural network computing apparatus and system, and method thereof
US9082078B2 (en) * 2012-07-27 2015-07-14 The Intellisis Corporation Neural processing engine and architecture using the same
US20140143191A1 (en) * 2012-11-20 2014-05-22 Qualcomm Incorporated Piecewise linear neuron modeling
CN103019656B (en) * 2012-12-04 2016-04-27 中国科学院半导体研究所 The multistage parallel single instruction multiple data array processing system of dynamic reconstruct
US20140279772A1 (en) * 2013-03-13 2014-09-18 Baker Hughes Incorporated Neuronal networks for controlling downhole processes
JP6094356B2 (en) * 2013-04-22 2017-03-15 富士通株式会社 Arithmetic processing unit
CN103236997A (en) * 2013-05-03 2013-08-07 福建京奥通信技术有限公司 Long term evolution-interference cancellation system (LTE-ICS) and method
CN103677739B (en) * 2013-11-28 2016-08-17 中国航天科技集团公司第九研究院第七七一研究所 Configurable multiply-accumulate arithmetic unit and multiply-accumulate computing array composed thereof
CN104809498B (en) * 2014-01-24 2018-02-13 清华大学 Brain-like coprocessor based on neuromorphic circuits

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1716230A (en) * 2004-06-30 2006-01-04 富士通株式会社 Arithmetic unit and operation apparatus control method
CN102708665A (en) * 2012-06-04 2012-10-03 深圳市励创微电子有限公司 Broadband code signal detection circuit and wireless remote signal decoding circuit thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DONGSUN KIM et al.: "A SIMD Neural Network Processor for Image Processing", Network and Parallel Computing *
ROBERT W. MEANS et al.: "Extensible Linear Floating Point SIMD Neurocomputer Array Processor", IJCNN-91-Seattle International Joint Conference on Neural Networks *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110462637A (en) * 2017-03-24 2019-11-15 华为技术有限公司 Neural Network Data processing unit and method
US11615297B2 (en) 2017-04-04 2023-03-28 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network compiler
US11216717B2 (en) 2017-04-04 2022-01-04 Hailo Technologies Ltd. Neural network processor incorporating multi-level hierarchical aggregated computing and memory elements
US11461615B2 (en) 2017-04-04 2022-10-04 Hailo Technologies Ltd. System and method of memory access of multi-dimensional data
WO2018185762A1 (en) * 2017-04-04 2018-10-11 Hailo Technologies Ltd. Neural network processor incorporating multi-level hierarchical aggregated computing and memory elements
US11551028B2 (en) 2017-04-04 2023-01-10 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network
WO2018185765A1 (en) * 2017-04-04 2018-10-11 Hailo Technologies Ltd. Neural network processor incorporating inter-device connectivity
US10387298B2 (en) 2017-04-04 2019-08-20 Hailo Technologies Ltd Artificial neural network incorporating emphasis and focus techniques
US11514291B2 (en) 2017-04-04 2022-11-29 Hailo Technologies Ltd. Neural network processing element incorporating compute and local memory elements
US11675693B2 (en) 2017-04-04 2023-06-13 Hailo Technologies Ltd. Neural network processor incorporating inter-device connectivity
US11238331B2 (en) 2017-04-04 2022-02-01 Hailo Technologies Ltd. System and method for augmenting an existing artificial neural network
US11238334B2 (en) 2017-04-04 2022-02-01 Hailo Technologies Ltd. System and method of input alignment for efficient vector operations in an artificial neural network
US11544545B2 (en) 2017-04-04 2023-01-03 Hailo Technologies Ltd. Structured activation based sparsity in an artificial neural network
US11263512B2 (en) 2017-04-04 2022-03-01 Hailo Technologies Ltd. Neural network processor incorporating separate control and data fabric
US11354563B2 (en) 2017-04-04 2022-06-07 Hailo Technologies Ltd. Configurable and programmable sliding window based memory access in a neural network processor
US11461614B2 (en) 2017-04-04 2022-10-04 Hailo Technologies Ltd. Data driven quantization optimization of weights and input data in an artificial neural network
CN110231958A (en) * 2017-08-31 2019-09-13 北京中科寒武纪科技有限公司 Matrix-vector multiplication operation method and device
TWI751931B (en) * 2020-05-04 2022-01-01 神盾股份有限公司 Processing device and processing method for executing convolution neural network computation
CN111610963A (en) * 2020-06-24 2020-09-01 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
US11263077B1 (en) 2020-09-29 2022-03-01 Hailo Technologies Ltd. Neural network intermediate results safety mechanism in an artificial neural network processor
US11237894B1 (en) 2020-09-29 2022-02-01 Hailo Technologies Ltd. Layer control unit instruction addressing safety mechanism in an artificial neural network processor
US11221929B1 (en) 2020-09-29 2022-01-11 Hailo Technologies Ltd. Data stream fault detection mechanism in an artificial neural network processor
US11811421B2 (en) 2020-09-29 2023-11-07 Hailo Technologies Ltd. Weights safety mechanism in an artificial neural network processor
US11874900B2 (en) 2020-09-29 2024-01-16 Hailo Technologies Ltd. Cluster interlayer safety mechanism in an artificial neural network processor

Also Published As

Publication number Publication date
CN106485323B (en) 2019-02-26
CN106503797B (en) 2019-03-15
CN106485322A (en) 2017-03-08
CN106485323A (en) 2017-03-08
CN106447036A (en) 2017-02-22
CN106528047A (en) 2017-03-22
CN106503797A (en) 2017-03-15
CN106447037A (en) 2017-02-22
CN106503796B (en) 2019-02-12
CN106447035B (en) 2019-02-26
CN106485318B (en) 2019-08-30
CN106445468B (en) 2019-03-15
CN106485315A (en) 2017-03-08
CN106447035A (en) 2017-02-22
CN106484362B (en) 2020-06-12
CN106485315B (en) 2019-06-04
CN106445468A (en) 2017-02-22
CN106485319B (en) 2019-02-12
CN106485321B (en) 2019-02-12
CN106484362A (en) 2017-03-08
CN106485318A (en) 2017-03-08
CN106485319A (en) 2017-03-08
CN106485321A (en) 2017-03-08
CN106447037B (en) 2019-02-12
CN106447036B (en) 2019-03-15
CN106528047B (en) 2019-04-09
CN106485322B (en) 2019-02-26
CN106355246B (en) 2019-02-15
CN106355246A (en) 2017-01-25

Similar Documents

Publication Publication Date Title
CN106485323B (en) Neural network unit with output buffer feedback for performing time recurrent neural network computations
CN106599990B (en) Neural network unit with neural memory and array of neural processing units that collectively shift rows of data received from the neural memory
CN107844830A (en) Neural network unit with mixed data size and weight size computation capability
CN108268944A (en) Neural network unit with re-shapeable memory
CN108268932A (en) Neural network unit
CN108268946A (en) Neural network unit with array-width-segmentable rotator
CN108268945A (en) Neural network unit with array-width-segmentable rotator
CN108133263A (en) Neural network unit
CN108133264A (en) Neural network unit that performs efficient 3-dimensional convolutions
CN108133262A (en) Neural network unit with memory layout for performing efficient 3-dimensional convolutions
CN108564169A (en) Hardware processing element, neural network unit and computer usable medium
CN108804139A (en) Programmable device and its operating method and computer usable medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: Room 301, 2537 Jinke Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201203

Patentee after: Shanghai Zhaoxin Semiconductor Co.,Ltd.

Address before: Room 301, 2537 Jinke Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201203

Patentee before: VIA ALLIANCE SEMICONDUCTOR Co.,Ltd.
