CN106599991A - Neural network unit with neural memory and array of neural processing units that collectively shift row of data received from neural memory - Google Patents
- Publication number
- CN106599991A CN106599991A CN201610866129.XA CN201610866129A CN106599991A CN 106599991 A CN106599991 A CN 106599991A CN 201610866129 A CN201610866129 A CN 201610866129A CN 106599991 A CN106599991 A CN 106599991A
- Authority
- CN
- China
- Prior art keywords
- data
- processing unit
- row
- input
- random access
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
- G06F7/575—Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3893—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/468—Specific access rights for resources, e.g. using capability register
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
An array of N processing units (PUs) each has: an accumulator; an arithmetic unit that performs an operation on first, second, and third inputs to generate a result to store in the accumulator, the first input receiving the accumulator output; a weight input received by the second input to the arithmetic unit; and a multiplexed register having first and second data inputs, an output received by the third input to the arithmetic unit, and a control input that controls the data input selection. The multiplexed register output is also received by the second data input of an adjacent PU's multiplexed register. The N PUs' multiplexed registers collectively operate as an N-word rotator when the control input specifies the second data input. Respective first and second memories hold W rows of N weight words and D rows of N data words, and provide the N weight and data words to the corresponding weight inputs and multiplexed register first data inputs of the N PUs.
Description
Technical field
The present invention relates to a processor, and more particularly to a processor that improves the computational performance and efficiency of artificial neural networks.
This application claims priority to the following U.S. provisional applications, each of which is incorporated herein by reference.
This application is related to the following concurrently filed U.S. applications, each of which is incorporated herein by reference.
Background
In recent years, artificial neural networks (ANNs) have again attracted attention. This research is commonly referred to as deep learning, computer learning, and similar terms. The increase in the computational power of general-purpose processors has also revived interest in artificial neural networks after many decades. Recent applications of artificial neural networks include speech and image recognition, among others. The demand for improved computational performance and efficiency of artificial neural networks appears to be increasing.
Summary of the invention
In view of this, the present invention provides an apparatus that includes an array of N processing units, a first memory, and a second memory. Each processing unit in the array includes an accumulator, an arithmetic unit, a weight input, and a multiplexed register. The accumulator has an output. The arithmetic unit has first, second, and third inputs and performs an operation on them to generate a result for storage in the accumulator; the first input receives the output of the accumulator. The weight input is received by the second input of the arithmetic unit. The multiplexed register has first and second data inputs, an output, and a control input; the output of the multiplexed register is received by the third input of the arithmetic unit, and the control input controls the selection between the first and second data inputs. The output of the multiplexed register is also received by the second data input of the multiplexed register of the adjacent processing unit, so that when the control input selects the second data input, the multiplexed registers of the N processing units collectively operate as an N-word rotator. The first memory holds W rows of N weight words and provides the N weight words of one of the W rows to the corresponding weight inputs of the N processing units of the array. The second memory holds D rows of N data words and provides the N data words of one of the D rows to the corresponding first data inputs of the multiplexed registers of the N processing units of the array.
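The arrangement described above — N multiplexed registers that either load a row from the data memory or each take the adjacent unit's output, while every arithmetic unit accumulates into its accumulator — can be sketched in software. The following is an illustrative model only, not the patent's implementation: the class name `NPUArray`, the rotation direction, and the choice of multiply-accumulate as the arithmetic operation are our assumptions for the sketch.

```python
class NPUArray:
    """Toy model of an array of N processing units (PUs).

    Each PU has an accumulator and a multiplexed register (mux-reg).
    The mux-reg either loads a word from a data-memory row (first data
    input) or receives the mux-reg output of the adjacent PU (second
    data input); in the latter mode the N mux-regs collectively form
    an N-word rotator.
    """

    def __init__(self, n):
        self.n = n
        self.acc = [0] * n       # accumulator of each PU
        self.mux_reg = [0] * n   # multiplexed register output of each PU

    def load_row(self, data_row):
        """Control input selects the first data input: load a data row."""
        assert len(data_row) == self.n
        self.mux_reg = list(data_row)

    def rotate(self):
        """Control input selects the second data input: each mux-reg
        takes its neighbor's output, rotating the row by one word."""
        self.mux_reg = self.mux_reg[1:] + self.mux_reg[:1]

    def mac(self, weight_row):
        """Each arithmetic unit multiplies its weight word by its
        mux-reg word and adds the product into its accumulator."""
        for i in range(self.n):
            self.acc[i] += weight_row[i] * self.mux_reg[i]


# Rotating one data row through all N PUs, with one weight row per step,
# lets every PU accumulate a full dot product, as a neuron layer would:
arr = NPUArray(4)
data = [1, 2, 3, 4]              # one row of the data memory
weights = [[10, 20, 30, 40]] * 4  # four rows of the weight memory
arr.load_row(data)
arr.mac(weights[0])
for w in weights[1:]:
    arr.rotate()
    arr.mac(w)
print(arr.acc)  # -> [100, 200, 300, 400]
```

Note the key property the rotator provides: each data word visits every processing unit without being re-read from memory, so one memory row read feeds N multiply-accumulate steps.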
The present invention also provides a processor. The processor includes an instruction set, an array of N processing units, a first memory, and a second memory. The instruction set has architectural instructions that direct the operation of the processor. Each processing unit in the array includes an accumulator, an arithmetic unit, a weight input, and a multiplexed register. The accumulator has an output. The arithmetic unit has first, second, and third inputs and performs an operation on them to generate a result for storage in the accumulator; the first input receives the output of the accumulator. The weight input is received by the second input of the arithmetic unit. The multiplexed register has first and second data inputs, an output, and a control input; the output of the multiplexed register is received by the third input of the arithmetic unit, and the control input controls the selection between the first and second data inputs. The output of the multiplexed register is also received by the second data input of the multiplexed register of the adjacent processing unit, so that when the control input selects the second data input, the multiplexed registers of the N processing units collectively operate as an N-word rotator. The first memory holds W rows of N weight words and provides the N weight words of one of the W rows to the corresponding weight inputs of the N processing units of the array. The second memory holds D rows of N data words and provides the N data words of one of the D rows to the corresponding first data inputs of the multiplexed registers of the N processing units of the array.
The present invention also provides a computer program product encoded in at least one non-transitory computer-usable medium for use with a computing device. The computer program product includes computer-usable program code embodied in the medium, which includes first, second, and third program code. The first program code specifies an array of N processing units. Each processing unit in the array includes an accumulator, an arithmetic unit, a weight input, and a multiplexed register. The accumulator has an output. The arithmetic unit has first, second, and third inputs and performs an operation on them to generate a result for storage in the accumulator; the first input receives the output of the accumulator. The weight input is received by the second input of the arithmetic unit. The multiplexed register has first and second data inputs, an output, and a control input; the output of the multiplexed register is received by the third input of the arithmetic unit, and the control input controls the selection between the first and second data inputs. The output of the multiplexed register is also received by the second data input of the multiplexed register of the adjacent processing unit, so that when the control input selects the second data input, the multiplexed registers of the N processing units collectively operate as an N-word rotator. The second program code specifies a first memory that holds W rows of N weight words and provides the N weight words of one of the W rows to the corresponding weight inputs of the N processing units of the array. The third program code specifies a second memory that holds D rows of N data words and provides the N data words of one of the D rows to the corresponding first data inputs of the multiplexed registers of the N processing units of the array.
Specific embodiments of the present invention are further described below by way of the following examples and drawings.
Description of the drawings
Fig. 1 is a block diagram illustrating a processor that includes a neural network unit (NNU).
Fig. 2 is a block diagram illustrating a neural processing unit (NPU) of Fig. 1.
Fig. 3 is a block diagram illustrating the use of the N multiplexed registers of the N neural processing units of the neural network unit of Fig. 1 to perform an N-word rotator, or circular shifter, operation on a row of data words received from the data random access memory of Fig. 1.
Fig. 4 is a table illustrating a program stored in the program memory of the neural network unit of Fig. 1 and executed by that neural network unit.
Fig. 5 is a timing diagram illustrating the execution of the program of Fig. 4 by the neural network unit.
Fig. 6A is a block diagram illustrating the execution of the program of Fig. 4 by the neural network unit of Fig. 1.
Fig. 6B is a flowchart illustrating the operation of the processor of Fig. 1 executing an architectural program that uses the neural network unit to perform the classic multiply-accumulate-activation-function computations of neurons of a hidden layer of an artificial neural network, such as are performed by the program of Fig. 4.
Fig. 7 is a block diagram illustrating another embodiment of the neural processing unit of Fig. 1.
Fig. 8 is a block diagram illustrating a further embodiment of the neural processing unit of Fig. 1.
Fig. 9 is a table illustrating a program stored in the program memory of the neural network unit of Fig. 1 and executed by that neural network unit.
Fig. 10 is a timing diagram illustrating the execution of the program of Fig. 9 by the neural network unit.
Fig. 11 is a block diagram illustrating an embodiment of the neural network unit of Fig. 1. In the embodiment of Fig. 11, a neuron is split into two parts, the activation function unit part and the ALU part (which also includes the shift register part), and each activation function unit part is shared by multiple ALU parts.
Fig. 12 is a timing diagram illustrating the execution of the program of Fig. 4 by the neural network unit of Fig. 11.
Fig. 13 is a timing diagram illustrating the execution of the program of Fig. 4 by the neural network unit of Fig. 11.
Fig. 14 is a block diagram illustrating a move to neural network (MTNN) architectural instruction and its operation with respect to portions of the neural network unit of Fig. 1.
Fig. 15 is a block diagram illustrating a move from neural network (MFNN) architectural instruction and its operation with respect to portions of the neural network unit of Fig. 1.
Fig. 16 is a block diagram illustrating an embodiment of the data random access memory of Fig. 1.
Fig. 17 is a block diagram illustrating an embodiment of the weight random access memory and buffer of Fig. 1.
Fig. 18 is a block diagram illustrating a dynamically configurable neural processing unit of Fig. 1.
Fig. 19 is a block diagram illustrating, according to the embodiment of Fig. 18, the use of the 2N multiplexed registers of the N neural processing units of the neural network unit of Fig. 1 to perform a rotator operation on a row of data words received from the data random access memory of Fig. 1.
Fig. 20 is a table illustrating a program stored in the program memory of the neural network unit of Fig. 1 and executed by that neural network unit, which has neural processing units according to the embodiment of Fig. 18.
Fig. 21 is a timing diagram illustrating the execution of the program of Fig. 20 by the neural network unit, which has neural processing units of Fig. 18 operating in a narrow configuration.
Fig. 22 is a block diagram illustrating the neural network unit of Fig. 1 with the neural processing units of Fig. 18 executing the program of Fig. 20.
Fig. 23 is a block diagram illustrating another embodiment of a dynamically configurable neural processing unit of Fig. 1.
Fig. 24 is a block diagram illustrating an example of a data structure used by the neural network unit of Fig. 1 to perform a convolution operation.
Fig. 25 is a flowchart illustrating the operation of the processor of Fig. 1 executing an architectural program that uses the neural network unit to perform a convolution of the convolution kernel with the data array of Fig. 24.
Fig. 26A is a program listing of a neural network unit program that performs a convolution of a data matrix with the convolution kernel of Fig. 24 and writes the result back to the weight random access memory.
Fig. 26B is a block diagram illustrating an embodiment of certain fields of the control register of the neural network unit of Fig. 1.
Fig. 27 is a block diagram illustrating an example of the weight random access memory of Fig. 1 populated with input data on which the neural network unit of Fig. 1 performs a pooling operation.
Fig. 28 is a program listing of a neural network unit program that performs a pooling operation on the input data matrix of Fig. 27 and writes the result back to the weight random access memory.
Fig. 29A is a block diagram illustrating an embodiment of the control register of Fig. 1.
Fig. 29B is a block diagram illustrating another embodiment of the control register of Fig. 1.
Fig. 29C is a block diagram illustrating an embodiment of storing the reciprocal of Fig. 29A in two parts.
Fig. 30 is a block diagram illustrating an embodiment of the activation function unit (AFU) of Fig. 2.
Fig. 31 is an example of the operation of the activation function unit of Fig. 30.
Fig. 32 is a second example of the operation of the activation function unit of Fig. 30.
Fig. 33 is a third example of the operation of the activation function unit of Fig. 30.
Fig. 34 is a block diagram illustrating, in partial detail, the processor of Fig. 1 and the neural network unit.
Fig. 35 is a block diagram illustrating a processor with a variable-rate neural network unit.
Fig. 36A is a timing diagram illustrating an example of the processor with the neural network unit operating in normal mode, i.e., running at the primary clock rate.
Fig. 36B is a timing diagram illustrating an example of the processor with the neural network unit operating in relaxed mode, in which the neural network unit runs at a clock rate lower than the primary clock rate.
Fig. 37 is a flowchart illustrating the operation of the processor of Fig. 35.
Fig. 38 is a block diagram illustrating the sequencer of the neural network unit in detail.
Fig. 39 is a block diagram illustrating certain fields of the control and status registers of the neural network unit.
Figure 40 is block diagram, shows Elman time recurrent neural networks (recurrent neural network, RNN)
An example.
Figure 41 is block diagram, is shown when neutral net unit performs the Elman time recurrent neural networks for being associated with Figure 40
Calculating when, one of the data configuration in the data random access memory of neutral net unit and weight random access memory
Example.
Figure 42 is form, and display is stored in the program of the program storage of neutral net unit, and this program is by neutral net
Unit is performed, and the configuration according to Figure 41 uses data and weight, to reach Elman time recurrent neural networks.
Figure 43 is the example that block diagram shows Jordan time recurrent neural networks.
Figure 44 is block diagram, is shown when neutral net unit performs the Jordan time recurrent neural networks for being associated with Figure 43
Calculating when, one of the data configuration in the data random access memory of neutral net unit and weight random access memory
Example.
Figure 45 is form, and display is stored in the program of the program storage of neutral net unit, and this program is by neutral net
Unit is performed, and the configuration according to Figure 44 uses data and weight, to reach Jordan time recurrent neural networks.
Figure 46 is a block diagram illustrating an embodiment of a long short-term memory (LSTM) cell.
Figure 47 is a block diagram illustrating an example of the layout of data within the data random access memory and the weight random access memory of the neural network unit as it performs calculations associated with the layer of LSTM cells of Figure 46.
Figure 48 is a table illustrating a program stored in the program memory of the neural network unit and executed by the neural network unit, which uses data and weights according to the arrangement of Figure 47 to accomplish the calculations associated with the layer of LSTM cells.
Figure 49 is a block diagram illustrating an embodiment of a neural network unit in which the groups of neural processing units have output buffer masking and feedback capability.
Figure 50 is a block diagram illustrating an example of the layout of data within the data random access memory, the weight random access memory, and the output buffer of the neural network unit of Figure 49 as it performs calculations associated with the layer of LSTM cells of Figure 46.
Figure 51 is a table illustrating a program stored in the program memory of the neural network unit and executed by the neural network unit of Figure 49, which uses data and weights according to the arrangement of Figure 50 to accomplish the calculations associated with the layer of LSTM cells.
Figure 52 is a block diagram illustrating an embodiment of a neural network unit in which the groups of neural processing units have output buffer masking and feedback capability and share activation function units.
Figure 53 is a block diagram illustrating another embodiment of the layout of data within the data random access memory, the weight random access memory, and the output buffer of the neural network unit of Figure 49 as it performs calculations associated with the layer of LSTM cells of Figure 46.
Figure 54 is a table illustrating a program stored in the program memory of the neural network unit and executed by the neural network unit of Figure 49, which uses data and weights according to the arrangement of Figure 53 to accomplish the calculations associated with the layer of LSTM cells.
Figure 55 is a block diagram illustrating portions of a neural processing unit according to another embodiment of the present invention.
Figure 56 is a block diagram illustrating an example of the layout of data within the data random access memory and the weight random access memory of the neural network unit as it performs calculations associated with the Jordan RNN of Figure 43 while employing the embodiment of Figure 55.
Figure 57 is a table illustrating a program stored in the program memory of the neural network unit and executed by the neural network unit, which uses data and weights according to the arrangement of Figure 56 to accomplish the Jordan RNN.
Detailed Description
Processor with Architectural Neural Network Unit
Fig. 1 is a block diagram illustrating a processor 100 that includes a neural network unit (NNU) 121. As shown, the processor 100 includes an instruction fetch unit 101, an instruction cache 102, an instruction translator 104, a rename unit 106, reservation stations 108, media registers 118, general-purpose registers 116, execution units 112 other than the aforementioned NNU 121, and a memory subsystem 114.
The processor 100 is an electronic device that serves as the central processing unit (CPU) of an integrated circuit. The processor 100 receives digital data as input, processes the data according to instructions fetched from a memory, and generates as output the results of the operations prescribed by the instructions. The processor 100 may be employed in a desktop computer, a mobile device, or a tablet computer, and may be used for applications such as computation, word processing, multimedia display, and web browsing. The processor 100 may also be disposed in an embedded system to control a wide variety of devices, including appliances, mobile phones, smartphones, vehicles, and industrial controllers. A CPU is the electronic circuitry (i.e., hardware) that executes the instructions of a computer program (also known as a computer application or application program) by performing operations on data, including arithmetic, logical, and input/output operations. An integrated circuit is a set of electronic circuits fabricated on a small piece of semiconductor material, typically silicon. An integrated circuit is also commonly referred to as a chip, a microchip, or a die.
The instruction fetch unit 101 controls the fetching of architectural instructions 103 from system memory (not shown) into the instruction cache 102. The instruction fetch unit 101 provides a fetch address to the instruction cache 102 that specifies the memory address of the cache line of architectural instruction bytes fetched into the cache 102. The fetch address is selected based on the current value of the instruction pointer (not shown), or program counter, of the processor 100. Generally, the program counter is incremented sequentially by the size of an instruction until a control instruction such as a branch, call, or return appears in the instruction stream, or an exceptional condition such as an interrupt, trap, exception, or fault occurs, in which case the program counter is updated with a non-sequential address, such as a branch target address, a return address, or an exception vector. In short, the program counter is updated in response to execution of instructions by the execution units 112/121. The program counter may also be updated upon detection of an exceptional condition, for example when the instruction translator 104 encounters an instruction 103 that is not defined in the instruction set architecture of the processor 100.
The instruction cache 102 caches the architectural instructions 103 fetched from a system memory coupled to the processor 100. The architectural instructions 103 include a move-to-neural-network (MTNN) instruction and a move-from-neural-network (MFNN) instruction, which are described in more detail below. In one embodiment, the architectural instructions 103 are instructions of the x86 instruction set architecture, augmented with the MTNN and MFNN instructions. In this disclosure, an x86 instruction set architecture processor is understood to be a processor that, when executing the same machine-language instructions, generates the same results at the instruction set architecture level as an Intel® 80386® processor. However, other instruction set architectures, for example the Advanced RISC Machines (ARM) architecture, the Sun (SUN) Scalable Processor Architecture (SPARC), or the Performance Optimization With Enhanced RISC Performance Computing (PowerPC) architecture, may be employed in other embodiments of the invention. The instruction cache 102 provides the architectural instructions 103 to the instruction translator 104, which translates the architectural instructions 103 into microinstructions 105.
The microinstructions 105 are provided to the rename unit 106 and are eventually executed by the execution units 112/121. The microinstructions 105 implement the architectural instructions. In a preferred embodiment, the instruction translator 104 includes a first portion that translates frequently executed and/or relatively less complex architectural instructions 103 into microinstructions 105. The instruction translator 104 also includes a second portion that has a microcode unit (not shown). The microcode unit has a microcode memory that holds microcode instructions, which implement complex and/or infrequently used instructions of the architectural instruction set. The microcode unit also includes a microsequencer that provides a non-architectural micro-program counter (micro-PC) to the microcode memory. Preferably, the microcode instructions are translated into microinstructions 105 by a micro-translator (not shown). A selector selects the microinstructions 105 from either the first portion or the second portion, depending on whether or not the microcode unit currently has control, and provides them to the rename unit 106.
The rename unit 106 renames the architectural registers specified by the architectural instructions 103 to physical registers of the processor 100. In a preferred embodiment, the processor 100 includes a reorder buffer (not shown). The rename unit 106 allocates entries of the reorder buffer to the microinstructions 105 in program order, which enables the processor 100 to retire the microinstructions 105, and their corresponding architectural instructions 103, in program order. In one embodiment, the media registers 118 are 256 bits wide, and the general-purpose registers 116 are 64 bits wide. In one embodiment, the media registers 118 are x86 media registers, such as Advanced Vector Extensions (AVX) registers.
In one embodiment, each entry of the reorder buffer includes storage for the result of the microinstruction 105. Additionally, the processor 100 includes an architectural register file that has a physical register for each of the architectural registers, e.g., the media registers 118, the general-purpose registers 116, and the other architectural registers. (In a preferred embodiment, for example, since the media registers 118 and the general-purpose registers 116 are of different sizes, separate register files may be associated with the two types of registers.) For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with the reorder buffer index of the newest older microinstruction 105 that writes to the architectural register. When an execution unit 112/121 completes execution of a microinstruction 105, the execution unit 112/121 writes the result into the reorder buffer entry of the microinstruction 105. When the microinstruction 105 retires, a retire unit (not shown) writes the result from the reorder buffer field of the microinstruction into the register of the physical register file associated with the architectural destination register specified by the retiring microinstruction 105.
In another embodiment, the processor 100 includes a physical register file that has more physical registers than the number of architectural registers, but the processor 100 does not include an architectural register file, and the reorder buffer entries do not include result storage. (In a preferred embodiment, because the media registers 118 and the general-purpose registers 116 are of different sizes, separate register files may be associated with the two types of registers.) The processor 100 also includes a pointer table that has an associated pointer for each architectural register. For the destination operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the destination operand field of the microinstruction 105 with a pointer to a free register in the physical register file. If no register in the physical register file is free, the rename unit 106 stalls the pipeline. For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with a pointer to the register in the physical register file assigned to the newest older microinstruction 105 that writes to the architectural register. When an execution unit 112/121 completes execution of a microinstruction 105, the execution unit 112/121 writes the result into the register of the physical register file pointed to by the destination operand field of the microinstruction 105. When the microinstruction 105 retires, the retire unit copies the destination operand field value of the microinstruction 105 into the pointer of the pointer table associated with the architectural destination register specified by the retiring microinstruction 105.
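The pointer-table renaming scheme described above can be sketched in software as follows. This is a minimal illustration only: the class and register names are invented, and the retirement step that reclaims old physical registers is omitted.

```python
# Sketch of pointer-table register renaming: the table maps each
# architectural register to a physical register; a destination is given
# a free physical register, and sources read the newest prior mapping.
class Renamer:
    def __init__(self, arch_regs, num_phys):
        # initial mapping: architectural register i -> physical register i
        self.table = {r: i for i, r in enumerate(arch_regs)}
        # the remaining physical registers form the free list
        self.free = list(range(len(arch_regs), num_phys))

    def rename(self, dest, sources):
        # sources read the current (newest older writer's) mapping
        srcs = [self.table[s] for s in sources]
        if not self.free:
            raise RuntimeError("stall: no free physical register")
        phys = self.free.pop(0)   # allocate a free physical register
        self.table[dest] = phys   # later readers now see this mapping
        return phys, srcs

r = Renamer(["rax", "rbx"], num_phys=4)
dest, srcs = r.rename("rax", ["rax", "rbx"])
assert (dest, srcs) == (2, [0, 1])   # sources saw the old mapping of rax
assert r.table["rax"] == 2
```

Note how a later instruction reading `rax` would now receive physical register 2, which is precisely the dependency-tracking behavior the pointer table provides.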
The reservation stations 108 hold microinstructions 105 until they are ready to be issued to an execution unit 112/121 for execution. A microinstruction 105 is ready to be issued when all of its source operands are available and an execution unit 112/121 is available to execute it. The execution units 112/121 receive register source operands from the reorder buffer or the architectural register file in the case of the first embodiment described above, or from the physical register file in the case of the second embodiment described above. Additionally, the execution units 112/121 may receive register source operands directly from the execution units via a result forwarding bus (not shown). Furthermore, the execution units 112/121 may receive from the reservation stations 108 immediate operands specified by the microinstructions 105. The MTNN and MFNN architectural instructions 103 include an immediate operand that specifies the function to be performed by the NNU 121, which is provided in one or more of the microinstructions 105 into which the MTNN and MFNN architectural instructions 103 are translated, as described in more detail below.
The execution units 112 include one or more load/store units (not shown) that load data from the memory subsystem 114 and store data to the memory subsystem 114. In a preferred embodiment, the memory subsystem 114 includes a memory management unit (not shown), which may include, for example, multiple translation lookaside buffers, a tablewalk unit, a level-1 data cache (along with the instruction cache 102), a level-2 unified cache, and a bus interface unit that serves as the interface between the processor 100 and the system memory. In one embodiment, the processor 100 of Fig. 1 is representative of one of multiple processing cores of a multi-core processor that share a last-level cache. The execution units 112 may also include integer units, media units, floating-point units, and a branch unit.
The NNU 121 includes a weight random access memory (RAM) 124, a data RAM 122, N neural processing units (NPUs) 126, a program memory 129, a sequencer 128, and control and status registers 127. The NPUs 126 function conceptually as the neurons of a neural network. The weight RAM 124, the data RAM 122, and the program memory 129 are each writable and readable, respectively, by the MTNN and MFNN architectural instructions 103. The weight RAM 124 is arranged as W rows of N weight words each, and the data RAM 122 is arranged as D rows of N data words each. Each data word and each weight word is a plurality of bits, preferably 8, 9, 12, or 16 bits. Each data word serves as the output value of a neuron of the previous layer of the network (sometimes referred to as an activation value), and each weight word serves as a weight associated with a connection coming into a neuron of the current layer of the network. Although in many uses of the NNU 121 the words, or operands, loaded into the weight RAM 124 are in fact weights associated with connections into a neuron, it should be noted that in some uses of the NNU 121 the words loaded into the weight RAM 124 are not weights, but are nevertheless referred to as "weight words" because they are stored in the weight RAM 124. For example, in some uses of the NNU 121, such as the convolution example of Figs. 24 through 26A or the pooling example of Figs. 27 through 28, the weight RAM 124 may hold objects other than weights, such as elements of a data matrix (e.g., image pixel data). Similarly, although in many uses of the NNU 121 the words, or operands, loaded into the data RAM 122 are essentially the output, or activation, values of neurons, it should be noted that in some uses of the NNU 121 the words loaded into the data RAM 122 are not, but are nevertheless referred to as "data words" because they are stored in the data RAM 122. For example, in some uses of the NNU 121, such as the convolution example of Figs. 24 through 26A, the data RAM 122 may hold non-neuron outputs, such as elements of a convolution kernel.
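The row-oriented organization of the two memories described above can be sketched as follows. The parameters here are deliberately tiny for illustration; the embodiment described later in this section uses N = 512, D = 64, and W = 2048.

```python
# Sketch of the NNU memory organization: the data RAM holds D rows of
# N data words, the weight RAM holds W rows of N weight words, and one
# access delivers or stores an entire row (one word per NPU).
N = 8    # NPUs, i.e., words per row (hypothetical small value)
D = 4    # rows in the data RAM
W = 16   # rows in the weight RAM

# Python ints stand in for 8/9/12/16-bit words.
data_ram   = [[0] * N for _ in range(D)]   # activations (data words)
weight_ram = [[0] * N for _ in range(W)]   # connection weights (weight words)

def read_row(ram, addr):
    """One read delivers an entire row of N words, one per NPU."""
    return list(ram[addr])

def write_row(ram, addr, words):
    """One write stores the N result words produced by the NPUs."""
    assert len(words) == N
    ram[addr] = list(words)

write_row(data_ram, 0, list(range(N)))
assert read_row(data_ram, 0)[3] == 3   # word 3 of row 0 feeds NPU 3
```

The key property this models is that the memories are addressed by row, not by individual word: all N NPUs consume or produce a full row in a single access.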
In one embodiment, the NPUs 126 and the sequencer 128 comprise combinational logic, sequential logic, state machines, or a combination thereof. An architectural instruction (e.g., the MFNN instruction 1500) loads the contents of the status register 127 into one of the general-purpose registers 116 in order to determine the status of the NNU 121, e.g., that the NNU 121 has completed a command or a program from the program memory 129, or that the NNU 121 is free to receive a new command or to start a new NNU program.
Advantageously, the number of NPUs 126 may be increased as needed, and the width and depth of the weight RAM 124 and the data RAM 122 may be scaled accordingly. Preferably, the weight RAM 124 is larger than the data RAM 122, because a typical neural network layer includes many connections, and therefore requires larger storage for the weights associated with each neuron. Various embodiments are disclosed herein regarding the sizes of the data and weight words, the sizes of the weight RAM 124 and the data RAM 122, and the number of NPUs 126. In one embodiment, the NNU 121 has a data RAM 122 that is 64KB in size (8192 bits × 64 rows), a weight RAM 124 that is 2MB in size (8192 bits × 2048 rows), and 512 NPUs 126. This NNU 121 is manufactured in a Taiwan Semiconductor Manufacturing Company (TSMC) 16 nm process and occupies an area of approximately 3.3 square millimeters.
The sequencer 128 fetches instructions from the program memory 129 and executes them, which includes generating address and control signals to provide to the data RAM 122, the weight RAM 124, and the NPUs 126. The sequencer 128 generates a memory address 123 and a read command to the data RAM 122 in order to select one of the D rows of N data words to provide to the N NPUs 126. The sequencer 128 also generates a memory address 125 and a read command to the weight RAM 124 in order to select one of the W rows of N weight words to provide to the N NPUs 126. The sequence of the addresses 123 and 125 that the sequencer 128 provides to the NPUs 126 determines the "connections" between the neurons. The sequencer 128 also generates a memory address 123 and a write command to the data RAM 122 in order to select one of the D rows of N data words to be written by the N NPUs 126, and it generates a memory address 125 and a write command to the weight RAM 124 in order to select one of the W rows of N weight words to be written by the N NPUs 126. The sequencer 128 also generates a memory address 131 to the program memory 129 in order to select an NNU instruction to provide to the sequencer 128, as described in subsequent sections. The memory address 131 corresponds to a program counter (not shown) that the sequencer 128 normally increments through sequential locations of the program memory 129, unless the sequencer 128 encounters a control instruction, such as a loop instruction (see, e.g., Fig. 26A), in which case the sequencer 128 updates the program counter to the target address of the control instruction. The sequencer 128 also generates control signals to the NPUs 126 to instruct them to perform a wide variety of operations or functions, such as initialization, arithmetic/logical operations, rotate/shift operations, activation functions, and write-back operations; related examples are described in more detail in subsequent sections (see, e.g., the micro-operations 3418 of Fig. 34).
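The fetch/execute control flow of the sequencer described above can be modeled in a few lines. This is only an illustrative sketch: the instruction mnemonics, operand encodings, and loop semantics here are invented for the example, not taken from the patent's instruction formats.

```python
# Minimal software model of the sequencer: a program counter steps
# sequentially through the program memory, and a loop instruction
# redirects it back while iterations remain.
def run_sequencer(program):
    pc = 0            # program counter into the program memory
    loop_counter = 0
    issued = []       # addresses/commands issued to the RAMs and NPUs
    while pc < len(program):
        op = program[pc]
        if op[0] == "INITIALIZE":     # set the loop counter
            loop_counter = op[1]
            pc += 1
        elif op[0] == "MULT_ACCUM":   # read one data row and one weight row
            issued.append(("read_data_row", op[1], "read_weight_row", op[2]))
            pc += 1
        elif op[0] == "LOOP":         # branch to target while iterations remain
            loop_counter -= 1
            pc = op[1] if loop_counter > 0 else pc + 1
        elif op[0] == "WRITE":        # write the NPU results to a data row
            issued.append(("write_data_row", op[1]))
            pc += 1
    return issued

ops = run_sequencer([("INITIALIZE", 3), ("MULT_ACCUM", 0, 0),
                     ("LOOP", 1), ("WRITE", 4)])
assert sum(1 for o in ops if o[0] == "read_data_row") == 3
assert ops[-1] == ("write_data_row", 4)
```

The point of the sketch is the shape of the control: sequential program-counter advance, a counted backward branch, and a stream of row addresses issued to the two RAMs.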
The N NPUs 126 generate N result words 133, which may be written back to a row of the weight RAM 124 or of the data RAM 122. Preferably, the weight RAM 124 and the data RAM 122 are directly coupled to the N NPUs 126. More specifically, the weight RAM 124 and the data RAM 122 are dedicated to the NPUs 126 and are not shared by the other execution units 112 of the processor 100, so that the NPUs 126 are able to consume, in a sustained manner, a row from one or both of the weight RAM 124 and the data RAM 122 during each clock cycle, preferably in a pipelined fashion. In one embodiment, each of the data RAM 122 and the weight RAM 124 can provide 8192 bits to the NPUs 126 during each clock cycle. The 8192 bits may be consumed as 512 16-bit words or as 1024 8-bit words, as described in more detail below.
Advantageously, the size of the data set that the NNU 121 can process is not limited by the sizes of the weight RAM 124 and the data RAM 122, but is limited only by the size of system memory, because data and weights may be moved between system memory and the weight RAM 124 and data RAM 122 by the use of the MTNN and MFNN instructions (e.g., through the media registers 118). In one embodiment, the data RAM 122 is dual-ported, which enables data words to be written into the data RAM 122 concurrently with data words being read from, or written to, the data RAM 122. Furthermore, the large memory hierarchy of the memory subsystem 114, including the cache memories, provides very large data bandwidth for transfers between the system memory and the NNU 121. Still further, in a preferred embodiment, the memory subsystem 114 includes hardware data prefetchers that track memory access patterns, such as loads of neural data and weights from system memory, and perform data prefetches into the cache hierarchy in order to facilitate high-bandwidth, low-latency transfers into the weight RAM 124 and the data RAM 122.
Although embodiments are described herein in which one of the operands provided to each NPU 126 is supplied from a weight memory and is denoted a weight, which term is commonly used in neural networks, it should be understood that the operands may be other types of data associated with computations whose speed may be improved by these apparatuses.
Fig. 2 is a block diagram illustrating an NPU 126 of Fig. 1. The NPU 126 operates to perform many functions, or operations. In particular, the NPU 126 may operate as a neuron, or node, in an artificial neural network to perform a typical multiply-accumulate function, or operation. That is, generally speaking, the NPU 126 (neuron) is configured to: (1) receive an input value from each neuron with which it has a connection, typically but not necessarily from the immediately preceding layer of the artificial neural network; (2) multiply each input value by the corresponding weight value associated with the connection to generate a product; (3) add all the products to generate a sum; and (4) perform an activation function on the sum to generate the output of the neuron. However, rather than performing all the multiplications associated with all the connection inputs and then summing their products together, as in a conventional manner, advantageously each neuron of the present invention is configured to perform, in a given clock cycle, the multiplication associated with one of the connection inputs, and then to add (accumulate) the product to the accumulated value of the products of the connection inputs processed in the preceding clock cycles up to that point. Assuming a total of M connections into the neuron, after all M products have been accumulated (which takes approximately M clock cycles), the neuron performs the activation function on the accumulated value to generate the output, or result. This approach has the advantage of reducing the number of multipliers required, and of requiring only a single, smaller, simpler, and faster adder circuit in the neuron (e.g., a two-input adder), rather than an adder capable of summing all of the products of the connection inputs, or even a subset of them. This approach in turn facilitates a very large number (N) of neurons (NPUs 126) in the NNU 121, such that after approximately M clock cycles, the NNU 121 has generated the outputs of all of this large number (N) of neurons. Finally, for a large number of different connection inputs, the NNU 121 constructed of such neurons performs efficiently as an artificial neural network layer. That is, as M increases or decreases across different layers, the number of clock cycles required to generate the neuron outputs correspondingly increases or decreases, and the resources (e.g., the multipliers and accumulators) remain fully utilized. By contrast, in a conventional design, some of the multipliers and a portion of the adder would go unused for smaller values of M. Thus, responsive to the number of connection inputs of the neural network layer, the embodiments described herein combine the advantages of flexibility and efficiency, and provide high performance.
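The one-product-per-clock accumulation scheme described above can be expressed as a short reference model. This is a behavioral sketch only: in the hardware, the inner loop over the N neurons happens in parallel in a single clock cycle, while the outer loop over the M connections corresponds to the approximately M clock cycles mentioned above.

```python
# Model of the accumulate-one-product-per-clock scheme: each of N
# neurons owns one multiplier and one two-input adder, and a layer
# with M connections finishes in about M clock cycles.
def layer_outputs(inputs, weights, activation):
    """inputs: M input values; weights: N rows of M weights (one row per neuron)."""
    M = len(inputs)
    acc = [0] * len(weights)          # one accumulator per neuron
    for m in range(M):                # one clock cycle per connection
        for j in range(len(weights)): # all N neurons operate in parallel in hardware
            acc[j] += weights[j][m] * inputs[m]
    return [activation(a) for a in acc]

relu = lambda x: max(0, x)            # one possible activation function
out = layer_outputs([1, 2, 3], [[1, 0, -1], [2, 2, 2]], relu)
# neuron 0: 1*1 + 0*2 + (-1)*3 = -2 -> relu -> 0 ; neuron 1: 2 + 4 + 6 = 12
assert out == [0, 12]
```

Note that the cycle count scales with M while the per-neuron hardware stays constant, which is the flexibility/efficiency trade-off the text describes.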
The NPU 126 includes a register 205, a dual-input multiplexed register (mux-reg) 208, an arithmetic logic unit (ALU) 204, an accumulator 202, and an activation function unit (AFU) 212. The register 205 receives a weight word 206 from the weight RAM 124 and provides its output 203 on a subsequent clock cycle. The mux-reg 208 selects one of its two inputs 207 and 211 to store in its register and to provide on its output 209 on a subsequent clock cycle. The input 207 receives a data word from the data RAM 122. The other input 211 receives the output 209 of the adjacent NPU 126. The NPU 126 shown in Fig. 2 is denoted NPU J of the N NPUs of Fig. 1. That is, NPU J is a representative instance of the N NPUs 126. In a preferred embodiment, the input 211 of the mux-reg 208 of the NPU-J instance of the NPUs 126 receives the output 209 of the mux-reg 208 of the NPU-(J-1) instance of the NPUs 126, and the output 209 of the mux-reg 208 of NPU J is provided to the input 211 of the mux-reg 208 of the NPU-(J+1) instance of the NPUs 126. In this manner, the mux-regs 208 of the N NPUs 126 collectively operate as an N-word rotator, or circular shifter, as described in more detail with respect to Fig. 3 below. A control input 213 controls which of the two inputs the mux-reg 208 selects to store in its register and to subsequently provide on the output 209.
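The collective rotator behavior of the mux-regs described above can be sketched as follows; this is an illustrative model with invented control names, not the hardware's actual control encoding.

```python
# Sketch of the N-word rotator formed by the mux-regs: each NPU J either
# loads a fresh data word from the data RAM (input 207) or takes the word
# held by NPU J-1 (input 211), so one shared control signal rotates the
# whole row of data words across the NPUs.
def step(mux_regs, control, data_row=None):
    N = len(mux_regs)
    if control == "load":       # every mux-reg selects input 207
        return list(data_row)
    elif control == "rotate":   # every mux-reg selects input 211
        return [mux_regs[(j - 1) % N] for j in range(N)]

row = step([None] * 4, "load", data_row=[10, 20, 30, 40])
row = step(row, "rotate")
assert row == [40, 10, 20, 30]  # each word moved to the next NPU
```

After N rotate steps every NPU has seen every word of the row once, which is what lets each neuron multiply-accumulate against all N data words without re-reading the data RAM.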
The ALU 204 has three inputs. One input receives the weight word 203 from the register 205. Another input receives the output 209 of the mux-reg 208. The third input receives the output 217 of the accumulator 202. The ALU 204 performs arithmetic and/or logical operations on its inputs to generate a result provided on its output. In a preferred embodiment, the arithmetic and/or logical operations performed by the ALU 204 are specified by the instructions stored in the program memory 129. For example, the multiply-accumulate instruction of Fig. 4 specifies a multiply-accumulate operation, i.e., the result 215 is the sum of the accumulator 202 value 217 and the product of the weight word 203 and the data word of the mux-reg 208 output 209. Other operations may also be specified, including but not limited to: the result 215 is the passed-through value of the mux-reg output 209; the result 215 is the passed-through value of the weight word 203; the result 215 is zero; the result 215 is the sum of the accumulator 202 value 217 and the weight 203; the result 215 is the sum of the accumulator 202 value 217 and the mux-reg output 209; the result 215 is the maximum of the accumulator 202 value 217 and the weight 203; and the result 215 is the maximum of the accumulator 202 value 217 and the mux-reg output 209.
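The operations enumerated above can be summarized as a single function over the three ALU inputs. The operation names below are shorthand for this sketch, not the patent's instruction mnemonics.

```python
# The ALU 204 operations listed above, written as one function over the
# three inputs: accumulator value, weight word, and mux-reg data word.
def alu(op, acc, weight, data):
    ops = {
        "mult_accum":  acc + weight * data,   # multiply-accumulate (Fig. 4)
        "pass_data":   data,                  # pass mux-reg output through
        "pass_weight": weight,                # pass weight word through
        "zero":        0,
        "add_weight":  acc + weight,
        "add_data":    acc + data,
        "max_weight":  max(acc, weight),
        "max_data":    max(acc, data),
    }
    return ops[op]

assert alu("mult_accum", acc=10, weight=3, data=4) == 22
assert alu("max_data", acc=10, weight=3, data=40) == 40
```

The maximum operations are what later enable max pooling, and the pass-through operations enable moving rows between the RAMs through the NPUs.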
The arithmetic logic unit 204 provides its output 215 to the accumulator 202 for storage. The arithmetic logic unit 204 includes a multiplier 242 that multiplies the weight word 203 by the data word of the mux-reg output 209 to produce a product 246. In one embodiment, the multiplier 242 multiplies two 16-bit operands to produce a 32-bit result. The arithmetic logic unit 204 also includes an adder 244 that adds the product 246 to the accumulator 202 output 217 to produce a sum, which is the result 215 stored to the accumulator 202 as the accumulated value. In one embodiment, the adder 244 adds the 41-bit value 217 of the accumulator 202 to the 32-bit result of the multiplier 242 to produce a 41-bit result. In this way, using the rotator property of the mux-regs 208 over the course of multiple clock cycles, the neural processing unit 126 accomplishes the accumulation of products for a neuron as required by neural networks. The arithmetic logic unit 204 may also include other circuit elements to perform other arithmetic/logical operations such as those described above. In one embodiment, a second adder subtracts the weight word 203 from the data word of the mux-reg output 209 to produce a difference, and the adder 244 then adds the difference to the accumulator 202 output 217 to produce a result 215, which is the accumulated result in the accumulator 202. In this way, over the course of multiple clock cycles, the neural processing unit 126 can accomplish the accumulation of differences. For a preferred embodiment, although the weight word 203 and the data word 209 are the same size (in bits), they may have different binary point positions, as described below. For a preferred embodiment, the multiplier 242 and the adder 244 are integer units; compared with an arithmetic logic unit that employs floating-point arithmetic, this arithmetic logic unit 204 has the advantages of low complexity, small size, high speed, and low power consumption. Nevertheless, in other embodiments of the present invention, the arithmetic logic unit 204 performs floating-point operations.
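A behavioral sketch (in Python, not a description of the actual circuitry) of the multiply-accumulate path just described — a 32-bit product of two signed 16-bit operands accumulated into a 41-bit value — might look as follows; the bit widths are those of the embodiment above, while the function names are illustrative only:

```python
# Sketch of the NPU multiply-accumulate path (hypothetical model, not RTL).
# Two signed 16-bit operands -> signed 32-bit product 246 -> 41-bit result 215.

def to_signed(value, bits):
    """Interpret a raw bit pattern as a two's-complement signed integer."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mac_step(accumulator, data_word, weight_word):
    """One multiply-accumulate: accumulator 217 + (data 209 * weight 203)."""
    product = to_signed(data_word, 16) * to_signed(weight_word, 16)  # product 246
    return to_signed(accumulator + product, 41)                      # result 215

acc = 0
for data, weight in [(3, 4), (0xFFFF, 2), (100, -5)]:  # 0xFFFF is -1 as signed 16-bit
    acc = mac_step(acc, data, weight)
print(acc)  # 3*4 + (-1)*2 + 100*(-5) = -490
```

The 41-bit accumulator leaves headroom above the 32-bit products, which is why (as noted later) up to 512 such products can be summed without losing precision.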
Although Fig. 2 shows only a multiplier 242 and an adder 244 in the arithmetic logic unit 204, for a preferred embodiment the arithmetic logic unit 204 also includes other elements to perform the other operations described above. For example, the arithmetic logic unit 204 may include a comparator (not shown) that compares the accumulator 202 with the data/weight word, and a multiplexer (not shown) that selects the larger (maximum) of the two values indicated by the comparator for storage into the accumulator 202. In another example, the arithmetic logic unit 204 includes selection logic (not shown) that bypasses the multiplier 242 with the data/weight word, enabling the adder 244 to add the data/weight word to the accumulator 202 value 217 to produce a sum for storage into the accumulator 202. These additional operations are described in more detail below, for example with respect to Figures 18 through 29A, and are useful for performing, among other things, convolution and pooling operations.
The activation function unit 212 receives the output 217 of the accumulator 202. The activation function unit 212 performs an activation function on the accumulator 202 output to produce the result 133 of Fig. 1. Generally speaking, the activation function in a neuron of an intermediate layer of an artificial neural network serves to normalize the accumulated sum of products, and may do so in a nonlinear fashion. To "normalize" the accumulated sum, the activation function of the current neuron produces a resulting value within a range of values that the neurons connected to the current neuron expect to receive as input. (The normalized result is sometimes referred to as an "activation"; as used here, an activation is the output of the current node, which the receiving node multiplies by a weight associated with the connection between the output node and the receiving node to produce a product, which is accumulated with the products associated with the other input connections of the receiving node.) For example, in the case where the receiving/connected neurons expect to receive as input a value between 0 and 1, the output neuron may need to nonlinearly squash and/or adjust (e.g., shift upward to transform negative values to positive values) its accumulated sum when it falls outside the range of 0 to 1, so that it falls within the expected range. Thus, the operation the activation function unit 212 performs on the accumulator 202 value 217 brings the result 133 into a known range. The results 133 of all N neural processing units 126 may be written back concurrently to either the data random access memory 122 or the weight random access memory 124. For a preferred embodiment, the activation function unit 212 is capable of performing multiple activation functions, and an input, for example from the control register 127, selects one of them to perform on the accumulator 202 output 217. The activation functions may include, but are not limited to, a step function, a rectify function, a sigmoid function, a hyperbolic tangent function, and a softplus function (also referred to as smooth rectify). The analytical form of the softplus function is f(x) = ln(1 + e^x), that is, the natural logarithm of the sum of 1 and e^x, where "e" is Euler's number and x is the input 217 of the function. For a preferred embodiment, the activation functions may also include a pass-through function that passes through the accumulator 202 value 217, or a portion thereof, unmodified, as described below. In one embodiment, circuitry of the activation function unit 212 performs the activation function within a single clock cycle. In one embodiment, the activation function unit 212 includes tables that receive the accumulated value and output a value that, for certain of the activation functions, such as sigmoid, hyperbolic tangent, and softplus, closely approximates the value the true activation function would provide.
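For concreteness, the rectify, sigmoid, and softplus functions named above can be written out directly; the short Python sketch below uses the textbook definitions (the hardware activation function unit 212 would typically approximate these with tables, as noted above):

```python
import math

# Reference versions of activation functions listed above (software sketches;
# the hardware AFU 212 may approximate these with lookup tables).
def rectify(x):       # rectified linear: max(0, x)
    return max(0.0, x)

def sigmoid(x):       # S-shaped (logistic) function
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):      # "smooth rectify": ln(1 + e^x)
    return math.log(1.0 + math.exp(x))

for x in (-2.0, 0.0, 2.0):
    print(x, rectify(x), round(sigmoid(x), 4), round(softplus(x), 4))
```

Note how each maps the unbounded accumulated sum into (or toward) a bounded range, which is the "normalization" role described above.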
For a preferred embodiment, the width (in bits) of the accumulator 202 is greater than the width of the activation function unit 212 output 133. For example, in one embodiment the accumulator is 41 bits wide, to avoid loss of precision when accumulating up to 512 32-bit products (as described in more detail below with respect to Figure 30), and the result 133 is 16 bits wide. In one embodiment, during subsequent clock cycles, the activation function unit 212 passes through other, raw portions of the accumulator 202 output 217, which may be written back to the data random access memory 122 or the weight random access memory 124, as described in more detail below with respect to Fig. 8. This enables the raw accumulator 202 values to be loaded back to the media registers 118 via an MFNN instruction so that instructions executing on the other execution units 112 of the processor 100 can perform complex activation functions that the activation function unit 212 is unable to perform, such as the well-known softmax function, also referred to as the normalized exponential function. In one embodiment, the instruction set architecture of the processor 100 includes an instruction that performs the exponential function, commonly denoted e^x or exp(x), which may be used by the other execution units 112 of the processor 100 to increase the execution speed of the softmax activation function.
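The softmax (normalized exponential) function mentioned above, which is left to the other execution units 112, is simple to state in software; this is a generic reference definition, not the processor's actual instruction sequence:

```python
import math

def softmax(values):
    """Normalized exponential: exp(x_i) / sum_j exp(x_j).
    Subtracting the max first is the usual numerical-stability trick."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])  # the probabilities sum to 1.0
```

Each output depends on every input through the shared denominator, which is why softmax does not fit the per-neuron activation function unit 212 and is instead computed from raw accumulator values retrieved via MFNN.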
In one embodiment, the neural processing unit 126 is pipelined. For example, the neural processing unit 126 may include registers of the arithmetic logic unit 204, such as a register between the multiplier and the adder and/or other circuits of the arithmetic logic unit 204, and the neural processing unit 126 may also include a register that holds the output of the activation function unit 212. Other embodiments of the neural processing unit 126 are described below.
Fig. 3 is a block diagram illustrating the N mux-regs 208 of the N neural processing units 126 of the neural network unit 121 of Fig. 1, and their operation as an N-word rotator, or circular shifter, for a row of data words 207 received from the data random access memory 122 of Fig. 1. In the embodiment of Fig. 3, N is 512; hence, the neural network unit 121 has 512 mux-regs 208, denoted 0 through 511, corresponding to the 512 neural processing units 126. Each mux-reg 208 receives its corresponding data word 207 from one of the D rows of the data random access memory 122. That is, mux-reg 0 receives data word 0 of the data random access memory 122 row, mux-reg 1 receives data word 1, mux-reg 2 receives data word 2, and so forth, through mux-reg 511, which receives data word 511. Additionally, mux-reg 1 receives the output 209 of mux-reg 0 as its other input 211, mux-reg 2 receives the output 209 of mux-reg 1 as its other input 211, mux-reg 3 receives the output 209 of mux-reg 2 as its other input 211, and so forth, through mux-reg 511, which receives the output 209 of mux-reg 510 as its other input 211; and mux-reg 0 receives the output 209 of mux-reg 511 as its other input 211. Each mux-reg 208 receives a control input 213 that controls whether it selects the data word 207 or the rotated input 211. In one mode of operation, during a first clock cycle the control input 213 controls each mux-reg 208 to select the data word 207 for storage in the register and subsequent provision to the arithmetic logic unit 204, and during subsequent clock cycles (e.g., M-1 clock cycles, as described above), the control input 213 controls each mux-reg 208 to select the rotated input 211 for storage in the register and subsequent provision to the arithmetic logic unit 204.
Although Fig. 3 (as well as Figures 7 and 19 below) describes an embodiment in which the neural processing units 126 rotate the values of the mux-regs 208/705 to the right, that is, from neural processing unit J to neural processing unit J+1, the present invention is not limited in this respect; in other embodiments (such as the embodiments corresponding to Figures 24 through 26), the neural processing units 126 rotate the mux-reg 208/705 values to the left, that is, from neural processing unit J to neural processing unit J-1. Furthermore, in other embodiments of the present invention, the neural processing units 126 may rotate the mux-reg 208/705 values selectively to the left or to the right, as specified, for example, by the neural network unit instructions.
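The collective rotator behavior of the mux-regs 208 can be modeled with a few lines of Python; this is a behavioral sketch with a small N for readability (the embodiment above uses N = 512), and the variable and function names are illustrative:

```python
# Behavioral sketch of the N mux-regs 208 acting as a rotator (small N for
# illustration; the embodiment above uses N = 512, rotating toward NPU J+1).
N = 8
regs = list(range(100, 100 + N))   # first cycle: mux-reg J latches data word J (207)

def rotate_step(regs):
    """Each mux-reg J selects the rotated input 211 from mux-reg (J-1) mod N,
    so the value held by NPU J moves to NPU J+1."""
    return [regs[(j - 1) % len(regs)] for j in range(len(regs))]

r = regs
for _ in range(N):       # after N rotate steps the row is back where it started
    r = rotate_step(r)
print(r == regs)  # True
```

Because the rotation is a cyclic permutation, after N-1 rotate steps every neural processing unit has held every data word of the row exactly once — which is precisely what the multiply-accumulate rotate instruction of Fig. 4 exploits.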
Fig. 4 is a table showing a program stored in the program memory 129 of the neural network unit 121 of Fig. 1 and executed by the neural network unit 121. As described above, this example program performs the calculations associated with one layer of an artificial neural network. The table of Fig. 4 shows four rows and three columns. Each row corresponds to an address in the program memory 129, shown in the first column. The second column specifies the corresponding instruction, and the third column indicates the number of clock cycles associated with the instruction. For a preferred embodiment, the clock-cycle count indicates the effective number of clocks per instruction in a pipelined embodiment, rather than the latency of the instruction. As shown, because of the pipelined nature of the neural network unit 121, each instruction has an associated single clock cycle, with the exception of the instruction at address 2, which effectively repeats itself 511 times and thus consumes 511 clock cycles, as described below.
All of the neural processing units 126 process each instruction of the program in parallel. That is, all N neural processing units 126 execute the instruction in the first row in the same clock cycle(s), all N neural processing units 126 execute the instruction in the second row in the same clock cycle(s), and so forth. However, the present invention is not limited in this respect; in other embodiments described below, some instructions are executed in a partly parallel, partly sequential fashion; for example, as described with respect to the embodiment of Fig. 11, in an embodiment in which multiple neural processing units 126 share an activation function unit, the activation function and output instructions at addresses 3 and 4 are executed in this fashion. The example of Fig. 4 assumes a layer of 512 neurons (neural processing units 126), each having 512 connection inputs from the 512 neurons of the previous layer, for a total of 256K connections. Each neuron receives a 16-bit data value from each connection input and multiplies the 16-bit data value by an appropriate 16-bit weight value.
The first row, at address 0 (although other addresses may be specified), specifies an initialize neural processing unit instruction. The initialize instruction clears the accumulator 202 value to zero. In one embodiment, the initialize instruction may also load into the accumulator 202 the corresponding word of a row of the data random access memory 122 or the weight random access memory 124 addressed by the instruction. The initialize instruction may also load configuration values into the control register 127, as described in more detail below with respect to Figures 29A and 29B. For example, the widths of the data word 207 and the weight word 206 may be loaded for use by the arithmetic logic unit 204 to determine the sizes of the operations performed by its circuits, which may also affect the result 215 stored to the accumulator 202. In one embodiment, the neural processing unit 126 includes a circuit that saturates the output 215 of the arithmetic logic unit 204 before it is stored into the accumulator 202, and the initialize instruction may load a configuration value into the circuit that affects the saturation. In one embodiment, the accumulator 202 may also be cleared to zero by so specifying in an arithmetic logic unit function instruction (e.g., the multiply-accumulate instruction at address 1) or in an output instruction (e.g., the write activation function unit output instruction at address 4).
The second row, at address 1, specifies a multiply-accumulate instruction that instructs the 512 neural processing units 126 to load a corresponding data word from a row of the data random access memory 122 and a corresponding weight word from a row of the weight random access memory 124, and to perform a first multiply-accumulate operation on the data word input 207 and the weight word input 206, which is added to the initialized accumulator 202 zero value. More specifically, the instruction instructs the sequencer 128 to generate a value on the control input 213 to select the data word input 207. In the example of Fig. 4, the specified data random access memory 122 row is row 17 and the specified weight random access memory 124 row is row 0, so the sequencer is instructed to output the value 17 as the data random access memory address 123 and the value 0 as the weight random access memory address 125. Consequently, the 512 data words from row 17 of the data random access memory 122 are provided as the corresponding data inputs 207 of the 512 neural processing units 126, and the 512 weight words from row 0 of the weight random access memory 124 are provided as the corresponding weight inputs 206 of the 512 neural processing units 126.
The third row, at address 2, specifies a multiply-accumulate rotate instruction with a count of 511, which instructs the 512 neural processing units 126 to perform 511 multiply-accumulate operations. The instruction instructs the 512 neural processing units 126 that, for each of the 511 multiply-accumulate operations, the data word 209 input to the arithmetic logic unit 204 is to be the rotated value 211 from the adjacent neural processing unit 126. That is, the instruction instructs the sequencer 128 to generate a value on the control input 213 to select the rotated value 211. Additionally, the instruction instructs the 512 neural processing units 126 to load the corresponding weight value for each of the 511 multiply-accumulate operations from the "next" row of the weight random access memory 124. That is, the instruction instructs the sequencer 128 to increment the weight random access memory address 125 by one relative to its value in the previous clock cycle, which in this example is row 1 on the first clock cycle of the instruction, row 2 on the next clock cycle, row 3 on the next, and so forth, through row 511 on the 511th clock cycle. For each of the 511 multiply-accumulate operations, the product of the rotated input 211 and the weight word input 206 is added to the previous value of the accumulator 202. The 512 neural processing units 126 perform the 511 multiply-accumulate operations in 511 clock cycles, each neural processing unit 126 performing a multiply-accumulate operation on a different data word from row 17 of the data random access memory 122 — namely, the data word on which the adjacent neural processing unit 126 operated in the previous clock cycle — and a different weight word associated with the data word, conceptually a different connection input of the neuron. The example assumes that each neural processing unit 126 (neuron) has 512 connection inputs, thus involving 512 data words and 512 weight words. Once the last iteration of the multiply-accumulate rotate instruction of row 2 completes, the accumulator 202 contains the sum of the products of all 512 connection inputs. In one embodiment, rather than having a separate instruction for each type of arithmetic logic unit operation (e.g., multiply-accumulate, maximum of accumulator and weight, and so forth, as described above), the instruction set of the neural processing unit 126 includes an "execute" instruction that instructs the arithmetic logic unit 204 to perform the arithmetic logic unit operation specified by the initialize neural processing unit instruction, such as the operation specified in the ALU function 2926 of Figure 29A.
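Taken together, the instructions at addresses 1 and 2 give every neural processing unit a full dot product of the data row with one weight column, even though each unit reads only its own data word from memory. A small-N behavioral sketch (N = 4 in place of 512; the variable names are illustrative):

```python
# Behavioral sketch of the Fig. 4 inner loop (N = 4 instead of 512).
# data_row: one row of data RAM 122; weights[r][j]: word j of row r of weight RAM 124.
N = 4
data_row = [1, 2, 3, 4]
weights = [[(r + 1) * (j + 1) for j in range(N)] for r in range(N)]

acc = [0] * N
mux = list(data_row)                    # address 1: each mux-reg latches its data word
for j in range(N):                      # address 1: first multiply-accumulate (weight row 0)
    acc[j] += mux[j] * weights[0][j]
for r in range(1, N):                   # address 2: N-1 rotate-and-accumulate iterations
    mux = [mux[(j - 1) % N] for j in range(N)]   # rotate toward NPU j+1
    for j in range(N):
        acc[j] += mux[j] * weights[r][j]

# Each accumulator equals the dot product of the data row, in rotated order,
# with its weight column: NPU j has seen every data word exactly once.
expected = [sum(data_row[(j - r) % N] * weights[r][j] for r in range(N))
            for j in range(N)]
print(acc == expected)  # True
```

The closed-form `expected` makes the indexing explicit: at iteration r, neural processing unit j holds data word (j - r) mod N and multiplies it by word j of weight row r.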
The fourth row, at address 3, specifies an activation function instruction. The activation function instruction instructs the activation function unit 212 to perform the specified activation function on the accumulator 202 value 217 to produce the result 133. Embodiments of the activation functions are described in more detail below.
The fifth row, at address 4, specifies a write activation function unit output instruction, which instructs the 512 neural processing units 126 to write back their activation function unit 212 outputs 133 as results to a row of the data random access memory 122 — in this example, row 16. That is, the instruction instructs the sequencer 128 to output the value 16 as the data random access memory address 123, along with a write command (in contrast to the read command associated with the multiply-accumulate instruction at address 1). For a preferred embodiment, owing to the pipelined nature of execution, the write activation function unit output instruction may execute concurrently with other instructions, so that it effectively executes within a single clock cycle.
For a preferred embodiment, each neural processing unit 126 operates as a pipeline having various functional elements, e.g., the mux-reg 208 (and the mux-reg 705 of Fig. 7), the arithmetic logic unit 204, the accumulator 202, the activation function unit 212, the multiplexer 802 (see Fig. 8), the row buffer 1104 and the activation function units 1112 (see Fig. 11), and so forth, some of which may themselves be pipelined. In addition to the data words 207 and the weight words 206, the pipeline receives instructions from the program memory 129. The instructions flow down the pipeline and control the various functional units. In an alternative embodiment, the program does not include an activation function instruction; rather, the initialize neural processing unit instruction specifies the activation function to be performed on the accumulator 202 value 217, and a value indicating the specified activation function is saved in a configuration register for later use by the activation function unit 212 portion of the pipeline once the final accumulator 202 value 217 has been produced, that is, once the last execution of the multiply-accumulate rotate instruction at address 2 has completed. For a preferred embodiment, to save power, the activation function unit 212 portion of the pipeline remains inactive until the write activation function unit output instruction reaches it, at which time the activation function unit 212 is powered up and performs the activation function on the accumulator 202 output 217 specified by the initialize instruction.
Fig. 5 is a timing diagram illustrating the execution of the program of Fig. 4 by the neural network unit 121. Each row of the timing diagram corresponds to a successive clock cycle indicated in the first column. The other columns correspond to different ones of the 512 neural processing units 126 and indicate their operations. Only the operations of neural processing units 0, 1, and 511 are shown, to simplify the illustration.
At clock cycle 0, each of the 512 neural processing units 126 performs the initialize instruction of Fig. 4, illustrated in Fig. 5 by the assignment of a zero value to the accumulator 202.
At clock cycle 1, each of the 512 neural processing units 126 performs the multiply-accumulate instruction at address 1 of Fig. 4. As shown, neural processing unit 0 adds to the accumulator 202 value (i.e., zero) the product of word 0 of row 17 of the data random access memory 122 and word 0 of row 0 of the weight random access memory 124; neural processing unit 1 adds to the accumulator 202 value (i.e., zero) the product of word 1 of row 17 of the data random access memory 122 and word 1 of row 0 of the weight random access memory 124; and so forth, through neural processing unit 511, which adds to the accumulator 202 value (i.e., zero) the product of word 511 of row 17 of the data random access memory 122 and word 511 of row 0 of the weight random access memory 124.
At clock cycle 2, each of the 512 neural processing units 126 performs the first iteration of the multiply-accumulate rotate instruction at address 2 of Fig. 4. As shown, neural processing unit 0 adds to the accumulator 202 value the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 511 (namely, data word 511 received from the data random access memory 122) and word 0 of row 1 of the weight random access memory 124; neural processing unit 1 adds to the accumulator 202 value the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 0 (namely, data word 0 received from the data random access memory 122) and word 1 of row 1 of the weight random access memory 124; and so forth, through neural processing unit 511, which adds to the accumulator 202 value the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 510 (namely, data word 510 received from the data random access memory 122) and word 511 of row 1 of the weight random access memory 124.
At clock cycle 3, each of the 512 neural processing units 126 performs the second iteration of the multiply-accumulate rotate instruction at address 2 of Fig. 4. As shown, neural processing unit 0 adds to the accumulator 202 value the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 511 (namely, data word 510 received from the data random access memory 122) and word 0 of row 2 of the weight random access memory 124; neural processing unit 1 adds to the accumulator 202 value the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 0 (namely, data word 511 received from the data random access memory 122) and word 1 of row 2 of the weight random access memory 124; and so forth, through neural processing unit 511, which adds to the accumulator 202 value the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 510 (namely, data word 509 received from the data random access memory 122) and word 511 of row 2 of the weight random access memory 124. As indicated by the ellipsis of Fig. 5, the following 509 clock cycles proceed in this fashion, through clock cycle 512.
At clock cycle 512, each of the 512 neural processing units 126 performs the 511th iteration of the multiply-accumulate rotate instruction at address 2 of Fig. 4. As shown, neural processing unit 0 adds to the accumulator 202 value the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 511 (namely, data word 1 received from the data random access memory 122) and word 0 of row 511 of the weight random access memory 124; neural processing unit 1 adds to the accumulator 202 value the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 0 (namely, data word 2 received from the data random access memory 122) and word 1 of row 511 of the weight random access memory 124; and so forth, through neural processing unit 511, which adds to the accumulator 202 value the product of the rotated data word 211 received from the mux-reg 208 output 209 of neural processing unit 510 (namely, data word 0 received from the data random access memory 122) and word 511 of row 511 of the weight random access memory 124. In one embodiment, multiple clock cycles are required to read the data words and weight words from the data random access memory 122 and the weight random access memory 124 to perform the multiply-accumulate instruction at address 1 of Fig. 4; however, the data random access memory 122, the weight random access memory 124, and the neural processing units 126 are pipelined, so that once the first multiply-accumulate operation has begun (as shown during clock cycle 1 of Fig. 5), the subsequent multiply-accumulate operations (as shown during clock cycles 2 through 512 of Fig. 5) begin in successive clock cycles. For a preferred embodiment, the neural processing units 126 may stall briefly in response to an access of the data random access memory 122 and/or the weight random access memory 124 by an architectural instruction, such as an MTNN or MFNN instruction (described below with respect to Figures 14 and 15), or by a microinstruction into which an architectural instruction is translated.
At clock cycle 513, the activation function unit 212 of each of the 512 neural processing units 126 performs the activation function instruction at address 3 of Fig. 4. Finally, at clock cycle 514, each of the 512 neural processing units 126 performs the write activation function unit output instruction at address 4 of Fig. 4 by writing back its result 133 to its corresponding word of row 16 of the data random access memory 122; that is, the result 133 of neural processing unit 0 is written to word 0 of the data random access memory 122, the result 133 of neural processing unit 1 is written to word 1 of the data random access memory 122, and so forth, through the result 133 of neural processing unit 511, which is written to word 511 of the data random access memory 122. A block diagram corresponding to the operation of Fig. 5 described above is shown in Fig. 6A.
Fig. 6 A are that the neutral net unit 121 for showing Fig. 1 performs the block schematic diagram of the program of Fig. 4.This neutral net list
First 121 data random access memories 122 for including 512 neural processing units 126, receiving address input 123, with reception ground
The weight random access memory 124 of location input 125.When the time-frequency cycle 0, this 512 neural processing units 126 can be held
Row initialization directive.This running does not show in figure.As shown in FIG., when the time-frequency cycle 1,512 16 of row 17
The data literal of position can read from data random access memory 122 and provide to this 512 neural processing units 126.When
During the frequency cycle 1 to 512, the weight word of 512 16 of row 0 to row 511 can respectively from weight random access memory
Device 122 reads and provides to this 512 neural processing units 126.When the time-frequency cycle 1, this 512 neural processing units
126 can perform its corresponding multiply-accumulate computing to the data literal for loading with weight word.This running does not show in figure
Show.During the time-frequency cycle 2 to 512, the multitask buffer 208 of 512 neural processing units 126 can be such as same tool
The circulator for having 512 16 words is operated, and the number that previously will have been loaded by the row 17 of data random access memory 122
Corresponding data after turning to neighbouring neural processing unit 126, and these meetings of neural processing units 126 to rotation according to word
Word and the multiply-accumulate computing of corresponding weight word execution loaded by weight random access memory 124.In time-frequency week
When phase 513, this 512 run function units 212 can perform enabled instruction.This running does not show in figure.In time-frequency
When cycle 514, this 512 neural processing units 126 can by its 512 corresponding 16 write back data of result 133 with
Machine accesses the row 16 of memory 122.
As may be observed, the number of clock cycles required to generate the result words (neuron outputs) and write them back to the data random access memory 122 or the weight random access memory 124 is approximately the square root of the number of data inputs (connections) received by the current layer of the neural network. For example, if the current layer has 512 neurons each having 512 connections from the previous layer, the total number of connections is 256K, and the number of clock cycles required to generate the results for the current layer is slightly more than 512. Thus, the neural network unit 121 provides extremely high performance for neural network computations.
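The cycle-count claim above can be checked with simple arithmetic; the sketch below mirrors the Fig. 4 / Fig. 5 example, and the three-cycle overhead figure is an inference from the timing diagram (initialize, activation function, and write instructions, each effectively one cycle):

```python
import math

# Rough cycle count for one layer, per the example above: 512 neurons, each
# with 512 connections (256K total). The 512-NPU array needs roughly
# sqrt(256K) = 512 multiply-accumulate cycles, plus a few cycles of overhead.
neurons = 512
links_per_neuron = 512
total_links = neurons * links_per_neuron        # 262144 = 256K connections
mac_cycles = math.isqrt(total_links)            # 512 (cycles 1-512 of Fig. 5)
overhead = 1 + 1 + 1   # initialize, activation function, write output (Fig. 5)
print(total_links, mac_cycles, mac_cycles + overhead)  # 262144 512 515
```

The square-root relationship holds because the N processing units each perform N multiply-accumulates, one per cycle, so N*N connections are consumed in N cycles.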
Fig. 6 B are flow chart, and the processor 100 for showing Fig. 1 performs framework program, to be performed using neutral net unit 121
The running of the typical multiply-accumulate run function computing of the neuron of the hidden layer of artificial neural network is associated with, as by Fig. 4
Program performing running.The example of Fig. 6 B suppose there is four hidden layers and (be shown in the variable NUM_ of initialization step 602
LAYERS), each hidden layer has 512 neurons, and each neuron links 512 whole neurons of preceding layer and (passes through
The program of Fig. 4).However, it is desirable to be understood by, the selection of these layers and the quantity of neuron to illustrate the invention, neutral net
Unit 121 has varying number nerve when the embodiment that similar calculating can be applied to varying number hidden layer, in each layer
The embodiment of unit, or the embodiment that neuron is not all linked.In one embodiment, for non-existent god in this layer
The weighted value that Jing is first or non-existent neuron links can be set to zero.For a preferred embodiment, framework program meeting
First group of weight is write into weight random access memory 124 and starts neutral net unit 121, when neutral net unit 121
When being carrying out the calculating for being associated with ground floor, second group of weight can be write weight random access memory by this framework program
124, once thus, neutral net unit 121 completes the calculating of the first hidden layer, neutral net unit 121 can just start
Two layers of calculating.Thus, framework program can travel to and fro between two regions of weight random access memory 124, to guarantee nerve net
Network unit 121 can be fully utilized.This flow process starts from step 602.
At step 602, the processor 100 executing the architectural program writes the input values for the current hidden layer of neurons to the data random access memory 122, i.e., to row 17 of the data random access memory 122, as described with respect to Fig. 6A. The values may already reside in row 17 of the data random access memory 122 as results 133 of the operation of the neural network unit 121 on the previous layer (e.g., convolution, pooling, or an input layer). The architectural program also initializes a variable N to the value 1. The variable N denotes the current layer of the hidden layers being processed by the neural network unit 121. Additionally, the architectural program initializes a variable NUM_LAYERS to the value 4, since there are four hidden layers in this example. Flow proceeds to step 604.
At step 604, the processor 100 writes the weight words for layer 1 to the weight random access memory 124, e.g., to rows 0 through 511, as shown in Fig. 6A. Flow proceeds to step 606.
At step 606, the processor 100 writes a multiply-accumulate-activation-function program (e.g., the program of Fig. 4) to the program memory 129 of the neural network unit 121, using MTNN instructions 1400 whose function 1432 specifies writing to the program memory 129. The processor 100 then starts the neural network unit program using an MTNN instruction 1400 whose function 1432 specifies starting execution of the program. Flow proceeds to step 608.
At decision step 608, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, flow proceeds to step 612; otherwise, flow proceeds to step 614.
At step 612, the processor 100 writes the weight words for layer N+1 to the weight random access memory 124, e.g., to rows 512 through 1023. Thus, advantageously, the architectural program writes the next layer's weight words to the weight random access memory 124 while the neural network unit 121 is performing the hidden-layer computations of the current layer, so that the neural network unit 121 can begin performing the hidden-layer computations for the next layer immediately upon completing the computations of the current layer, i.e., upon writing to the data random access memory 122. Flow proceeds to step 614.
At step 614, the processor 100 determines whether the currently running neural network unit program (started at step 606 in the case of layer 1, and at step 618 in the case of layers 2 through 4) has completed execution. For a preferred embodiment, the processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of the neural network unit 121. In an alternative embodiment, the neural network unit 121 generates an interrupt to indicate that it has completed the multiply-accumulate-activation-function-layer program. Flow proceeds to decision step 616.
At decision step 616, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, flow proceeds to step 618; otherwise, flow proceeds to step 622.
At step 618, the processor 100 updates the multiply-accumulate-activation-function program so that it can perform the hidden-layer computations for layer N+1. More specifically, the processor 100 updates the data random access memory 122 row value of the multiply-accumulate instruction at address 1 of Fig. 4 to the row of the data random access memory 122 to which the previous layer wrote its results (e.g., to row 16) and also updates the output row (e.g., to row 15). The processor 100 then starts the updated neural network unit program. Alternatively, in another embodiment, the program of Fig. 4 specifies the same row in the output instruction at address 4 as the row read by the multiply-accumulate instruction at address 1 (i.e., the row read from the data random access memory 122). In that embodiment, the current row of input data words is overwritten (which is acceptable as long as the row of data words is not needed for some other purpose, because the row of data words has already been read into the multiplexed registers 208 and is being rotated among the neural processing units 126 via the N-word rotator). In that case, the neural network unit program need not be updated at step 618, but only restarted. Flow proceeds to step 622.
At step 622, the processor 100 reads the results of the neural network unit program for layer N from the data random access memory 122. However, if the results are only to be used by the next layer, the architectural program need not read them from the data random access memory 122; instead, they may remain in the data random access memory 122 for the next hidden-layer computation. Flow proceeds to decision step 624.
At decision step 624, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, flow proceeds to step 626; otherwise, the flow ends.
At step 626, the architectural program increments N by one. Flow returns to decision step 608.
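The flow of Fig. 6B can be sketched as a host-side loop. This is only a behavioral sketch: the callback names (write_weight_ram, start_nnu, and so on) are hypothetical stand-ins for the MTNN/MFNN architectural instructions and status-register polling described above:

```python
NUM_LAYERS = 4
PING, PONG = range(0, 512), range(512, 1024)  # two weight-RAM regions

def run_hidden_layers(write_weight_ram, start_nnu, nnu_running, read_results):
    """Double-buffer the weights so the NNU is never idle between layers.

    The four callbacks stand in for MTNN/MFNN architectural instructions
    (hypothetical names, for illustration only).
    """
    write_weight_ram(PING, layer=1)            # step 604
    start_nnu(layer=1)                         # step 606
    for n in range(1, NUM_LAYERS + 1):         # steps 608..626
        if n < NUM_LAYERS:                     # step 612: preload the next
            region = PONG if n % 2 else PING   # layer into the idle region
            write_weight_ram(region, layer=n + 1)
        while nnu_running():                   # step 614: poll status register
            pass
        if n < NUM_LAYERS:
            start_nnu(layer=n + 1)             # step 618: restart for layer N+1
    return read_results(layer=NUM_LAYERS)      # step 622
```

The key design point the sketch captures is the overlap at step 612: the write of layer N+1's weights proceeds while layer N is still computing, which is what keeps the neural network unit 121 fully utilized.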
As may be observed from the example of Fig. 6B, approximately every 512 clock cycles the neural processing units 126 perform one read from and one write to the data random access memory 122 (by virtue of the operation of the neural network unit program of Fig. 4). Additionally, the neural processing units 126 read the weight random access memory 124 approximately every clock cycle to read a row of weight words. Thus, the entire bandwidth of the weight random access memory 124 is consumed by the hybrid manner in which the neural network unit 121 performs the hidden-layer computations. Furthermore, assuming an embodiment that includes a write and read buffer, such as the buffer 1704 of Fig. 17, concurrently with the reads by the neural processing units 126, the processor 100 writes the weight random access memory 124, such that the buffer 1704 performs a write to the weight random access memory 124 approximately every 16 clock cycles to write the weight words. Thus, in embodiments in which the weight random access memory 124 is single-ported (as described with respect to Fig. 17), approximately every 16 clock cycles the neural processing units 126 must stall their reads of the weight random access memory 124 to enable the buffer 1704 to write the weight random access memory 124. However, in embodiments in which the weight random access memory 124 is dual-ported, the neural processing units 126 need not stall.
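The cost of the single-ported arrangement can be estimated with a short calculation. This is a rough model only; it assumes exactly one stall cycle per sixteen neural-processing-unit read cycles, matching the rate at which the buffer 1704 drains:

```python
# Single-ported weight RAM: the NPUs read a row nearly every cycle, but
# approximately every 16th cycle is yielded to buffer 1704 so that the
# processor's weight writes can drain into the RAM.
cycles_per_layer = 512   # row reads needed for the example layer
stall_every = 16         # one buffer-1704 write per 16 cycles
stalls = cycles_per_layer // stall_every   # cycles given up to the buffer
print(stalls, stalls / cycles_per_layer)   # -> 32 0.0625
```

Under this model a single-ported weight random access memory 124 costs on the order of 6% of the read bandwidth, which is the overhead a dual-ported embodiment avoids.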
Fig. 7 is a block diagram illustrating another embodiment of a neural processing unit 126 of Fig. 1. The neural processing unit 126 of Fig. 7 is similar to the neural processing unit 126 of Fig. 2. However, the neural processing unit 126 of Fig. 7 additionally includes a second dual-input multiplexed register 705. The multiplexed register 705 selects one of its inputs 206 or 711 to store in its register and to provide on its output 203 on a subsequent clock cycle. Input 206 receives the weight word from the weight random access memory 124. The other input 711 receives the output 203 of the second multiplexed register 705 of the adjacent neural processing unit 126. For a preferred embodiment, the input 711 of neural processing unit J receives the output 203 of the multiplexed register 705 of neural processing unit 126 instance J-1, and the output 203 of neural processing unit J is provided to the input 711 of the multiplexed register 705 of neural processing unit 126 instance J+1. In this manner, the multiplexed registers 705 of the N neural processing units 126 collectively operate as an N-word rotator, similarly to the manner described above with respect to Fig. 3, but for the weight words rather than the data words. A control input 213 controls which of the two inputs the multiplexed register 705 selects to store in its register and to subsequently provide on its output 203.
Employing the multiplexed registers 208 and/or 705 (as well as the multiplexed registers of other embodiments, such as those of Figs. 18 and 23) to effectively form a large rotator that rotates a row of data/weights received from the data random access memory 122 and/or the weight random access memory 124 advantageously avoids the need for an extremely large multiplexer between the data random access memory 122 and/or the weight random access memory 124 and each neural processing unit in order to provide the needed data/weight words to the appropriate neural processing units.
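The collective rotation performed by the chained multiplexed registers can be modeled in a few lines. This is a behavioral sketch only; in hardware, all N registers shift in parallel in a single clock:

```python
def step(mux_regs, load_row=None):
    """One clock of the N-word rotator formed by the chained mux-regs.

    If load_row is given, every mux-reg selects its memory input (a fresh
    RAM row); otherwise each selects its neighbor's output 203, so the
    whole row rotates by one position.
    """
    if load_row is not None:
        return list(load_row)
    n = len(mux_regs)
    # NPU J's input 711 receives NPU J-1's output 203.
    return [mux_regs[(j - 1) % n] for j in range(n)]

row = step([], load_row=[10, 11, 12, 13])  # load a 4-word RAM row
row = step(row)                            # rotate once
print(row)                                 # -> [13, 10, 11, 12]
```

After N rotation steps each word has visited every neural processing unit once, which is exactly what lets every neuron see every input of the row without a RAM-to-NPU crossbar.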
Writing back accumulator values in addition to activation function results
For some applications, it is useful for the processor 100 to receive back (e.g., to the media registers 118, via the MFNN instruction of Fig. 15) the raw accumulator 202 value 217, upon which instructions executing on other execution units 112 can perform computations. For example, in one embodiment, in order to reduce the complexity of the activation function unit 212, it is not configured to perform a softmax activation function. Instead, the neural network unit 121 may output the raw accumulator 202 value 217, or a subset thereof, to the data random access memory 122 or the weight random access memory 124, from which the architectural program subsequently reads it via the data random access memory 122 or the weight random access memory 124 and performs computations on the raw values. However, use of the raw accumulator 202 value 217 is not limited to the performance of softmax; other uses are contemplated by the present invention.
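As an illustration of the softmax case, the architectural program might compute the function on the host from the raw accumulator values read back via MFNN instructions. This is a host-side sketch; the fixed-point scaling (frac_bits) is an assumption of the example, and the described neural network unit never performs this function itself:

```python
import math

def host_softmax(raw_accumulators, frac_bits=16):
    """Softmax over raw fixed-point accumulator values read back via MFNN.

    frac_bits is a hypothetical binary-point position used to convert the
    raw integer accumulator values to reals before exponentiation.
    """
    xs = [a / (1 << frac_bits) for a in raw_accumulators]
    m = max(xs)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

probs = host_softmax([1 << 16, 1 << 16, 1 << 16])  # three equal accumulators
print([round(p, 4) for p in probs])                # -> [0.3333, 0.3333, 0.3333]
```

Keeping the exponentials and the division on the host is precisely the trade the text describes: the activation function unit 212 stays simple, at the cost of extra architectural-program work on the raw values.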
Fig. 8 is a block diagram illustrating yet another embodiment of a neural processing unit 126 of Fig. 1. The neural processing unit 126 of Fig. 8 is similar to the neural processing unit 126 of Fig. 2. However, the neural processing unit 126 of Fig. 8 includes a multiplexer 802 in the activation function unit 212, and the activation function unit 212 has a control input 803. The width (in bits) of the accumulator 202 is greater than the width of a data word. The multiplexer 802 has multiple inputs that receive data-word-width portions of the accumulator 202 output 217. In one embodiment, the width of the accumulator 202 is 41 bits and the neural processing unit 126 is operable to output a 16-bit result word 133; in that case, for example, the multiplexer 802 (or the multiplexer 3032 and/or multiplexer 3037 of Fig. 30) includes three inputs that receive bits [15:0], bits [31:16], and bits [47:32], respectively, of the accumulator 202 output 217. For a preferred embodiment, the output bits not provided by the accumulator 202 (e.g., bits [47:41]) are forced to zero.
The sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the words (e.g., 16 bits) of the accumulator 202 in response to a write-accumulator instruction, such as the write-accumulator instructions at addresses 3 through 5 of Fig. 9 described below. For a preferred embodiment, the multiplexer 802 also includes one or more inputs that receive the outputs of activation function circuits (e.g., the elements 3022, 3024, 3026, 3018, 3014, and 3016 of Fig. 30) that generate outputs the width of a data word. The sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the activation function circuit outputs, rather than one of the words of the accumulator 202, in response to an instruction such as the activation function unit output instruction at address 4 of Fig. 4.
Fig. 9 is a table showing a program stored in the program memory 129 of, and executed by, the neural network unit 121 of Fig. 1. The example program of Fig. 9 is similar to the program of Fig. 4. Specifically, the instructions at addresses 0 through 2 are identical. However, the instructions at addresses 3 and 4 of Fig. 4 are replaced in Fig. 9 by write-accumulator instructions that instruct the 512 neural processing units 126 to write back their accumulator 202 outputs 217 as results 133 to three rows of the data random access memory 122, which in this example are rows 16 through 18. That is, the write-accumulator instruction instructs the sequencer 128 to output a data random access memory address 123 value of 16 and a write command in a first clock cycle, a data random access memory address 123 value of 17 and a write command in a second clock cycle, and a data random access memory address 123 value of 18 and a write command in a third clock cycle. For a preferred embodiment, the execution of the write-accumulator instruction may be overlapped with the execution of other instructions, such that the write-accumulator instruction effectively executes in three clock cycles, in each of which one row of the data random access memory 122 is written. In one embodiment, the user specifies values of the activation function 2934 and output command 2956 fields of the control register 127 (of Fig. 29A) to write the desired portions of the accumulator 202 to the data random access memory 122 or the weight random access memory 124. Alternatively, rather than writing back the entire contents of the accumulator 202, the write-accumulator instruction may optionally write back a subset of the accumulator 202. In one embodiment, a canonical form of the accumulator 202 may be written back, as described in more detail below with respect to Figs. 29 through 31.
Figure 10 is a timing diagram illustrating the execution of the program of Fig. 9 by the neural network unit 121. The timing diagram of Figure 10 is similar to the timing diagram of Fig. 5, and clock cycles 0 through 512 are the same. However, at each of clock cycles 513 through 515, the activation function unit 212 of each of the 512 neural processing units 126 executes one of the write-accumulator instructions at addresses 3 through 5 of Fig. 9. Specifically, at clock cycle 513, each of the 512 neural processing units 126 writes back bits [15:0] of its accumulator 202 output 217 as its result 133 to the corresponding word of row 16 of the data random access memory 122; at clock cycle 514, each of the 512 neural processing units 126 writes back bits [31:16] of its accumulator 202 output 217 as its result 133 to the corresponding word of row 17 of the data random access memory 122; and at clock cycle 515, each of the 512 neural processing units 126 writes back bits [40:32] of its accumulator 202 output 217 as its result 133 to the corresponding word of row 18 of the data random access memory 122. For a preferred embodiment, bits [47:41] are forced to zero.
Shared activation function units
Figure 11 is a block diagram illustrating an embodiment of the neural network unit 121 of Fig. 1. In the embodiment of Figure 11, a neuron is split into two portions, the activation function unit portion and the arithmetic logic unit portion (the latter also including the shift register portion), and each activation function unit portion is shared by multiple arithmetic logic unit portions. In Figure 11, the arithmetic logic unit portions are referred to as neural processing units 126, and the shared activation function unit portions are referred to as activation function units 1112. This contrasts with the embodiment of Fig. 2, for example, in which each neuron includes its own activation function unit 212. Hence, in one example of the Figure 11 embodiment, a neural processing unit 126 (arithmetic logic unit portion) may include the accumulator 202, arithmetic logic unit 204, multiplexed register 208, and register 205 of Fig. 2, but not the activation function unit 212. In the embodiment of Figure 11, the neural network unit 121 includes 512 neural processing units 126, although the present invention is not limited thereto. In the example of Figure 11, the 512 neural processing units 126 are grouped into 64 groups, denoted groups 0 through 63 in Figure 11, of eight neural processing units 126 each.
The neural network unit 121 also includes a row buffer 1104 and a plurality of shared activation function units 1112 coupled between the neural processing units 126 and the row buffer 1104. The width (in bits) of the row buffer 1104 is the same as a row of the data random access memory 122 or the weight random access memory 124, e.g., 512 words. There is one activation function unit 1112 per group of neural processing units 126, i.e., each activation function unit 1112 corresponds to a group of neural processing units 126; thus, in the embodiment of Figure 11 there are 64 activation function units 1112 corresponding to the 64 groups of neural processing units 126. Each of the eight neural processing units 126 of a group shares the activation function unit 1112 corresponding to the group. Other embodiments with different numbers of activation function units and different numbers of neural processing units per group are contemplated. For example, embodiments are contemplated in which two, four, or sixteen neural processing units 126 in each group share an activation function unit 1112.
Sharing the activation function units 1112 advantageously reduces the size of the neural network unit 121, at the cost of some performance. That is, depending upon the sharing ratio, additional clock cycles may be required to generate the results 133 for the entire array of neural processing units 126; for example, seven additional clock cycles are required in the case of an 8:1 sharing ratio, as shown in Figure 12 below. However, generally speaking, the additional number of clock cycles (e.g., 7) is relatively small compared to the number of clock cycles required to generate the accumulated sums (e.g., 512 clock cycles for a layer in which each neuron has 512 connections). Hence, the relatively small performance impact of sharing the activation function units (e.g., on the order of one percent additional computation time) may be a worthwhile cost for the reduction in size of the neural network unit 121.
In one embodiment, each of the neural processing units 126 includes an activation function unit 212 that performs relatively simple activation functions; these simple activation function units 212 have a small size and can therefore be included in each neural processing unit 126. In contrast, the shared complex activation function units 1112 perform relatively complex activation functions and are significantly larger than the simple activation function units 212. In such an embodiment, the additional clock cycles are required only when a complex activation function that must be performed by a shared complex activation function unit 1112 is specified, and are avoided when an activation function that the simple activation function unit 212 can perform is specified.
Figures 12 and 13 are timing diagrams illustrating the execution of the program of Fig. 4 by the neural network unit 121 of Figure 11. The timing diagram of Figure 12 is similar to the timing diagram of Fig. 5, and clock cycles 0 through 512 of the two are the same. However, the operation at clock cycle 513 differs because the neural processing units 126 of Figure 11 share the activation function units 1112; that is, the neural processing units 126 of a group share the activation function unit 1112 associated with the group, and Figure 11 shows this sharing arrangement.
Each column of the timing diagram of Figure 13 corresponds to a successive clock cycle indicated in the first row. The other rows each correspond to and indicate the operation of a different one of the 64 activation function units 1112. Only the operations of activation function units 0, 1, and 63 are shown, for simplicity of illustration. The clock cycles of Figure 13 correspond to the clock cycles of Figure 12, but show the sharing of the activation function units 1112 by the neural processing units 126 in a different manner. As shown in Figure 13, at clock cycles 0 through 512, each of the 64 activation function units 1112 is inactive while the neural processing units 126 execute the initialize-neural-processing-unit instruction, the multiply-accumulate instruction, and the multiply-accumulate-rotate instruction.
As shown in Figures 12 and 13, at clock cycle 513, activation function unit 0 (the activation function unit 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of neural processing unit 0, which is the first neural processing unit 126 in group 0, and the output of the activation function unit 1112 will be stored to word 0 of the row buffer 1104. Also at clock cycle 513, each of the activation function units 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the first neural processing unit 126 in its corresponding group of neural processing units 126. Thus, as shown in Figure 13, at clock cycle 513, activation function unit 0 begins to perform the specified activation function on the accumulator 202 of neural processing unit 0 to generate the result that will be stored to word 0 of the row buffer 1104; activation function unit 1 begins to perform the specified activation function on the accumulator 202 of neural processing unit 8 to generate the result that will be stored to word 8 of the row buffer 1104; and so forth, to activation function unit 63, which begins to perform the specified activation function on the accumulator 202 of neural processing unit 504 to generate the result that will be stored to word 504 of the row buffer 1104.
At clock cycle 514, activation function unit 0 (the activation function unit 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of neural processing unit 1, which is the second neural processing unit 126 in group 0, and the output of the activation function unit 1112 will be stored to word 1 of the row buffer 1104. Also at clock cycle 514, each of the activation function units 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the second neural processing unit 126 in its corresponding group. Thus, as shown in Figure 13, at clock cycle 514, activation function unit 0 begins to perform the specified activation function on the accumulator 202 of neural processing unit 1 to generate the result that will be stored to word 1 of the row buffer 1104; activation function unit 1 begins to perform the specified activation function on the accumulator 202 of neural processing unit 9 to generate the result that will be stored to word 9 of the row buffer 1104; and so forth, to activation function unit 63, which begins to perform the specified activation function on the accumulator 202 of neural processing unit 505 to generate the result that will be stored to word 505 of the row buffer 1104. This pattern continues until, at clock cycle 520, activation function unit 0 (the activation function unit 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of neural processing unit 7, which is the eighth (and last) neural processing unit 126 in group 0, and the output of the activation function unit 1112 will be stored to word 7 of the row buffer 1104. Also at clock cycle 520, each of the activation function units 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the eighth neural processing unit 126 in its corresponding group. Thus, as shown in Figure 13, at clock cycle 520, activation function unit 0 begins to perform the specified activation function on the accumulator 202 of neural processing unit 7 to generate the result that will be stored to word 7 of the row buffer 1104; activation function unit 1 begins to perform the specified activation function on the accumulator 202 of neural processing unit 15 to generate the result that will be stored to word 15 of the row buffer 1104; and so forth, to activation function unit 63, which begins to perform the specified activation function on the accumulator 202 of neural processing unit 511 to generate the result that will be stored to word 511 of the row buffer 1104.
At clock cycle 521, once all 512 results of the 512 neural processing units 126 have been generated and written to the row buffer 1104, the row buffer 1104 begins to write its contents to the data random access memory 122 or the weight random access memory 124. In this manner, the activation function unit 1112 of each group of neural processing units 126 performs a portion of the activation function instruction at address 3 of Fig. 4.
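The round-robin schedule of Figures 12 and 13 can be modeled as follows. This is a behavioral sketch only; the identity mapping stands in for whatever activation function is specified:

```python
def shared_afu_schedule(accumulators, groups=64, group_size=8, start_cycle=513):
    """Map each NPU's accumulator through its group's shared AFU.

    At cycle start_cycle + k, AFU g processes NPU g*group_size + k, so an
    8:1 sharing ratio adds 7 cycles beyond the unshared case of Fig. 5.
    """
    row_buffer = [None] * (groups * group_size)
    schedule = []  # (cycle, afu, npu) triples
    for k in range(group_size):                  # cycles 513..520
        for g in range(groups):                  # all 64 AFUs work in parallel
            npu = g * group_size + k
            row_buffer[npu] = accumulators[npu]  # identity "activation"
            schedule.append((start_cycle + k, g, npu))
    return row_buffer, schedule

accs = list(range(512))
row, sched = shared_afu_schedule(accs)
print(sched[0], sched[-1])  # -> (513, 0, 0) (520, 63, 511)
```

The sketch makes the trade-off concrete: 64 activation function units finish 512 results in 8 cycles (513 through 520), after which the row buffer 1104 holds one complete 512-word row to write back at cycle 521.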
An embodiment in which the activation function units 1112 are shared among groups of arithmetic logic units 204, as shown in Figure 11, is particularly advantageous in conjunction with integer arithmetic logic units 204, as described in more detail below with respect to Figures 29A through 33.
MTNN and MFNN architectural instructions
Figure 14 is a block diagram illustrating a move-to-neural-network (MTNN) architectural instruction 1400 and its operation with respect to portions of the neural network unit 121 of Fig. 1. The MTNN instruction 1400 includes an opcode field 1402, a src1 field 1404, a src2 field 1406, a gpr field 1408, and an immediate field 1412. The MTNN instruction is an architectural instruction, i.e., it is included in the instruction set architecture of the processor 100. For a preferred embodiment, the instruction set architecture associates a predetermined value of the opcode field 1402 with the MTNN instruction 1400 to distinguish it from the other instructions of the instruction set architecture. The MTNN instruction 1400 opcode 1402 may or may not include a prefix, such as is common, for example, in the x86 architecture.
The immediate field 1412 provides a value that specifies a function 1432 to control logic 1434 of the neural network unit 121. For a preferred embodiment, the function 1432 is provided as an immediate operand of a microinstruction 105 of Fig. 1. The functions 1432 that may be performed by the neural network unit 121 include, but are not limited to, writing to the data random access memory 122, writing to the weight random access memory 124, writing to the program memory 129, writing to the control register 127, starting execution of a program in the program memory 129, pausing execution of a program in the program memory 129, requesting notification (e.g., an interrupt) of completion of the execution of a program in the program memory 129, and resetting the neural network unit 121. For a preferred embodiment, the neural network unit instruction set includes an instruction whose result indicates that the neural network unit program is complete. Alternatively, the neural network unit instruction set includes an explicit generate-interrupt instruction. For a preferred embodiment, resetting the neural network unit 121 effectively forces the neural network unit 121 back to a reset state (e.g., internal state machines are cleared and set to an idle state), except that the contents of the data random access memory 122, the weight random access memory 124, and the program memory 129 remain intact. Additionally, internal registers such as the accumulators 202 are not affected by the reset function and must be explicitly cleared, e.g., by the initialize-neural-processing-unit instruction at address 0 of Fig. 4. In one embodiment, the function 1432 may include a direct-execution function in which the first source register contains a micro-operation (e.g., the micro-operation 3418 of Figure 34). The direct-execution function instructs the neural network unit 121 to directly execute the specified micro-operation. In this manner, an architectural program may directly control the neural network unit 121 to perform operations, rather than writing instructions to the program memory 129 and subsequently instructing the neural network unit 121 to execute the instructions in the program memory 129 through an MTNN instruction 1400 (or an MFNN instruction 1500 of Figure 15). Figure 14 illustrates an example of the function of writing to the data random access memory 122.
The gpr field 1408 specifies one of the general-purpose registers of the general-purpose register file 116. In one embodiment, each general-purpose register is 64 bits. The general-purpose register file 116 provides the value from the selected general-purpose register to the neural network unit 121, as shown, which uses the value as an address 1422. The address 1422 selects a row of the memory specified in the function 1432. In the case of the data random access memory 122 or the weight random access memory 124, the address 1422 additionally selects a data block within the selected row that is twice the size in bits of a media register (e.g., 512 bits). For a preferred embodiment, the location is on a 512-bit boundary. In one embodiment, a multiplexer selects either the address 1422 (or the address 1422 in the case of an MFNN instruction 1500 described below) or the address 123/125/131 from the sequencer 128 for provision to the data random access memory 122/weight random access memory 124/program memory 129. In one embodiment, the data random access memory 122 is dual-ported, enabling the neural processing units 126 to read/write the data random access memory 122 concurrently with the media registers 118 reading/writing the data random access memory 122. In one embodiment, the weight random access memory 124 is also dual-ported for a similar purpose.
The src1 field 1404 and the src2 field 1406 each specify a media register of the media register file 118. In one embodiment, each media register 118 is 256 bits wide. The media register file 118 provides the concatenated data (e.g., 512 bits) from the selected media registers to the data random access memory 122 (or the weight random access memory 124 or the program memory 129) for writing to the selected row 1428 specified by the address 1422 and to the location within the selected row 1428 specified by the address 1422, as shown. Advantageously, by executing a series of MTNN instructions 1400 (and MFNN instructions 1500 described below), an architectural program executing on the processor 100 can populate rows of the data random access memory 122 and rows of the weight random access memory 124 and write a program to the program memory 129, such as the programs described herein (e.g., the programs of Figs. 4 and 9), to cause the neural network unit 121 to perform operations on the data and weights at extremely high speeds in order to accomplish the artificial neural network. In one embodiment, the architectural program directly controls the neural network unit 121 rather than writing a program into the program memory 129.
In one embodiment, rather than specifying two source registers (as in fields 1404 and 1406), the MTNN instruction 1400 specifies a start source register and a number of source registers, Q. This form of the MTNN instruction 1400 instructs the processor 100 to write to the neural network unit 121 the media register 118 specified as the start source register as well as the next Q-1 sequential media registers 118, i.e., to write them to the specified data random access memory 122 or weight random access memory 124. Preferably, the instruction translator 104 translates the MTNN instruction 1400 into as many microinstructions as are needed to write all Q of the specified media registers 118. For example, in one embodiment, when the MTNN instruction 1400 specifies register MR4 as the start source register and Q is 8, the instruction translator 104 translates the MTNN instruction 1400 into four microinstructions, of which the first writes registers MR4 and MR5, the second writes registers MR6 and MR7, the third writes registers MR8 and MR9, and the fourth writes registers MR10 and MR11. In an alternate embodiment in which the data path from the media registers 118 to the neural network unit 121 is 1024 bits rather than 512 bits, the instruction translator 104 translates the MTNN instruction 1400 into two microinstructions, of which the first writes registers MR4 through MR7 and the second writes registers MR8 through MR11. A similar embodiment is contemplated in which the MFNN instruction 1500 specifies a start destination register and a number of destination registers, enabling each MFNN instruction 1500 to read from a row of the data random access memory 122 or weight random access memory 124 a block of data larger than a single media register 118.
Figure 15 is a block diagram illustrating a move from neural network (MFNN) architectural instruction 1500 and its operation with respect to portions of the neural network unit 121 of Figure 1. The MFNN instruction 1500 includes an opcode field 1502, a dst field 1504, a gpr field 1508, and an immediate field 1512. The MFNN instruction is an architectural instruction, i.e., it is included in the instruction set architecture of the processor 100. Preferably, the instruction set architecture uses a predetermined value of the opcode field 1502 to distinguish the MFNN instruction 1500 from the other instructions in the instruction set architecture. The opcode 1502 of the MFNN instruction 1500 may or may not include a prefix, such as is common, for example, in the x86 architecture.
The immediate field 1512 provides a value that specifies a function 1532 to the control logic 1434 of the neural network unit 121. Preferably, the function 1532 is provided as an immediate operand of a microinstruction 105 of Figure 1. The functions 1532 that may be performed by the neural network unit 121 include, but are not limited to, reading the data random access memory 122, reading the weight random access memory 124, reading the program memory 129, and reading the status register 127. The example of Figure 15 shows the function 1532 of reading the data random access memory 122.
The gpr field 1508 specifies a general purpose register in the general purpose register file 116. The general purpose register file 116 provides the value of the selected general purpose register to the neural network unit 121, as shown, and the neural network unit 121 uses the value as an address 1522, operating in a manner similar to the address 1422 of Figure 14, to select a row of the memory specified in the function 1532. In the case of the data random access memory 122 or the weight random access memory 124, the address 1522 additionally selects a block within the selected row that is the size of a media register (e.g., 256 bits). Preferably, the block is located on a 256-bit boundary.
The dst field 1504 specifies a media register in the media register file 118. As shown, the media register file 118 receives the data (e.g., 256 bits) into the selected media register from the data random access memory 122 (or weight random access memory 124 or program memory 129), reading the data from the selected row 1528 specified by the address 1522 and from the location within the selected row 1528 specified by the address 1522.
Port configurations of the neural network unit internal random access memories
Figure 16 is a block diagram illustrating an embodiment of the data random access memory 122 of Figure 1. The data random access memory 122 includes a memory array 1606, a read port 1602, and a write port 1604. The memory array 1606 holds the data words and, preferably, is arranged as D rows of N words each, as described above. In one embodiment, the memory array 1606 comprises an array of 64 horizontally arranged static random access memory cells, each of which is 128 bits wide and 64 bits tall, to provide a 64KB data random access memory 122 that is 8192 bits wide and has 64 rows, and that occupies approximately 0.2 square millimeters of die area. However, the invention is not limited thereto.
Preferably, the read port 1602 is coupled, preferably in a multiplexed fashion, to the neural processing units 126 and to the media registers 118. (More precisely, the media registers 118 may be coupled to the read port via result buses that may also provide data to a reorder buffer and/or result forwarding buses to the other execution units 112.) The neural processing units 126 and the media registers 118 share the read port 1602 to read the data random access memory 122. Also preferably, the write port 1604 is coupled, also preferably in a multiplexed fashion, to the neural processing units 126 and to the media registers 118. The neural processing units 126 and the media registers 118 share the write port 1604 to write the data random access memory 122. Thus, the media registers 118 can be writing to the data random access memory 122 while the neural processing units 126 are reading from it, and the neural processing units 126 can be writing to the data random access memory 122 while the media registers 118 are reading from it. This may advantageously improve performance. For example, the neural processing units 126 can be reading the data random access memory 122 (e.g., to continue performing computations) while the media registers 118 are writing more data words to the data random access memory 122. For another example, the neural processing units 126 can be writing computation results to the data random access memory 122 while the media registers 118 are reading computation results from it. In one embodiment, the neural processing units 126 can write a row of computation results to the data random access memory 122 while also reading a row of data words from it. In one embodiment, the memory array 1606 is configured in banks. When the neural processing units 126 access the data random access memory 122, all the banks are activated to access an entire row of the memory array 1606; whereas when the media registers 118 access the data random access memory 122, only the specified banks are activated. In one embodiment, each bank is 128 bits wide and the media registers 118 are 256 bits wide; hence, for example, two banks are activated for each media register 118 access. In one embodiment, one of the ports 1602/1604 is a read/write port. In one embodiment, both of the ports 1602/1604 are read/write ports.
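The bank-activation behavior just described can be sketched as follows, under the stated figures of 128-bit banks in an 8192-bit row. The function name and the treatment of an access as a bit offset plus width are illustrative assumptions.

```python
BANK_BITS = 128   # width of one memory bank

def banks_for_access(bit_offset, access_bits):
    """Indices of the banks that must be activated to cover an access
    of access_bits starting at bit_offset within a row."""
    first = bit_offset // BANK_BITS
    last = (bit_offset + access_bits - 1) // BANK_BITS
    return list(range(first, last + 1))
```

A full-row neural-processing-unit access (8192 bits) activates all 64 banks, whereas a 256-bit media register access activates only two, which is the power-saving asymmetry the text describes.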
An advantage of providing the neural processing units 126 with the rotater capability described herein is that it helps reduce the number of rows of the memory array 1606 of the data random access memory 122, and therefore its size, relative to what would otherwise be needed to guarantee that the neural processing units 126 are highly utilized, which requires that the architectural program (via the media registers 118) be able to keep providing data to the data random access memory 122 and to fetch results from it while the neural processing units 126 are performing computations.
Internal random access memory buffer
Figure 17 is a block diagram illustrating an embodiment of the weight random access memory 124 of Figure 1 and a buffer 1704. The weight random access memory 124 includes a memory array 1706 and a port 1702. The memory array 1706 holds the weight words and, preferably, is arranged as W rows of N words each, as described above. In one embodiment, the memory array 1706 comprises an array of 128 horizontally arranged static random access memory cells, each of which is 64 bits wide and 2048 bits tall, to provide a 2MB weight random access memory 124 that is 8192 bits wide and has 2048 rows, and that occupies approximately 2.4 square millimeters of die area. However, the invention is not limited thereto.
Preferably, the port 1702 is coupled, preferably in a multiplexed fashion, to the neural processing units 126 and to the buffer 1704. The neural processing units 126 and the buffer 1704 read and write the weight random access memory 124 via the port 1702. The buffer 1704 is also coupled to the media registers 118 of Figure 1, such that the media registers 118 read and write the weight random access memory 124 through the buffer 1704. Advantageously, while the neural processing units 126 are reading or writing the weight random access memory 124, the media registers 118 can concurrently be writing to or reading from the buffer 1704 (although preferably the neural processing units 126 are stalled, if they are currently executing, to avoid their accessing the weight random access memory 124 while the buffer 1704 is accessing it). This may improve performance, particularly since the media register 118 reads and writes of the weight random access memory 124 are significantly smaller than the neural processing unit 126 reads and writes of the weight random access memory 124. For example, in one embodiment, the neural processing units 126 read/write 8192 bits (one row) at a time, whereas the media registers 118 are 256 bits wide and each MTNN instruction 1400 writes only two media registers 118, i.e., 512 bits. Thus, in the case where the architectural program executes sixteen MTNN instructions 1400 to fill the buffer 1704, a conflict between the neural processing units 126 and the architectural program for access to the weight random access memory 124 occurs less than approximately six percent of the time. In an alternate embodiment, the instruction translator 104 translates an MTNN instruction 1400 into two microinstructions 105, each of which writes a single media register 118 to the buffer 1704, in which case conflicts between the neural processing units 126 and the architectural program for access to the weight random access memory 124 are reduced even further.
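The six-percent figure follows from simple arithmetic: sixteen 512-bit MTNN writes fill one 8192-bit buffer, and only the final buffer-to-RAM transfer actually needs the weight RAM port. A worked check of that ratio:

```python
ROW_BITS = 8192          # one weight RAM 124 row, i.e., buffer 1704 capacity
MTNN_WRITE_BITS = 512    # two 256-bit media registers per MTNN instruction

# Architectural MTNN instructions needed to fill the buffer.
mtnn_per_row = ROW_BITS // MTNN_WRITE_BITS

# Only the single buffer-to-RAM transfer per filled row can conflict with
# the neural processing units for the weight RAM port, so at most one
# access in mtnn_per_row instruction slots contends.
conflict_fraction = 1 / mtnn_per_row   # 0.0625, i.e., about six percent
```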
In an embodiment that includes the buffer 1704, writing the weight random access memory 124 by an architectural program requires multiple MTNN instructions 1400. One or more MTNN instructions 1400 specify a function 1432 to write specified data blocks of the buffer 1704, and then an MTNN instruction 1400 specifies a function 1432 that instructs the neural network unit 121 to write the contents of the buffer 1704 to a selected row of the weight random access memory 124. The size of a single data block is twice the number of bits of a media register 118, and the data blocks are naturally aligned within the buffer 1704. In one embodiment, each MTNN instruction 1400 that specifies a function 1432 to write specified data blocks of the buffer 1704 includes a bitmask having a bit corresponding to each data block of the buffer 1704. The data from the two specified source registers 118 are written into each data block of the buffer 1704 whose corresponding bit in the bitmask is set. This may be useful for repeated data values within a row of the weight random access memory 124. For example, in order to zero out the buffer 1704 (and subsequently a row of the weight random access memory 124), the programmer may load zero into the source registers and set all the bits of the bitmask. Additionally, the bitmask enables the programmer to write only selected data blocks of the buffer 1704, leaving the other data blocks with their previous data.
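The bitmask-controlled buffer write just described can be sketched as follows, modeling the buffer as a list of block values. The function name and the single-value stand-in for the 512 bits of source data are illustrative assumptions.

```python
def write_buffer_blocks(buffer_blocks, bitmask, data):
    """Write `data` into every block of the buffer whose bitmask bit is
    set; blocks with a clear bit keep their previous contents."""
    return [data if (bitmask >> i) & 1 else old
            for i, old in enumerate(buffer_blocks)]
```

With all mask bits set and zero-valued source registers, one instruction zeroes the whole buffer; a sparse mask updates only selected blocks, as described above.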
In an embodiment that includes the buffer 1704, reading the weight random access memory 124 by an architectural program requires multiple MFNN instructions 1500. An initial MFNN instruction 1500 specifies a function 1532 to load the buffer 1704 from a specified row of the weight random access memory 124, and then one or more MFNN instructions 1500 specify a function 1532 to read a specified data block of the buffer 1704 into a destination register. The size of a single data block is the number of bits of a media register 118, and the data blocks are naturally aligned within the buffer 1704. The technical features of the invention are equally applicable to other embodiments, such as embodiments in which the weight random access memory 124 has multiple buffers 1704 so as to increase the number of accesses the architectural program can perform while the neural processing units 126 are executing, which may further reduce the conflicts between the neural processing units 126 and the architectural program for access to the weight random access memory 124 and increase the likelihood that the buffer 1704 accesses can be performed during clock cycles in which the neural processing units 126 do not need to access the weight random access memory 124.
Although Figure 16 describes a dual-ported data random access memory 122, the invention is not limited thereto. The technical features of the invention are equally applicable to other embodiments in which the weight random access memory 124 is also dual-ported. Furthermore, although Figure 17 describes a buffer used in conjunction with the weight random access memory 124, the invention is not limited thereto; the technical features of the invention are equally applicable to embodiments in which the data random access memory 122 has an associated buffer similar to the buffer 1704.
Dynamically configurable neural processing units
Figure 18 is a block diagram illustrating a dynamically configurable neural processing unit 126 of Figure 1. The neural processing unit 126 of Figure 18 is similar to the neural processing unit 126 of Figure 2. However, the neural processing unit 126 of Figure 18 is dynamically configurable to operate in one of two different configurations. In a first configuration, the neural processing unit 126 of Figure 18 operates similarly to the neural processing unit 126 of Figure 2. That is, in the first configuration, referred to herein as the "wide" configuration or "single" configuration, the arithmetic logic unit 204 of the neural processing unit 126 performs operations on a single wide data word and a single wide weight word (e.g., 16 bits) to generate a single wide result. In contrast, in the second configuration, referred to herein as the "narrow" configuration or "dual" configuration, the neural processing unit 126 performs operations on two narrow data words and two respective narrow weight words (e.g., 8 bits) to generate two respective narrow results. In one embodiment, the configuration (wide or narrow) of the neural processing unit 126 is established by the initialize neural processing unit instruction (e.g., at address 0 of Figure 20, described below). Alternatively, the configuration may be established by an MTNN instruction whose function 1432 specifies setting the configuration (wide or narrow) of the neural processing unit. Preferably, configuration registers are populated by the program memory 129 instruction or the MTNN instruction that determines the configuration (wide or narrow). For example, the configuration register outputs are provided to the arithmetic logic unit 204, to the activation function unit 212, and to the logic that generates the mux-reg control signal 213. Generally speaking, the elements of the neural processing unit 126 of Figure 18 that are identically numbered to elements of Figure 2 perform similar functions, and reference should be made thereto for an understanding of the embodiment of Figure 18. The embodiment of Figure 18, including its differences from Figure 2, will now be described.
The neural processing unit 126 of Figure 18 includes two registers 205A and 205B, two three-input multiplexed registers (mux-regs) 208A and 208B, an arithmetic logic unit 204, two accumulators 202A and 202B, and two activation function units 212A and 212B. Each of the registers 205A/205B is half the width (e.g., 8 bits) of the register 205 of Figure 2. Each of the registers 205A/205B receives a respective narrow weight word 206A/206B (e.g., 8 bits) from the weight random access memory 124 and provides its output 203A/203B on a subsequent clock cycle to the operand selection logic 1898 of the arithmetic logic unit 204. When the neural processing unit 126 is in the wide configuration, the registers 205A/205B effectively function together to receive a wide weight word 206A/206B (e.g., 16 bits) from the weight random access memory 124, similarly to the register 205 of the embodiment of Figure 2; whereas when the neural processing unit 126 is in the narrow configuration, the registers 205A/205B effectively function individually, each receiving a narrow weight word 206A/206B (e.g., 8 bits) from the weight random access memory 124, such that the neural processing unit 126 is effectively two separate narrow neural processing units. Nevertheless, the same output bits of the weight random access memory 124 are coupled to and provided to the registers 205A/205B regardless of the configuration of the neural processing unit 126. For example, the register 205A of neural processing unit 0 receives byte 0, the register 205B of neural processing unit 0 receives byte 1, the register 205A of neural processing unit 1 receives byte 2, the register 205B of neural processing unit 1 receives byte 3, and so forth, such that the register 205B of neural processing unit 511 receives byte 1023.
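The fixed byte wiring just enumerated follows a simple pattern, sketched below; the function name is an illustrative assumption.

```python
def weight_ram_byte_for(npu_index, half):
    """Byte of the 1024-byte weight RAM row output wired to register 205A
    ('A') or register 205B ('B') of the given neural processing unit."""
    return 2 * npu_index + (0 if half == 'A' else 1)
```

The wiring is configuration-independent: in the wide configuration the A/B byte pair is simply consumed together as one 16-bit word.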
Each of the mux-regs 208A/208B is half the width (e.g., 8 bits) of the register 208 of Figure 2. The mux-reg 208A selects one of its inputs 207A, 211A and 1811A to store in its register and then provide on its output 209A on a subsequent clock cycle, and the mux-reg 208B selects one of its inputs 207B, 211B and 1811B to store in its register and then provide on its output 209B on a subsequent clock cycle to the operand selection logic 1898. The input 207A receives a narrow data word (e.g., 8 bits) from the data random access memory 122, and the input 207B likewise receives a narrow data word from the data random access memory 122. When the neural processing unit 126 is in the wide configuration, the mux-regs 208A/208B effectively function together to receive a wide data word 207A/207B (e.g., 16 bits) from the data random access memory 122, similarly to the mux-reg 208 of the embodiment of Figure 2; whereas when the neural processing unit 126 is in the narrow configuration, the mux-regs 208A/208B effectively function individually, each receiving a narrow data word 207A/207B (e.g., 8 bits) from the data random access memory 122, such that the neural processing unit 126 is effectively two separate narrow neural processing units. Nevertheless, the same output bits of the data random access memory 122 are coupled to and provided to the mux-regs 208A/208B regardless of the configuration of the neural processing unit 126. For example, the mux-reg 208A of neural processing unit 0 receives byte 0, the mux-reg 208B of neural processing unit 0 receives byte 1, the mux-reg 208A of neural processing unit 1 receives byte 2, the mux-reg 208B of neural processing unit 1 receives byte 3, and so forth, such that the mux-reg 208B of neural processing unit 511 receives byte 1023.
The input 211A receives the output 209A of the mux-reg 208A of the adjacent neural processing unit 126, and the input 211B receives the output 209B of the mux-reg 208B of the adjacent neural processing unit 126. The input 1811A receives the output 209B of the mux-reg 208B of the adjacent neural processing unit 126, and the input 1811B receives the output 209A of the mux-reg 208A of the instant neural processing unit 126. The neural processing unit 126 shown in Figure 18 is one of the N neural processing units 126 of Figure 1 and is denoted neural processing unit J; that is, neural processing unit J is a representative instance of the N neural processing units. Preferably, the input 211A of the mux-reg 208A of neural processing unit J receives the output 209A of the mux-reg 208A of neural processing unit 126 instance J-1, the input 1811A of the mux-reg 208A of neural processing unit J receives the output 209B of the mux-reg 208B of neural processing unit 126 instance J-1, and the output 209A of the mux-reg 208A of neural processing unit J is provided both to the input 211A of the mux-reg 208A of neural processing unit 126 instance J+1 and to the input 1811B of the mux-reg 208B of neural processing unit 126 instance J; and the input 211B of the mux-reg 208B of neural processing unit J receives the output 209B of the mux-reg 208B of neural processing unit 126 instance J-1, the input 1811B of the mux-reg 208B of neural processing unit J receives the output 209A of the mux-reg 208A of neural processing unit 126 instance J, and the output 209B of the mux-reg 208B of neural processing unit J is provided both to the input 1811A of the mux-reg 208A of neural processing unit 126 instance J+1 and to the input 211B of the mux-reg 208B of neural processing unit 126 instance J+1.
The control input 213 controls each of the mux-regs 208A/208B to select one of its three inputs to store in its respective register for subsequent provision on its respective output 209A/209B. When the neural processing unit 126 is instructed to load a row from the data random access memory 122 (e.g., by the multiply-accumulate instruction at address 1 of Figure 20, described below), regardless of whether the neural processing unit 126 is in the wide or narrow configuration, the control input 213 controls each of the mux-regs 208A/208B to select its respective narrow data word 207A/207B from the corresponding narrow word (e.g., 8 bits) of the selected row of the data random access memory 122.
When the neural processing unit 126 receives an indication that it is to rotate the previously received data row values (e.g., by the multiply-accumulate rotate instruction at address 2 of Figure 20, described below), if the neural processing unit 126 is in the narrow configuration, the control input 213 controls each of the mux-regs 208A/208B to select its respective input 1811A/1811B. In that case, the mux-regs 208A/208B effectively function individually such that the neural processing unit 126 is effectively two separate narrow neural processing units. In this manner, the mux-regs 208A and 208B of the N neural processing units 126 collectively operate as a rotater of 2N narrow words, as described in more detail below with respect to Figure 19.
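One rotate step of that 2N-narrow-word arrangement can be simulated as follows, under the connection pattern stated above: each 208A register loads the 209B output of the preceding unit (input 1811A) and each 208B register loads the 209A output of its own unit (input 1811B). The lists and function name are modeling assumptions.

```python
def rotate_narrow(a, b):
    """One rotate step for N units in the narrow configuration.
    a[j]/b[j] model the 209A/209B outputs of unit j."""
    n = len(a)
    new_a = [b[(j - 1) % n] for j in range(n)]  # 1811A: 209B of unit j-1
    new_b = a[:]                                # 1811B: 209A of unit j
    return new_a, new_b
```

Reading the registers in the interleaved order A0, B0, A1, B1, ... shows that one step rotates the full sequence of 2N narrow words by one position.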
When the neural processing unit 126 receives an indication that it is to rotate the previously received data row values, if the neural processing unit 126 is in the wide configuration, the control input 213 controls each of the mux-regs 208A/208B to select its respective input 211A/211B. In that case, the mux-regs 208A/208B effectively function together as if the neural processing unit 126 were a single wide neural processing unit 126. In this manner, the mux-regs 208A and 208B of the N neural processing units 126 collectively operate as a rotater of N wide words, in a manner similar to that described with respect to Figure 3.
The arithmetic logic unit 204 includes the operand selection logic 1898, a wide multiplier 242A, a narrow multiplier 242B, a wide two-input multiplexer 1896A, a narrow two-input multiplexer 1896B, a wide adder 244A, and a narrow adder 244B. Effectively, the arithmetic logic unit 204 may be regarded as comprising the operand selection logic 1898, a wide arithmetic logic unit 204A (comprising the wide multiplier 242A, the wide multiplexer 1896A, and the wide adder 244A), and a narrow arithmetic logic unit 204B (comprising the narrow multiplier 242B, the narrow multiplexer 1896B, and the narrow adder 244B). Preferably, the wide multiplier 242A multiplies two wide words and is similar to the multiplier 242 of Figure 2, e.g., a 16-bit by 16-bit multiplier. The narrow multiplier 242B multiplies two narrow words, e.g., an 8-bit by 8-bit multiplier that generates a 16-bit result. When the neural processing unit 126 is in the narrow configuration, the wide multiplier 242A is effectively used, with the assistance of the operand selection logic 1898, as a narrow multiplier to multiply two narrow words, so that the neural processing unit 126 effectively functions as two narrow neural processing units. Preferably, the wide adder 244A adds the output of the wide multiplexer 1896A and the wide accumulator 202A output 217A to generate a sum 215A for the wide accumulator 202A, and operates similarly to the adder 244 of Figure 2. The narrow adder 244B adds the output of the narrow multiplexer 1896B and the narrow accumulator 202B output 217B to generate a sum 215B for the narrow accumulator 202B. In one embodiment, the narrow accumulator 202B is 28 bits wide, to avoid loss of precision in the accumulation of up to 1024 16-bit products. When the neural processing unit 126 is in the wide configuration, the narrow multiplier 242B, narrow accumulator 202B, and narrow activation function unit 212B are preferably inactive to reduce power consumption.
The operand selection logic 1898 selects operands from 209A, 209B, 203A, and 203B to provide to the other elements of the arithmetic logic unit 204, as described in more detail below. Preferably, the operand selection logic 1898 also performs other functions, such as performing sign extension of signed data words and weight words. For example, if the neural processing unit 126 is in the narrow configuration, the operand selection logic 1898 sign-extends a narrow data word and a narrow weight word to the width of a wide word before providing them to the wide multiplier 242A. Similarly, if the arithmetic logic unit 204 is instructed to pass through a narrow data/weight word (bypassing the wide multiplier 242A via the wide multiplexer 1896A), the operand selection logic 1898 sign-extends the narrow data word and weight word to the width of a wide word before providing it to the wide adder 244A. Preferably, logic to perform the sign extension function is also present in the arithmetic logic unit 204 of the neural processing unit 126 of Figure 2.
The wide multiplexer 1896A receives the output of the wide multiplier 242A and an operand from the operand selection logic 1898 and selects one of these inputs to provide to the wide adder 244A, and the narrow multiplexer 1896B receives the output of the narrow multiplier 242B and an operand from the operand selection logic 1898 and selects one of these inputs to provide to the narrow adder 244B.
The operand selection logic 1898 provides operands depending upon the configuration of the neural processing unit 126 and upon the arithmetic and/or logical operations to be performed by the arithmetic logic unit 204, which are determined by the function specified by the instruction being executed by the neural processing unit 126. For example, if the instruction instructs the arithmetic logic unit 204 to perform a multiply-accumulate operation and the neural processing unit 126 is in the wide configuration, the operand selection logic 1898 provides to one input of the wide multiplier 242A a wide word that is the concatenation of the outputs 209A and 209B and provides to the other input a wide word that is the concatenation of the outputs 203A and 203B, while the narrow multiplier 242B is inactive, such that the neural processing unit 126 operates as a single wide neural processing unit 126 similar to the neural processing unit 126 of Figure 2. Whereas, if the instruction instructs the arithmetic logic unit 204 to perform a multiply-accumulate operation and the neural processing unit 126 is in the narrow configuration, the operand selection logic 1898 provides to one input of the wide multiplier 242A an extended, or widened, version of the narrow data word 209A and provides to the other input an extended version of the narrow weight word 203A; additionally, the operand selection logic 1898 provides to one input of the narrow multiplier 242B the narrow data word 209B and provides to the other input the narrow weight word 203B. To extend, or widen, a narrow word as described above, if the narrow word is signed, the operand selection logic 1898 sign-extends the narrow word, whereas if the narrow word is unsigned, the operand selection logic 1898 pads the upper bits of the narrow word with zeroes.
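The widening operation just described (sign-extend if signed, zero-pad if unsigned) can be sketched as follows for 8-bit narrow words and 16-bit wide words; the function name is an illustrative assumption.

```python
def widen(narrow, signed, narrow_bits=8, wide_bits=16):
    """Extend a narrow word to a wide word: replicate the sign bit into
    the upper bits when signed, otherwise leave the upper bits zero."""
    if signed and narrow & (1 << (narrow_bits - 1)):
        # Fill the upper (wide_bits - narrow_bits) bits with ones.
        narrow |= ((1 << (wide_bits - narrow_bits)) - 1) << narrow_bits
    return narrow & ((1 << wide_bits) - 1)
```

For example, the byte 0x80 widens to 0xFF80 when treated as signed (-128) but to 0x0080 when treated as unsigned (128), which is why the selection logic must know the signedness of the operand.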
For another example, if the neural processing unit 126 is in the wide configuration and the instruction instructs the arithmetic logic unit 204 to perform a weight word accumulate operation, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides the concatenation of the outputs 203A and 203B to the wide multiplexer 1896A for provision to the wide adder 244A. Whereas, if the neural processing unit 126 is in the narrow configuration and the instruction instructs the arithmetic logic unit 204 to perform a weight word accumulate operation, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides an extended version of the output 203A to the wide multiplexer 1896A for provision to the wide adder 244A; additionally, the narrow multiplier 242B is bypassed and the operand selection logic 1898 provides an extended version of the output 203B to the narrow multiplexer 1896B for provision to the narrow adder 244B.

For another example, if the neural processing unit 126 is in the wide configuration and the instruction instructs the arithmetic logic unit 204 to perform a data word accumulate operation, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides the concatenation of the outputs 209A and 209B to the wide multiplexer 1896A for provision to the wide adder 244A. Whereas, if the neural processing unit 126 is in the narrow configuration and the instruction instructs the arithmetic logic unit 204 to perform a data word accumulate operation, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides an extended version of the output 209A to the wide multiplexer 1896A for provision to the wide adder 244A; additionally, the narrow multiplier 242B is bypassed and the operand selection logic 1898 provides an extended version of the output 209B to the narrow multiplexer 1896B for provision to the narrow adder 244B. The accumulation of weight/data words is useful for performing averaging operations, such as may be used in the pooling layers of certain artificial neural network applications, e.g., image processing.
For a preferred embodiment, the neural processing unit 126 also includes a second wide multiplexer (not shown) for bypassing the wide adder 244A, to facilitate loading the wide accumulator 202A with a wide data/weight word in the wide configuration or an extended narrow data/weight word in the narrow configuration, and a second narrow multiplexer (not shown) for bypassing the narrow adder 244B, to facilitate loading the narrow accumulator 202B with a narrow data/weight word in the narrow configuration. For a preferred embodiment, the arithmetic logic unit 204 also includes wide and narrow comparator/multiplexer combinations (not shown) that receive the respective accumulator value 217A/217B and the respective multiplexer 1896A/1896B output, and select the maximum of the accumulator value 217A/217B and a data/weight word 209A/209B/203A/203B, an operation used by the pooling layers of some artificial neural network applications, as described in more detail below, for example with respect to Figures 27 and 28. In addition, the operand selection logic 1898 can provide zero-valued operands (for addition with zero or for clearing the accumulator) and one-valued operands (for multiplication by one).
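The comparator/multiplexer combination just described amounts to a running maximum, which is what a max-pooling layer needs. A minimal sketch of that behavior (our own model; the function names are illustrative, not the patent's):

```python
def max_accumulate(acc, word):
    """One comparator/multiplexer step: keep the larger of the current
    accumulator value and the incoming data/weight word."""
    return word if word > acc else acc

def max_pool(words):
    """Running maximum over a stream of words, as a max-pooling layer
    would apply it to one pooling window."""
    acc = words[0]
    for w in words[1:]:
        acc = max_accumulate(acc, w)
    return acc
```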
The narrow activation function unit 212B receives the output 217B of the narrow accumulator 202B and performs an activation function on it to produce the narrow result 133B, and the wide activation function unit 212A receives the output 217A of the wide accumulator 202A and performs an activation function on it to produce the wide result 133A. When the neural processing unit 126 is in the narrow configuration, the wide activation function unit 212A understands the output 217A of the accumulator 202A accordingly and performs an activation function on it to produce a narrow result, e.g., 8 bits, as described in more detail below, for example with respect to Figures 29A through 30.
As described above, when in the narrow configuration, a single neural processing unit 126 effectively functions as two narrow neural processing units, thereby providing, for smaller words, roughly up to twice the throughput of the wide configuration. For example, assume a neural network layer with 1024 neurons, each receiving 1024 narrow inputs from the previous layer (and with narrow weight words), resulting in one million connections. For a neural network unit 121 with 512 neural processing units 126, in the narrow configuration (equivalent to 1024 narrow neural processing units), although the words processed are narrow rather than wide, four times as many connections can be processed as in the wide configuration (one million connections versus 256K connections) in approximately twice the time (about 1026 versus 514 clock cycles), i.e., roughly double the throughput.
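The throughput comparison in the example above can be checked with a few lines of arithmetic (an illustrative tally using the counts given in the text; the variable names are ours):

```python
# Connection and cycle counts from the example in the text.
npus = 512
wide_neurons, wide_inputs = 512, 512          # wide configuration
narrow_neurons, narrow_inputs = 1024, 1024    # narrow configuration

wide_connections = wide_neurons * wide_inputs          # 262144 (256K)
narrow_connections = narrow_neurons * narrow_inputs    # 1048576 (1M)

wide_cycles, narrow_cycles = 514, 1026                 # approximate, per the text

# Four times the connections in about twice the cycles:
# roughly double the throughput on narrow data.
ratio = (narrow_connections / narrow_cycles) / (wide_connections / wide_cycles)
```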
In one embodiment, the dynamically configurable neural processing unit 126 of Figure 18 includes three-input multitask buffers similar to the multitask buffers 208A and 208B in place of the registers 205A and 205B, to form a rotator that handles a row of weight words received from the weight random access memory 124, somewhat in the manner described with respect to the embodiment of Fig. 7 but applied to the dynamic configuration described with respect to Figure 18.
Figure 19 is a block diagram illustrating the 2N multitask buffers 208A/208B of the N neural processing units 126 of the neural network unit 121 of Fig. 1, according to the embodiment of Figure 18, operating as a rotator for a row of data words 207 received from the data random access memory 122 of Fig. 1. In the embodiment of Figure 19, N is 512, and the neural network unit 121 has 1024 multitask buffers 208A/208B, denoted 0 through 511, corresponding to the 512 neural processing units 126 and, effectively, to 1024 narrow neural processing units. The two narrow neural processing units within a neural processing unit 126 are denoted A and B, and for each multitask buffer 208, its corresponding narrow neural processing unit is also indicated. More specifically, the multitask buffer 208A of the neural processing unit 126 denoted 0 is denoted 0-A, the multitask buffer 208B of the neural processing unit 126 denoted 0 is denoted 0-B, the multitask buffer 208A of the neural processing unit 126 denoted 1 is denoted 1-A, the multitask buffer 208B of the neural processing unit 126 denoted 1 is denoted 1-B, the multitask buffer 208A of the neural processing unit 126 denoted 511 is denoted 511-A, and the multitask buffer 208B of the neural processing unit 126 denoted 511 is denoted 511-B; these values also correspond to the narrow neural processing units of Figure 21 described below.
Each multitask buffer 208A receives its corresponding narrow data word 207A of one of the D rows of the data random access memory 122, and each multitask buffer 208B receives its corresponding narrow data word 207B of one of the D rows of the data random access memory 122. That is, multitask buffer 0-A receives narrow data word 0 of the data random access memory 122 row, multitask buffer 0-B receives narrow data word 1 of the row, multitask buffer 1-A receives narrow data word 2 of the row, multitask buffer 1-B receives narrow data word 3 of the row, and so on, until multitask buffer 511-A receives narrow data word 1022 of the row and multitask buffer 511-B receives narrow data word 1023 of the row. In addition, multitask buffer 1-A receives the output 209A of multitask buffer 0-A as its input 211A, and multitask buffer 1-B receives the output 209B of multitask buffer 0-B as its input 211B, and so on, until multitask buffer 511-A receives the output 209A of multitask buffer 510-A as its input 211A and multitask buffer 511-B receives the output 209B of multitask buffer 510-B as its input 211B; and multitask buffer 0-A receives the output 209A of multitask buffer 511-A as its input 211A, and multitask buffer 0-B receives the output 209B of multitask buffer 511-B as its input 211B. Finally, multitask buffer 1-A receives the output 209B of multitask buffer 0-B as its input 1811A, and multitask buffer 1-B receives the output 209A of multitask buffer 1-A as its input 1811B, and so on, until multitask buffer 511-A receives the output 209B of multitask buffer 510-B as its input 1811A and multitask buffer 511-B receives the output 209A of multitask buffer 511-A as its input 1811B; and multitask buffer 0-A receives the output 209B of multitask buffer 511-B as its input 1811A, and multitask buffer 0-B receives the output 209A of multitask buffer 0-A as its input 1811B. Each multitask buffer 208A/208B receives a control input 213 that controls whether it selects the data word 207A/207B, the rotated input 211A/211B, or the rotated input 1811A/1811B. In one mode of operation, during a first clock cycle, the control input 213 controls each multitask buffer 208A/208B to select the data word 207A/207B for storage into the buffer for subsequent provision to the arithmetic logic unit 204; and during subsequent clock cycles (e.g., the M-1 clock cycles described above), the control input 213 controls each multitask buffer 208A/208B to select the rotated input 1811A/1811B for storage into the buffer for subsequent provision to the arithmetic logic unit 204, as described in more detail below.
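The selection pattern just described (load a row, then rotate by one narrow word each subsequent clock) can be sketched as follows, indexing the 1024 buffers 0-A, 0-B, 1-A, ... 511-B as 0 through 1023 (our own model; the function names are illustrative):

```python
def load_row(row):
    """First clock cycle: control input 213 selects the data words 207,
    latching one narrow word per multitask buffer."""
    return list(row)

def rotate_one_narrow(regs):
    """Subsequent clock cycles: control input 213 selects the rotated
    input 1811, so each multitask buffer takes the word held by its
    lower-numbered neighbour (buffer 0-A wraps around to buffer 511-B)."""
    return [regs[-1]] + regs[:-1]

regs = load_row(range(1024))     # narrow words 0..1023 of a DRAM row
regs = rotate_one_narrow(regs)   # now buffer 0-A holds word 1023
```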
Figure 20 is a table showing a program stored in the program memory 129 of, and executed by, the neural network unit 121 of Fig. 1, where the neural network unit 121 has neural processing units 126 according to the embodiment of Figure 18. The example program of Figure 20 is similar to the program of Fig. 4; the differences are described below. The initialize-NPU instruction at address 0 specifies that the neural processing units 126 enter the narrow configuration. In addition, as shown, the multiply-accumulate rotate instruction at address 2 specifies a count value of 1023 and requires 1023 clock cycles. This is because the example of Figure 20 assumes a layer of effectively 1024 narrow (e.g., 8-bit) neurons (i.e., narrow neural processing units), each with 1024 connection inputs from the 1024 neurons of the previous layer, for a total of 1024K connections. Each neuron receives an 8-bit data value from each connection input and multiplies it by an appropriate 8-bit weight value.
Figure 21 is a timing diagram illustrating the neural network unit 121 executing the program of Figure 20, where the neural network unit 121 includes the neural processing units 126 of Figure 18 operating in the narrow configuration. The timing diagram of Figure 21 is similar to the timing diagram of Fig. 5; the differences are described below.
In the timing diagram of Figure 21, the neural processing units 126 are in the narrow configuration because the initialize-NPU instruction at address 0 initializes them to the narrow configuration. Consequently, the 512 neural processing units 126 effectively operate as 1024 narrow neural processing units (or neurons), which are designated in the columns as neural processing unit 0-A and neural processing unit 0-B (the two narrow neural processing units of the neural processing unit 126 denoted 0), neural processing unit 1-A and neural processing unit 1-B (the two narrow neural processing units of the neural processing unit 126 denoted 1), and so on through neural processing unit 511-A and neural processing unit 511-B (the two narrow neural processing units of the neural processing unit 126 denoted 511). For simplicity of illustration, only the operations of narrow neural processing units 0-A, 0-B and 511-B are shown. Because the multiply-accumulate rotate instruction at address 2 specifies a count of 1023, which requires 1023 clock cycles of operation, the rows of the timing diagram of Figure 21 extend through 1026 clock cycles.
At clock cycle 0, each of the 1024 neural processing units executes the initialization instruction of Fig. 4, i.e., the assignment of the zero value to the accumulator 202 illustrated in Fig. 5.
At clock cycle 1, each of the 1024 narrow neural processing units executes the multiply-accumulate instruction at address 1 of Figure 20. As shown, narrow neural processing unit 0-A accumulates into the accumulator 202A value (i.e., zero) the product of narrow word 0 of row 17 of the data random access memory 122 and narrow word 0 of row 0 of the weight random access memory 124; narrow neural processing unit 0-B accumulates into the accumulator 202B value (i.e., zero) the product of narrow word 1 of row 17 of the data random access memory 122 and narrow word 1 of row 0 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B accumulates into the accumulator 202B value (i.e., zero) the product of narrow word 1023 of row 17 of the data random access memory 122 and narrow word 1023 of row 0 of the weight random access memory 124.
At clock cycle 2, each of the 1024 narrow neural processing units executes the first iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow neural processing unit 0-A accumulates into the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the multitask buffer 208B output 209B of narrow neural processing unit 511-B (i.e., narrow data word 1023 received from the data random access memory 122) and narrow word 0 of row 1 of the weight random access memory 124; narrow neural processing unit 0-B accumulates into the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multitask buffer 208A output 209A of narrow neural processing unit 0-A (i.e., narrow data word 0 received from the data random access memory 122) and narrow word 1 of row 1 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B accumulates into the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multitask buffer 208A output 209A of narrow neural processing unit 511-A (i.e., narrow data word 1022 received from the data random access memory 122) and narrow word 1023 of row 1 of the weight random access memory 124.
At clock cycle 3, each of the 1024 narrow neural processing units executes the second iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow neural processing unit 0-A accumulates into the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the multitask buffer 208B output 209B of narrow neural processing unit 511-B (i.e., narrow data word 1022 received from the data random access memory 122) and narrow word 0 of row 2 of the weight random access memory 124; narrow neural processing unit 0-B accumulates into the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multitask buffer 208A output 209A of narrow neural processing unit 0-A (i.e., narrow data word 1023 received from the data random access memory 122) and narrow word 1 of row 2 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B accumulates into the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multitask buffer 208A output 209A of narrow neural processing unit 511-A (i.e., narrow data word 1021 received from the data random access memory 122) and narrow word 1023 of row 2 of the weight random access memory 124. As shown in Figure 21, this operation continues for the following 1021 clock cycles, through clock cycle 1024 described below.
At clock cycle 1024, each of the 1024 narrow neural processing units executes the 1023rd iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown, narrow neural processing unit 0-A accumulates into the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the multitask buffer 208B output 209B of narrow neural processing unit 511-B (i.e., narrow data word 1 received from the data random access memory 122) and narrow word 0 of row 1023 of the weight random access memory 124; narrow neural processing unit 0-B accumulates into the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multitask buffer 208A output 209A of narrow neural processing unit 0-A (i.e., narrow data word 2 received from the data random access memory 122) and narrow word 1 of row 1023 of the weight random access memory 124; and so on, until narrow neural processing unit 511-B accumulates into the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the multitask buffer 208A output 209A of narrow neural processing unit 511-A (i.e., narrow data word 0 received from the data random access memory 122) and narrow word 1023 of row 1023 of the weight random access memory 124.
At clock cycle 1025, the activation function units 212A/212B of each of the 1024 narrow neural processing units execute the activation function instruction at address 3 of Figure 20. Finally, at clock cycle 1026, each of the 1024 narrow neural processing units writes its narrow result 133A/133B back to its corresponding narrow word of row 16 of the data random access memory 122, to execute the write-activation-function-unit instruction at address 4 of Figure 20. That is, the narrow result 133A of neural processing unit 0-A is written to narrow word 0 of the data random access memory 122, the narrow result 133B of neural processing unit 0-B is written to narrow word 1 of the data random access memory 122, and so on, until the narrow result 133B of neural processing unit 511-B is written to narrow word 1023 of the data random access memory 122. Figure 22 illustrates the operation described above in block-diagram form.
Figure 22 is a block diagram illustrating the neural network unit 121 of Fig. 1, including the neural processing units 126 of Figure 18, executing the program of Figure 20. The neural network unit 121 includes the 512 neural processing units 126, i.e., 1024 narrow neural processing units, the data random access memory 122, which receives its address input 123, and the weight random access memory 124, which receives its address input 125. Although not shown, at clock cycle 0, the 1024 narrow neural processing units execute the initialization instruction of Figure 20. As shown, at clock cycle 1, the 1024 8-bit data words of row 17 are read out of the data random access memory 122 and provided to the 1024 narrow neural processing units. At clock cycles 1 through 1024, the 1024 8-bit weight words of rows 0 through 1023, respectively, are read out of the weight random access memory 124 and provided to the 1024 narrow neural processing units. Although not shown, at clock cycle 1, the 1024 narrow neural processing units perform their respective multiply-accumulate operations on the loaded data words and weight words. At clock cycles 2 through 1024, the multitask buffers 208A/208B of the 1024 narrow neural processing units operate as a rotator of 1024 8-bit words to rotate the previously loaded data words of row 17 of the data random access memory 122 to the adjacent narrow neural processing units, and the narrow neural processing units perform multiply-accumulate operations on the respective rotated data words and the respective narrow weight words loaded from the weight random access memory 124. Although not shown, at clock cycle 1025, the 1024 narrow activation function units 212A/212B execute the activation instruction. At clock cycle 1026, the 1024 narrow neural processing units write their respective 1024 8-bit results 133A/133B back to row 16 of the data random access memory 122.
As may be observed, the embodiment of Figure 18, relative to the embodiment of Fig. 2, gives the programmer the flexibility to perform computations using either wide data and weight words (e.g., 16 bits) or narrow data and weight words (e.g., 8 bits), in response to the accuracy demands of a particular application. From one perspective, for narrow-data applications, the embodiment of Figure 18 provides twice the throughput of the embodiment of Fig. 2, at the cost of the additional narrow elements (e.g., multitask buffer 208B, register 205B, narrow arithmetic logic unit 204B, narrow accumulator 202B, narrow activation function unit 212B), which increase the area of the neural processing unit 126 by approximately 50%.
Tri-mode neural processing units
Figure 23 is a block diagram illustrating another embodiment of the dynamically configurable neural processing unit 126 of Fig. 1. The neural processing unit 126 of Figure 23 is configurable not only in the wide and narrow configurations, but also in a third configuration, referred to herein as the "funnel" configuration. The neural processing unit 126 of Figure 23 is similar to the neural processing unit 126 of Figure 18. However, the wide adder 244A of Figure 18 is replaced in the neural processing unit 126 of Figure 23 by a three-input wide adder 2344A, which receives a third addend 2399 that is an extended version of the output of the narrow multiplexer 1896B. A program executed by a neural network unit having the neural processing units of Figure 23 is similar to the program of Figure 20, except that the initialize-NPU instruction at address 0 initializes the neural processing units 126 to the funnel configuration rather than the narrow configuration, and the count value of the multiply-accumulate rotate instruction at address 2 is 511 rather than 1023.
When in the funnel configuration, the neural processing unit 126 operates similarly to the narrow configuration: when executing a multiply-accumulate instruction such as that at address 1 of Figure 20, the neural processing unit 126 receives two narrow data words 207A/207B and two narrow weight words 206A/206B; the wide multiplier 242A multiplies data word 209A and weight word 203A to generate product 246A, which the wide multiplexer 1896A selects; and the narrow multiplier 242B multiplies data word 209B and weight word 203B to generate product 246B, which the narrow multiplexer 1896B selects. However, the wide adder 2344A adds both product 246A (selected by the wide multiplexer 1896A) and product 246B/2399 (selected by the wide multiplexer 1896B) to the wide accumulator 202A output 217A, while the narrow adder 244B and the narrow accumulator 202B are inactive. Furthermore, when in the funnel configuration and executing a multiply-accumulate rotate instruction such as that at address 2 of Figure 20, the control input 213 causes the multitask buffers 208A/208B to rotate by two narrow words (e.g., 16 bits); that is, the multitask buffers 208A/208B select their respective inputs 211A/211B, just as in the wide configuration. However, the wide multiplier 242A multiplies data word 209A and weight word 203A to generate product 246A, which the wide multiplexer 1896A selects; the narrow multiplier 242B multiplies data word 209B and weight word 203B to generate product 246B, which the narrow multiplexer 1896B selects; and the wide adder 2344A adds both product 246A (selected by the wide multiplexer 1896A) and product 246B/2399 (selected by the wide multiplexer 1896B) to the wide accumulator 202A output 217A, while the narrow adder 244B and the narrow accumulator 202B are inactive, as described above. Finally, when in the funnel configuration and executing an activation function instruction such as that at address 3 of Figure 20, the wide activation function unit 212A performs the activation function on the resulting sum 215A to generate the narrow result 133A, while the narrow activation function unit 212B is inactive. Consequently, only the narrow neural processing units denoted A generate a narrow result 133A; the narrow results 133B generated by the narrow neural processing units denoted B are invalid. Accordingly, the row of results written back (e.g., row 16, as indicated by the instruction at address 4 of Figure 20) contains holes, since only the narrow results 133A are valid and the narrow results 133B are invalid. Thus, conceptually, in each clock cycle, each neuron (each neural processing unit of Figure 23) processes two connection data inputs, i.e., multiplies two narrow data words by their respective weights and adds the two products, whereas the embodiments of Fig. 2 and Figure 18 process only one connection data input per clock cycle.
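One funnel-mode clock cycle can be sketched as follows (our own illustration of the three-input addition; the function name is ours, not the patent's):

```python
def funnel_macc(acc, data_a, weight_a, data_b, weight_b):
    """One funnel-configuration clock: both narrow products are formed and
    the three-input adder folds them into the single wide accumulator,
    so one neuron consumes two connection inputs per cycle."""
    prod_a = data_a * weight_a      # wide multiplier 242A
    prod_b = data_b * weight_b      # narrow multiplier 242B
    return acc + prod_a + prod_b    # three-input wide adder 2344A
```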
It may be observed with respect to the embodiment of Figure 23 that the number of result words (neuron outputs) generated and written back to the data random access memory 122 or the weight random access memory 124 is half the square root of the number of data inputs (connections) received, and that the written-back row of results contains holes, i.e., every other narrow word result is invalid; more specifically, the results of the narrow neural processing units denoted B are not meaningful. The embodiment of Figure 23 is therefore particularly efficient for neural networks having two consecutive layers in which, for example, the first layer has twice as many neurons as the second layer (e.g., a first layer of 1024 neurons fully connected to a second layer of 512 neurons). Furthermore, the other execution units (e.g., media units, such as an x86 Advanced Vector Extensions unit) may, when necessary, perform a pack operation on the dispersed (i.e., hole-containing) row of results to make it compact (i.e., without holes), so that the packed row of data may be used in subsequent computations when the neural network unit 121 performs other computations associated with other rows of the data random access memory 122 and/or the weight random access memory 124.
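The pack operation referred to above amounts to discarding the invalid B-side words. A minimal sketch (our own model, assuming the valid A-side results sit at even word positions; the function name is ours):

```python
def pack_results(row):
    """Compact a funnel-mode result row: keep only the valid A-side
    results at even word positions; the B-side words at odd positions
    are holes."""
    return [word for i, word in enumerate(row) if i % 2 == 0]
```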
Hybrid neural network unit operation: convolution and pooling capabilities
An advantage of the neural network unit 121 according to the embodiments described herein is that it can operate concurrently in a fashion resembling a coprocessor, executing its own internal program, and in a fashion resembling an execution unit of a processor, executing architectural instructions issued to it (or microinstructions translated from architectural instructions). The architectural instructions are part of an architectural program being performed by the processor that includes the neural network unit 121. In this manner, the neural network unit 121 operates in a hybrid fashion, which enables high utilization of the neural processing units 126 to be maintained. For example, Figures 24 through 26 illustrate the operation of the neural network unit 121 performing a convolution operation, in which the neural network unit is fully utilized, and Figures 27 through 28 illustrate the operation of the neural network unit 121 performing a pooling operation. Convolution layers, pooling layers, and other applications of digital data computation, such as image processing (e.g., edge detection, sharpening, blurring, recognition/classification), require these operations. However, the hybrid operation of the neural network unit 121 is not limited to performing convolution or pooling operations; the hybrid feature may also be used to perform other operations, such as the conventional neural-network multiply-accumulate and activation function operations described above with respect to Figs. 4 through 13. That is, the processor 100 (more precisely, the reservation station 108) issues MTNN instructions 1400 and MFNN instructions 1500 to the neural network unit 121, in response to which the neural network unit 121 writes data into the memories 122/124/129 and reads results from the memories 122/124 that the neural network unit 121 wrote to, while at the same time the neural network unit 121 reads and writes the memories 122/124/129 in the course of executing the program written by the processor 100 (via MTNN instructions 1400) into the program memory 129.
Figure 24 is a block diagram illustrating an example of data structures used by the neural network unit 121 of Fig. 1 to perform a convolution operation. The block diagram includes a convolution kernel 2402, a data array 2404, and the data random access memory 122 and the weight random access memory 124 of Fig. 1. For a preferred embodiment, the data array 2404 (e.g., of image pixels) is held in system memory (not shown) attached to the processor 100 and is loaded by the processor 100, through execution of MTNN instructions 1400, into the weight random access memory 124 of the neural network unit 121. A convolution operation convolves a first array with a second array, the second array being the convolution kernel described herein. As described herein, a convolution kernel is a matrix of coefficients, which may also be referred to as weights, parameters, elements or values. For a preferred embodiment, the convolution kernel 2402 is static data of the architectural program being executed by the processor 100.
The data array 2404 is a two-dimensional array of data values, each data value (e.g., an image pixel value) being the size of a word of the data random access memory 122 or the weight random access memory 124 (e.g., 16 bits or 8 bits). In this example, the data values are 16-bit words, and the neural network unit 121 is configured as 512 wide-configuration neural processing units 126. Additionally, in this embodiment, the neural processing units 126 include a multitask buffer for receiving the weight words 206 from the weight random access memory 124, such as the multitask buffer 705 of Fig. 7, in order to perform a collective rotator operation on a row of data values received from the weight random access memory 124, as described in more detail below. In this example, the data array 2404 is a 2560-column by 1600-row pixel array. As shown, when the architectural program convolves the data array 2404 with the convolution kernel 2402, it divides the data array 2404 into 20 data blocks, each a 512 x 400 data array 2406.
In this example, the convolution kernel 2402 is a 3x3 array of coefficients, weights, parameters, or elements. The first row of the coefficients is denoted C0,0; C0,1; and C0,2; the second row of the coefficients is denoted C1,0; C1,1; and C1,2; and the third row of the coefficients is denoted C2,0; C2,1; and C2,2. For example, a convolution kernel with the following coefficients may be used to perform edge detection: 0, 1, 0, 1, -4, 1, 0, 1, 0. For another embodiment, a convolution kernel with the following coefficients may be used to perform a Gaussian blur operation: 1, 2, 1, 2, 4, 2, 1, 2, 1. In this case, a divide is typically performed on the final accumulated value, where the divisor is the sum of the absolute values of the elements of the convolution kernel 2402, which is 16 in this example. For another example, the divisor is the number of elements of the convolution kernel 2402. For another example, the divisor is a value that compresses the convolution result back into a desired range of values, and the divisor is determined from the element values of the convolution kernel 2402, the desired range, and the range of the input value array upon which the convolution operation is being performed.
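The two normalization choices above can be illustrated with a minimal sketch, not the patent's hardware: a plain-Python 3x3 convolution followed by the divide on the accumulated sum. Edge pixels are skipped, mirroring the text's omission of edge handling.

```python
GAUSSIAN_BLUR = [[1, 2, 1],
                 [2, 4, 2],
                 [1, 2, 1]]   # sum of |coefficients| = 16

def convolve3x3(data, kernel, divisor=1):
    """Convolve interior elements of data with a 3x3 kernel, then divide."""
    rows, cols = len(data), len(data[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            acc = 0
            for i in range(3):          # accumulate the nine products
                for j in range(3):
                    acc += kernel[i][j] * data[r - 1 + i][c - 1 + j]
            out[r][c] = acc // divisor  # divide the final accumulated value
    return out

# A constant image blurred with the Gaussian kernel stays constant,
# because the divisor (16) equals the sum of the coefficients.
img = [[10] * 5 for _ in range(5)]
blurred = convolve3x3(img, GAUSSIAN_BLUR, divisor=16)
```

Without the divide by 16, each output would be 160 rather than 10, which is why the accumulated sum must be compressed back into the input range.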
Referring now to Figures 24 and 25, whose details are described below, the architectural program writes the coefficients of the convolution kernel 2402 into the data random access memory 122. For a preferred embodiment, all the words of each of nine consecutive rows of the data random access memory 122 (the number of elements in the convolution kernel 2402) are written with a different element of the convolution kernel 2402 in its row-major order. That is, as shown, each word of one row is written with the first coefficient C0,0; the next row is written with the second coefficient C0,1; the next row is written with the third coefficient C0,2; the next row is written with the fourth coefficient C1,0; and so forth, until each word of the ninth row is written with the ninth coefficient C2,2. To convolve the data matrices 2406 of the chunks into which the data array 2404 is partitioned, the neural processing units 126 repeatedly read, in order, the nine rows of the data random access memory 122 that hold the coefficients of the convolution kernel 2402, as described in more detail below, particularly in connection with Figure 26A.
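The coefficient layout just described can be sketched as follows; this is an illustrative model, not the patent's hardware interface. Each of nine consecutive data-RAM rows holds one kernel coefficient replicated into all 512 word positions, in row-major kernel order (C0,0 first, C2,2 last), so that every neural processing unit reads the same coefficient on a given pass.

```python
NUM_NPUS = 512

def layout_kernel_rows(kernel):
    """Return the nine data-RAM rows for a 3x3 kernel, one coefficient per row."""
    rows = []
    for i in range(3):
        for j in range(3):
            # replicate the coefficient across all 512 word positions
            rows.append([kernel[i][j]] * NUM_NPUS)
    return rows

edge_kernel = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]  # edge-detection kernel above
dram_rows = layout_kernel_rows(edge_kernel)
```

This replication is what lets all 512 neural processing units share one multiply-accumulate instruction per coefficient.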
Referring again to Figures 24 and 25, the architectural program writes the values of a data matrix 2406 into the weight random access memory 124. As the neural network unit program performs the convolution operation, it writes the result array back to the weight random access memory 124. For a preferred embodiment, the architectural program writes a first data matrix 2406 into the weight random access memory 124 and starts the neural network unit 121; then, while the neural network unit 121 is convolving the first data matrix 2406 with the convolution kernel 2402, the architectural program writes a second data matrix 2406 into the weight random access memory 124, so that as soon as the neural network unit 121 completes the convolution of the first data matrix 2406, it can start convolving the second data matrix 2406, as described in more detail with respect to Figure 25. In this manner, the architectural program ping-pongs back and forth between the two regions of the weight random access memory 124 in order to keep the neural network unit 121 fully utilized. Thus, the example of Figure 24 shows a first data matrix 2406A and a second data matrix 2406B, the first data matrix 2406A corresponding to a first chunk occupying rows 0 through 399 of the weight random access memory 124, and the second data matrix 2406B corresponding to a second chunk occupying rows 500 through 899 of the weight random access memory 124. Furthermore, as shown, the neural network unit 121 writes the results of the convolution operations back to rows 900-1299 and 1300-1699 of the weight random access memory 124, from which the architectural program subsequently reads the results. The data values of a data matrix 2406 loaded into the weight random access memory 124 are denoted "Dx,y", where "x" is the weight random access memory 124 row number and "y" is the word, or column, number of the weight random access memory. Thus, for example, data word 511 in row 399 is denoted D399,511 in Figure 24, and this data word is received by the multiplexed register 705 of neural processing unit 511.
Figure 25 is a flowchart illustrating operation of the processor 100 of Figure 1 as it executes the architectural program that uses the neural network unit 121 to convolve the data array 2404 of Figure 24 with the convolution kernel 2402. Flow begins at step 2502.
At step 2502, the processor 100, i.e., the processor 100 running the architectural program, writes the convolution kernel 2402 of Figure 24 into the data random access memory 122 in the manner shown and described with respect to Figure 24. Additionally, the architectural program initializes a variable N to a value of 1. The variable N denotes the chunk of the data array 2404 currently being processed by the neural network unit 121. Additionally, the architectural program initializes a variable NUM_CHUNKS to a value of 20. Flow proceeds to step 2504.
At step 2504, the processor 100 writes the data matrix 2406 for chunk 1 into the weight random access memory 124 (e.g., the data matrix 2406A of chunk 1), as shown in Figure 24. Flow proceeds to step 2506.
At step 2506, the processor 100 writes a convolution program into the program memory 129 of the neural network unit 121, using MTNN instructions 1400 that specify a function 1432 to write the program memory 129. The processor 100 then starts the neural network unit convolution program using an MTNN instruction 1400 that specifies a function 1432 to start executing the program. An example of the neural network unit convolution program is described in more detail with respect to Figure 26A. Flow proceeds to step 2508.
At decision step 2508, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2512; otherwise, flow proceeds to step 2514.
At step 2512, the processor 100 writes the data matrix 2406 for chunk N+1 into the weight random access memory 124 (e.g., the data matrix 2406B of chunk 2), as shown in Figure 24. Thus, the architectural program writes the data matrix 2406 for the next chunk into the weight random access memory 124 while the neural network unit 121 is performing the convolution on the current chunk, so that the neural network unit 121 can immediately begin performing the convolution on the next chunk once the convolution of the current chunk is complete, that is, once it has been written to the weight random access memory 124.
At step 2514, the processor 100 determines whether the currently running neural network unit program (started at step 2506 in the case of chunk 1, and at step 2518 in the case of chunks 2 through 20) has completed. For a preferred embodiment, the processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of the neural network unit 121. In an alternate embodiment, the neural network unit 121 generates an interrupt to indicate that it has completed the convolution program. Flow proceeds to decision step 2516.
At decision step 2516, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2518; otherwise, flow proceeds to step 2522.
At step 2518, the processor 100 updates the convolution program so that it can convolve chunk N+1. More specifically, the processor 100 updates the weight random access memory 124 row value of the initialize neural processing unit instruction at address 0 to the first row of the data matrix 2406 (e.g., to row 0 of data matrix 2406A or to row 500 of data matrix 2406B) and updates the output row (e.g., to row 900 or 1300). The processor 100 then starts the updated neural network unit convolution program. Flow proceeds to step 2522.
At step 2522, the processor 100 reads from the weight random access memory 124 the results of the neural network unit convolution program for chunk N. Flow proceeds to decision step 2524.
At decision step 2524, the architectural program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2526; otherwise, flow ends.
At step 2526, the architectural program increments N by one. Flow returns to decision step 2508.
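The double-buffered control flow of Figure 25 can be sketched as a driver loop; the callback names here are illustrative stand-ins, not the patent's API, and the neural network unit is modeled as a trivial black box.

```python
def run_flow(write_chunk, start_convolution, wait_done, read_results,
             num_chunks=20):
    """Model the Figure 25 flow: overlap chunk writes with convolutions."""
    write_chunk(1)                      # step 2504
    start_convolution(1)                # step 2506
    n = 1
    while True:
        if n < num_chunks:              # step 2508 -> 2512
            write_chunk(n + 1)          # write next chunk during convolution
        wait_done(n)                    # step 2514: poll status / interrupt
        if n < num_chunks:              # step 2516 -> 2518
            start_convolution(n + 1)    # updated program for next chunk
        read_results(n)                 # step 2522
        if n >= num_chunks:             # step 2524: last chunk processed
            break
        n += 1                          # step 2526
    return n

trace = []
done = run_flow(lambda c: trace.append(('w', c)),
                lambda c: trace.append(('s', c)),
                lambda c: trace.append(('d', c)),
                lambda c: trace.append(('r', c)))
```

The trace shows the key property claimed in the text: the write of chunk N+1 is issued before waiting on chunk N, so the memory transfer overlaps the convolution.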
Figure 26A is a program listing of a neural network unit program that performs a convolution of a data matrix 2406 with the convolution kernel 2402 of Figure 24 and writes it back to the weight random access memory 124. The program loops a number of times through a loop body of instructions at addresses 1 through 9. An initialize neural processing unit instruction at address 0 specifies the number of times each neural processing unit 126 executes the loop body, which in the example of Figure 26A has a loop count value of 400, corresponding to the number of rows in a data matrix 2406 of Figure 24, and the loop instruction at the end of the loop (at address 10) decrements the current loop count value and, if the result is non-zero, causes control to return to the top of the loop body (i.e., to the instruction at address 1). The initialize neural processing unit instruction also clears the accumulator 202 to zero. For a preferred embodiment, the loop instruction at address 10 also clears the accumulator 202 to zero. Alternatively, as described above, the multiply-accumulate instruction at address 1 may specify that the accumulator 202 be cleared to zero.
For each execution of the loop body of the program, the 512 neural processing units 126 concurrently perform 512 convolutions of the 3x3 convolution kernel and 512 respective 3x3 submatrices of a data matrix 2406. The convolution is the sum of the nine products of an element of the convolution kernel 2402 and its corresponding element of the respective submatrix. In the embodiment of Figure 26A, the origin (center element) of each of the 512 respective 3x3 submatrices is the data word Dx+1,y+1 of Figure 24, where y (the column number) is the neural processing unit 126 number and x (the row number) is the weight random access memory 124 row number currently read by the multiply-accumulate instruction at address 1 of the program of Figure 26A (the row number is initialized by the initialize neural processing unit instruction at address 0, incremented as each of the multiply-accumulate instructions at addresses 3 and 5 is executed, and updated by the decrement instruction at address 9). Thus, for each pass of the loop, the 512 neural processing units 126 compute the 512 convolutions and write the 512 convolution results back to the specified row of the weight random access memory 124. Edge handling is omitted in this description to simplify the explanation, although it should be noted that the use of the collective rotating feature of the neural processing units 126 will cause a wrapping of two of the columns of data (e.g., of the image, in the case of image processing of the data matrix 2406) from one vertical edge to the other vertical edge (e.g., from the left edge to the right edge, or vice versa). The loop body will now be described.
The instruction at address 1 is a multiply-accumulate instruction that specifies row 0 of the data random access memory 122 and implicitly uses the current weight random access memory 124 row, which is preferably held in the sequencer 128 (and which is initialized to zero by the instruction at address 0 for the first pass through the loop body). That is, the instruction at address 1 causes each neural processing unit 126 to read its respective word from row 0 of the data random access memory 122, to read its respective word from the current weight random access memory 124 row, and to perform a multiply-accumulate operation on the two words. Thus, for example, neural processing unit 5 multiplies C0,0 by Dx,5 (where "x" is the current weight random access memory 124 row), adds the result to the accumulator 202 value 217, and writes the sum back to the accumulator 202.
The instruction at address 2 is a multiply-accumulate instruction that specifies that the data random access memory 122 row be incremented (i.e., to 1) and that the row then be read at the incremented address of the data random access memory 122. The instruction also specifies that the value in the multiplexed register 705 of each neural processing unit 126 be rotated to the adjacent neural processing unit 126, which in this case is the row of data matrix 2406 values read from the weight random access memory 124 in response to the instruction at address 1. In the embodiment of Figures 24 through 26, the neural processing units 126 are configured to rotate the values of the multiplexed registers 705 to the left, i.e., from neural processing unit J to neural processing unit J-1, rather than from neural processing unit J to neural processing unit J+1 as described above with respect to Figures 3, 7, and 19. It should be noted that, in an embodiment in which the neural processing units 126 rotate to the right, the architectural program may write the convolution kernel 2402 coefficient values into the data random access memory 122 in a different order (e.g., rotated around its center column) in order to accomplish a similar convolution result. Furthermore, the architectural program may perform additional pre-processing of the convolution kernel (e.g., a transposition) as needed. Additionally, the instruction specifies a count value of 2. Thus, the instruction at address 2 causes each neural processing unit 126 to read its respective word from row 1 of the data random access memory 122, to receive the rotated word into its multiplexed register 705, and to perform a multiply-accumulate operation on the two words. Because the count value is 2, the instruction also causes each neural processing unit 126 to repeat the foregoing operation. That is, the sequencer 128 increments the data random access memory 122 row address 123 (i.e., to 2), and each neural processing unit 126 reads its respective word from row 2 of the data random access memory 122, receives the rotated word into its multiplexed register 705, and performs a multiply-accumulate operation on the two words. Thus, for example, assuming the current weight random access memory 124 row is 27, after executing the instruction at address 2, neural processing unit 5 will have accumulated into its accumulator 202 the product of C0,1 and D27,6 and the product of C0,2 and D27,7. Thus, after the completion of the instructions at addresses 1 and 2, the product of C0,0 and D27,5, the product of C0,1 and D27,6, and the product of C0,2 and D27,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body.
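The interaction of the rotate-left step with the multiply-accumulate can be sketched functionally (a model, not the hardware): rotating the multiplexed registers from unit J to unit J-1 means each unit n receives its right-hand neighbor's word, so unit n sees weight-RAM words n, n+1, n+2 across one kernel row, with unit 511 wrapping around to word 0 as noted above.

```python
NUM_NPUS = 512

def accumulate_kernel_row(acc, weight_row, coeffs):
    """Model addresses 1-2: per-coefficient multiply-accumulate with rotation."""
    regs = list(weight_row)            # mux-regs 705 load the weight-RAM row
    for c in coeffs:                   # e.g. C0,0 then C0,1 then C0,2
        for n in range(NUM_NPUS):
            acc[n] += c * regs[n]      # every NPU uses the same coefficient
        regs = regs[1:] + regs[:1]     # collective rotate left (J -> J-1)
    return acc

weight_row_27 = list(range(NUM_NPUS))  # stand-in values for D27,0..D27,511
acc = accumulate_kernel_row([0] * NUM_NPUS, weight_row_27, [1, 2, 3])
# NPU 5 accumulated 1*D27,5 + 2*D27,6 + 3*D27,7 = 5 + 12 + 21 = 38
```

Note that NPU 511's accumulation draws on words 0 and 1 of the row, illustrating the edge wrap the text describes.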
The operation performed by the instructions at addresses 3 and 4 is similar to that of the instructions at addresses 1 and 2, but by virtue of the weight random access memory 124 row increment indicator, they operate on the next row of the weight random access memory 124, and they operate on the next three rows of the data random access memory 122, i.e., rows 3 through 5. That is, taking neural processing unit 5 as an example, after the completion of the instructions at addresses 1 through 4, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, and the product of C1,2 and D28,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body.
The operation performed by the instructions at addresses 5 and 6 is similar to that of the instructions at addresses 3 and 4, operating on the next row of the weight random access memory 124 and on the next three rows of the data random access memory 122, i.e., rows 6 through 8. That is, taking neural processing unit 5 as an example, after the completion of the instructions at addresses 1 through 6, the product of C0,0 and D27,5, the product of C0,1 and D27,6, the product of C0,2 and D27,7, the product of C1,0 and D28,5, the product of C1,1 and D28,6, the product of C1,2 and D28,7, the product of C2,0 and D29,5, the product of C2,1 and D29,6, and the product of C2,2 and D29,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body. That is, after the completion of the instructions at addresses 1 through 6, and assuming the weight random access memory 124 row was 27 at the beginning of the loop body, neural processing unit 5, for example, will have convolved the following 3x3 submatrix with the convolution kernel 2402:
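The submatrix display appears to have been lost in extraction; from the nine products enumerated above for neural processing unit 5 with a starting weight RAM row of 27, it would read:

```latex
\begin{bmatrix}
D_{27,5} & D_{27,6} & D_{27,7} \\
D_{28,5} & D_{28,6} & D_{28,7} \\
D_{29,5} & D_{29,6} & D_{29,7}
\end{bmatrix}
```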
More generally, after the completion of the instructions at addresses 1 through 6, each of the 512 neural processing units 126 will have convolved the following 3x3 submatrix with the convolution kernel 2402:
where r is the weight random access memory 124 row address value at the beginning of the loop body, and n is the number of the neural processing unit 126.
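In the document's Dx,y notation (x the weight RAM row, y the word number), the general submatrix, whose display also appears to be missing, and the convolution each neural processing unit n computes can be written as:

```latex
\begin{bmatrix}
D_{r,n}   & D_{r,n+1}   & D_{r,n+2} \\
D_{r+1,n} & D_{r+1,n+1} & D_{r+1,n+2} \\
D_{r+2,n} & D_{r+2,n+1} & D_{r+2,n+2}
\end{bmatrix},
\qquad
\text{result}_n = \sum_{i=0}^{2}\sum_{j=0}^{2} C_{i,j}\, D_{\,r+i,\;(n+j) \bmod 512}
```

The modulo 512 on the column index captures the wrap of the two rightmost columns from one vertical edge to the other, as noted in the discussion of edge handling above.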
The instruction at address 7 passes through the accumulator 202 value 217 through the activation function unit 212. The pass-through function passes through a word whose size (in bits) is equal to the word size read from the data random access memory 122 and weight random access memory 124 (i.e., in this example, 16 bits). For a preferred embodiment, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail below. Alternatively, rather than specifying a pass-through activation function, a divide activation function may be specified, which divides the accumulator 202 value 217 by a divisor, as described herein with respect to Figures 29A and 30, e.g., using one of the "dividers" 3014/3016 of Figure 30. For example, in the case of a convolution kernel 2402 with coefficients that include a sixteenth, such as the Gaussian blur kernel described above, the instruction at address 7 may specify a divide activation function (e.g., divide by 16), rather than a pass-through function. Alternatively, the architectural program may perform the divide by 16 on the convolution kernel 2402 coefficients before writing them to the data random access memory 122 and adjust the location of the binary point for the convolution kernel 2402 values accordingly, e.g., using the data binary point 2922 of Figure 29, described below.
The instruction at address 8 writes the output of the activation function unit 212 to the row of the weight random access memory 124 specified by the current value of the output row register. The current value is initialized by the instruction at address 0 and is incremented each pass through the loop by virtue of the increment indicator in the instruction.
As may be determined from the example of Figures 24 through 26 with a 3x3 convolution kernel 2402, the neural processing units 126 read the weight random access memory 124 approximately every three clock cycles to read a row of the data matrix 2406, and write the weight random access memory 124 approximately every 12 clock cycles to write a row of the convolution result matrix. Additionally, assuming an embodiment that includes a write and read buffer such as the buffer 1704 of Figure 17, concurrently with the neural processing unit 126 reads and writes, the processor 100 reads and writes the weight random access memory 124 such that the buffer 1704 performs one read and one write of the weight random access memory approximately every 16 clock cycles, to read the data matrices and to write the convolution result matrices, respectively. Thus, approximately half of the bandwidth of the weight random access memory 124 is consumed by the hybrid manner in which the neural network unit 121 performs the convolution kernel operation. Although this example includes a 3x3 convolution kernel 2402, the invention is not limited thereto; convolution kernels of other sizes, such as 2x2, 4x4, 5x5, 6x6, 7x7, 8x8, and so forth, may also be employed with different neural network unit programs. In the case of a larger convolution kernel, the percentage of the time the neural processing units 126 read the weight random access memory 124 is reduced, because the counts of the rotating versions of the multiply-accumulate instructions (such as the instructions at addresses 2, 4, and 6 of Figure 26A, which a larger convolution kernel would require) are larger; therefore, the percentage of the bandwidth of the weight random access memory 124 consumed is also reduced.
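The "approximately half the bandwidth" claim can be checked with a back-of-the-envelope calculation, under the assumption that the approximate access rates stated above hold steadily: one NPU read every ~3 clocks, one NPU result write every ~12 clocks, and one buffer-1704 read plus one write every ~16 clocks.

```python
from fractions import Fraction

npu_reads  = Fraction(1, 3)    # data matrix row read per ~3 clocks
npu_writes = Fraction(1, 12)   # result row written per ~12 clocks
buf_reads  = Fraction(1, 16)   # processor reads results via buffer 1704
buf_writes = Fraction(1, 16)   # processor writes next data matrix

# Fraction of weight-RAM access slots consumed per clock cycle
utilization = npu_reads + npu_writes + buf_reads + buf_writes
# 1/3 + 1/12 + 1/16 + 1/16 = 13/24, i.e. roughly half
```

The total of 13/24 (about 0.54) is consistent with the text's "approximately half" figure.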
Alternatively, rather than having the neural network unit program write the convolution results back to different rows of the weight random access memory 124 (e.g., rows 900-1299 and 1300-1699), the architectural program configures the neural network unit program to overwrite rows of the input data matrix 2406 after the rows are no longer needed. For example, in the case of a 3x3 convolution kernel, rather than writing the data matrix 2406 to rows 0-399 of the weight random access memory 124, the architectural program writes it to rows 2-401, and the neural network unit program writes the convolution results starting at row 0 of the weight random access memory 124 and increments the output row number each pass through the loop body. In this manner, the neural network unit program overwrites only rows that are no longer needed. For example, after the first pass through the loop body (or, more specifically, after the execution of the instruction at address 1, which loads in row 0 of the weight random access memory 124), the data in row 0 can be overwritten, although the data in rows 1-3 are needed for the second pass through the loop body and cannot be overwritten; similarly, after the second pass through the loop body, the data in row 1 can be overwritten, although the data in rows 2-4 are needed for the third pass through the loop body and cannot be overwritten; and so forth. In this embodiment, the height of each data matrix 2406 (chunk) may be larger (e.g., 800 rows), resulting in fewer chunks.
Alternatively, rather than writing the convolution results back to the weight random access memory 124, the architectural program configures the neural network unit program to write the convolution results back to rows of the data random access memory 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program reads the results from the data random access memory 122 as the neural network unit 121 writes them (e.g., using the most-recently-written data random access memory 122 row address 2606 of Figure 26B, described below). This alternative may be advantageous in an embodiment with a single-ported weight random access memory 124 and a dual-ported data random access memory 122.
As may be observed from the operation of the neural network unit 121 according to the embodiment of Figures 24 through 26A, each execution of the program of Figure 26A takes approximately 5000 clock cycles; consequently, the convolution of the entire 2560x1600 data array 2404 of Figure 24 takes approximately 100,000 clock cycles, considerably fewer than the number of clock cycles required to perform the same task by conventional methods.
Figure 26B is a block diagram illustrating an embodiment of certain fields of the status register 127 of the neural network unit 121 of Figure 1. The status register 127 includes a field 2602 that indicates the address of the row of the weight random access memory 124 most recently written by the neural processing units 126; a field 2606 that indicates the address of the row of the data random access memory 122 most recently written by the neural processing units 126; a field 2604 that indicates the address of the row of the weight random access memory 124 most recently read by the neural processing units 126; and a field 2608 that indicates the address of the row of the data random access memory 122 most recently read by the neural processing units 126. This enables the architectural program executing on the processor 100 to determine the progress of the neural network unit 121 as it reads and/or writes rows of the data random access memory 122 and/or weight random access memory 124. Using this capability, along with the choice to overwrite the input data matrix as described above (or to write the results to the data random access memory 122 as described above), the data array 2404 of Figure 24 may be processed as 5 chunks of 512x1600 rather than 20 chunks of 512x400, as in the following example. The processor 100 writes a first 512x1600 chunk into the weight random access memory 124 beginning at row 2 and starts the neural network unit program (which has a loop count of 1600 and an initialized weight random access memory 124 output row of 0). As the neural network unit 121 executes the neural network unit program, the processor 100 monitors the output location/address of the weight random access memory 124 in order to (1) read (using MFNN instructions 1500) the rows of the weight random access memory 124 that have valid convolution results written by the neural network unit 121 (beginning at row 0), and (2) write the second 512x1600 data matrix 2406 (beginning at row 2) over the valid convolution results once they have already been read, so that when the neural network unit 121 completes the neural network unit program on the first 512x1600 chunk, the processor 100 can immediately update the neural network unit program as needed and start it again to process the second 512x1600 chunk. This process is repeated three more times for the remaining three 512x1600 chunks, in order to keep the neural network unit 121 fully utilized.
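The monitoring scheme above can be sketched as a polling loop; the function names here are illustrative assumptions, not the patent's API, and the simplification of reusing each result row immediately stands in for the chunk-overwrite detail.

```python
def drain_results(poll_written_row, read_row, overwrite_row, total_rows):
    """Poll the most-recently-written-row field and consume rows as they
    become valid, reusing each consumed row for the next chunk's data."""
    next_to_read = 0                     # results begin at row 0
    while next_to_read < total_rows:
        written = poll_written_row()     # e.g. field 2602 read via MFNN
        while next_to_read <= written:
            read_row(next_to_read)       # (1) read valid convolution result
            overwrite_row(next_to_read)  # (2) overwrite with next chunk
            next_to_read += 1
    return next_to_read

progress = iter([3, 7, 7, 9])            # simulated field-2602 samples
log = []
done = drain_results(lambda: next(progress),
                     lambda r: log.append(('read', r)),
                     lambda r: log.append(('write', r)),
                     total_rows=10)
```

Each poll advances consumption up to the hardware's reported write position, so the architectural program never reads a row the neural network unit has not yet written.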
In one embodiment, the activation function unit 212 includes the capability to efficiently perform an effective division of the accumulator 202 value 217, as described in more detail below, particularly with respect to Figures 29A, 29B, and 30. For example, an activation function neural network unit instruction that divides the accumulator 202 value by 16 may be used for the Gaussian blur matrix described above.
Although the convolution kernel 2402 used in the example of Figure 24 is a small static convolution kernel applied to the entire data array 2404, the invention is not limited thereto; the convolution kernel may instead be a large matrix that has unique weights associated with the different data values of the data array 2404, such as is commonly found in convolutional neural networks. When the neural network unit 121 is used in this manner, the architectural program may swap the locations of the data matrix and the convolution kernel, i.e., place the data matrix in the data random access memory 122 and the convolution kernel in the weight random access memory 124, and the number of rows that need to be processed by a given execution of the neural network unit program may be relatively smaller.
Figure 27 is a block diagram illustrating an example of the weight random access memory 124 of Figure 1 populated with input data upon which a pooling operation is performed by the neural network unit 121 of Figure 1. A pooling operation, performed by a pooling layer of an artificial neural network, reduces the size (dimensions) of a matrix of input data (e.g., an image or a convolved image) by taking subregions, or submatrices, of the input matrix, computing the maximum or average value of each submatrix, and forming a result matrix, or pooled matrix, from those maximum or average values. In the example of Figures 27 and 28, the pooling operation computes the maximum value of each submatrix. Pooling operations are particularly useful for artificial neural networks that perform, for example, object classification or detection. Generally, a pooling operation effectively reduces its input matrix by a factor of the number of elements in the submatrices examined, and in particular reduces the input matrix in each dimension by the number of elements in the corresponding dimension of the submatrices. In the example of Figure 27, the input data is a 512x1600 matrix of wide words (e.g., 16 bits) stored in rows 0 through 1599 of the weight random access memory 124. In Figure 27, the words are denoted by their row,column location; e.g., the word in row 0, column 0 is denoted D0,0; the word in row 0, column 1 is denoted D0,1; the word in row 0, column 2 is denoted D0,2; and so forth to the word in row 0, column 511, which is denoted D0,511. Similarly, the word in row 1, column 0 is denoted D1,0; the word in row 1, column 1 is denoted D1,1; the word in row 1, column 2 is denoted D1,2; and so forth to the word in row 1, column 511, which is denoted D1,511; and so forth to the word in row 1599, column 0, denoted D1599,0; the word in row 1599, column 1, denoted D1599,1; the word in row 1599, column 2, denoted D1599,2; and so forth to the word in row 1599, column 511, which is denoted D1599,511.
Figure 28 is a program listing of a neural network unit program that performs a pooling operation on the input data matrix of Figure 27 and writes it back to the weight random access memory 124. In the example of Figure 28, the pooling operation computes the maximum value of each 4x4 submatrix of the input data matrix. The program loops a number of times through a loop body of instructions at addresses 1 through 10. An initialize neural processing unit instruction at address 0 specifies the number of times each neural processing unit 126 executes the loop body, which in the example of Figure 28 has a loop count value of 400, and the loop instruction at the end of the loop (at address 11) decrements the current loop count value and, if the result is non-zero, causes control to return to the top of the loop body (i.e., to the instruction at address 1). The input data matrix in the weight random access memory 124 is effectively treated by the neural network unit program as 400 mutually exclusive groups of four adjacent rows, namely rows 0-3, rows 4-7, rows 8-11, and so forth to rows 1596-1599. Each group of four adjacent rows includes 128 4x4 submatrices, namely the 4x4 submatrices formed at the intersection of the four rows of the group and four adjacent columns, namely columns 0-3, columns 4-7, columns 8-11, and so forth to columns 508-511. Of the 512 neural processing units 126, every fourth one (i.e., 128 in all) performs a pooling operation on its respective 4x4 submatrix, and the other three-fourths of the neural processing units 126 are unused. More specifically, neural processing units 0, 4, 8, and so forth to neural processing unit 508 each perform a pooling operation on its respective 4x4 submatrix whose leftmost column number corresponds to the neural processing unit's number and whose lower row corresponds to the current weight random access memory 124 row value, which is initialized to zero by the initialize instruction at address 0 and is incremented by 4 upon each iteration of the loop body, as described in more detail below. The 400 iterations of the loop body correspond to the number of groups of 4x4 submatrices of the input data matrix of Figure 27 (i.e., the 1600 rows of the input data matrix divided by 4). The initialize neural processing unit instruction also clears the accumulator 202 to zero. For a preferred embodiment, the loop instruction at address 11 also clears the accumulator 202 to zero. Alternatively, the maxwacc instruction at address 1 specifies that the accumulator 202 be cleared to zero.
On each execution of the loop body of the program, the 128 used neural processing units 126 concurrently perform 128 pooling operations on the 128 respective 4x4 submatrices of the current four-row group of the input data matrix. More specifically, the pooling operation determines the maximum-valued element of the sixteen elements of the 4x4 submatrix. In the embodiment of Figure 28, for each neural processing unit y of the 128 used neural processing units 126, the lower-left element of its 4x4 submatrix is element Dx,y of Figure 27, where x is the current weight RAM 124 row number at the beginning of the loop body, which is read by the maxwacc instruction at address 1 of the program of Figure 28 (the row number is also initialized by the initialize neural processing unit instruction at address 0 and is incremented each time the maxwacc instructions at addresses 3, 5 and 7 are executed). Thus, for each iteration of the loop, the 128 used neural processing units 126 write back to the specified row of the weight RAM 124 the maximum-valued elements of the respective 128 4x4 submatrices of the current row group. The loop body is described below.
The maxwacc instruction at address 1 implicitly uses the current weight RAM 124 row, which is preferably held in the sequencer 128 (and which is initialized to zero by the instruction at address 0 for the first pass through the loop body). The instruction at address 1 causes each neural processing unit 126 to read its corresponding word from the current row of the weight RAM 124, compare the word to the accumulator 202 value 217, and store in the accumulator 202 the maximum of the two values. Thus, for example, neural processing unit 8 determines the maximum of the accumulator 202 value 217 and data word Dx,8 (where "x" is the current weight RAM 124 row) and writes it back to the accumulator 202.
At address 2 is a maxwacc instruction that specifies to rotate the value in the multiplexed register (mux-reg) 705 of each neural processing unit 126 to the adjacent neural processing unit 126, which in this case is the row of input data matrix values just read in from the weight RAM 124 in response to the instruction at address 1. In the embodiment of Figures 27 through 28, the neural processing units 126 rotate the mux-reg 705 values to the left, i.e., from neural processing unit J to neural processing unit J-1, as described above with respect to the sections corresponding to Figures 24 through 26. Additionally, the instruction specifies a count value of 3. Consequently, the instruction at address 2 causes each neural processing unit 126 to receive the rotated word into its mux-reg 705 and determine the maximum of the rotated word and the accumulator 202 value, and then to repeat this operation two more times. That is, each neural processing unit 126 performs three times the operation of receiving the rotated word into its mux-reg 705 and determining the maximum of the rotated word and the accumulator 202 value. Thus, for example, assuming the current weight RAM 124 row at the beginning of the loop body is 36, and taking neural processing unit 8 as an example, after executing the instructions at addresses 1 and 2, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the beginning of the loop and the four weight RAM 124 words D36,8, D36,9, D36,10 and D36,11.
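For illustration only, the effect of the instructions at addresses 1 and 2 can be modeled in software. The sketch below (function and variable names are hypothetical, not from the patent) treats the 512 mux-regs as a Python list that rotates toward lower indices, with a max taken after the row read and after each of the three rotations:

```python
# Hypothetical model of the address-1/address-2 maxwacc step: each of the
# 512 NPUs reads its word from the current weight-RAM row, then the
# mux-reg values rotate three times, each rotation followed by a max.
NUM_NPUS = 512

def maxwacc_row(row, accumulators):
    """row: 512 words of the current weight-RAM row;
    accumulators: per-NPU running maxima (mutated in place)."""
    muxregs = list(row)                     # address 1: read current row
    for j in range(NUM_NPUS):
        accumulators[j] = max(accumulators[j], muxregs[j])
    for _ in range(3):                      # address 2: COUNT = 3 rotations
        # the value moves from NPU J to NPU J-1 (rotate toward lower index)
        muxregs = muxregs[1:] + muxregs[:1]
        for j in range(NUM_NPUS):
            accumulators[j] = max(accumulators[j], muxregs[j])
    return accumulators
```

After one call, accumulator j holds the maximum of words j through j+3 of the row, matching the D36,8 through D36,11 example for neural processing unit 8.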
The operation performed by the maxwacc instructions at addresses 3 and 4 is similar to that of the instructions at addresses 1 and 2; however, by virtue of the weight RAM 124 row increment indicator, the instructions at addresses 3 and 4 operate on the next row of the weight RAM 124. That is, assuming the current weight RAM 124 row at the beginning of the loop body is 36, and taking neural processing unit 8 as an example, after completing the instructions at addresses 1 through 4, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the beginning of the loop and the eight weight RAM 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10 and D37,11.
The operation performed by the maxwacc instructions at addresses 5 through 8 is similar to that of the instructions at addresses 1 through 4; however, the instructions at addresses 5 through 8 operate on the next two rows of the weight RAM 124. That is, assuming the current weight RAM 124 row at the beginning of the loop body is 36, and taking neural processing unit 8 as an example, after completing the instructions at addresses 1 through 8, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the beginning of the loop and the sixteen weight RAM 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10, D37,11, D38,8, D38,9, D38,10, D38,11, D39,8, D39,9, D39,10 and D39,11. That is, after completing the instructions at addresses 1 through 8, neural processing unit 8 will have determined the maximum of the following 4x4 submatrix:
D36,8 D36,9 D36,10 D36,11
D37,8 D37,9 D37,10 D37,11
D38,8 D38,9 D38,10 D38,11
D39,8 D39,9 D39,10 D39,11
More generally, after completing the instructions at addresses 1 through 8, each of the 128 used neural processing units 126 will have determined the maximum of the following 4x4 submatrix:
Dr,n Dr,n+1 Dr,n+2 Dr,n+3
Dr+1,n Dr+1,n+1 Dr+1,n+2 Dr+1,n+3
Dr+2,n Dr+2,n+1 Dr+2,n+2 Dr+2,n+3
Dr+3,n Dr+3,n+1 Dr+3,n+2 Dr+3,n+3
where r is the weight RAM 124 row address value at the beginning of the loop body, and n is the neural processing unit 126 number.
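The overall computation the program performs can be modeled in software as follows. This is a functional sketch of the Figure 28 max-pooling program, not the hardware: it walks the matrix in groups of four rows and, for the active processing unit positions only, records the maximum of each 4x4 submatrix, leaving the other output words unused.

```python
# A minimal software model (not the hardware) of the Figure 28 program:
# 4x4 max pooling over a rows x cols matrix laid out like the weight RAM,
# producing one sparse result row per group of four input rows.
def pool_4x4_max(D):
    rows, cols = len(D), len(D[0])          # e.g., 1600 x 512
    out = []
    for r in range(0, rows, 4):             # one loop iteration per 4-row group
        result = [0] * cols                 # words 1-3, 5-7, ... stay unused
        for n in range(0, cols, 4):         # only NPUs 0, 4, 8, ..., 508 active
            result[n] = max(D[r + i][n + j]
                            for i in range(4) for j in range(4))
        out.append(result)
    return out
```

As in the program, the number of result rows is one-quarter the number of input rows, and only every fourth word of each result row is valid.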
The instruction at address 9 passes through the accumulator 202 value 217 through the activation function unit 212. The pass-through function passes through a word whose size (in bits) is equal to the size of the words read from the weight RAM 124 (in the example, 16 bits). Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail below.
The instruction at address 10 writes the accumulator 202 value 217 to the row of the weight RAM 124 specified by the current value of the output row register, which was initialized by the instruction at address 0 and which is incremented each pass through the loop by virtue of the increment indicator in the instruction. More specifically, the instruction at address 10 writes a wide word (e.g., 16 bits) of the accumulator 202 to the weight RAM 124. Preferably, the instruction writes the 16 bits as specified by the output binary point 2916, as described in more detail below with respect to Figures 29A and 29B.
As may be observed, each row written to the weight RAM 124 by an iteration of the loop body contains holes that hold invalid data. That is, wide words 1 through 3, 5 through 7, and so forth through wide words 509 through 511 of the result 133 are invalid, or unused. In one embodiment, the activation function unit 212 includes a mux that enables merging of the results into adjacent words of a row buffer, such as the row buffer 1104 of Figure 11, for writing back to the output weight RAM 124 row. Preferably, the activation function instruction specifies the number of words in each hole, and the hole word count controls the mux to merge the results. In one embodiment, the hole count may be specified as a value from 2 to 6 in order to merge the output of pooling 3x3, 4x4, 5x5, 6x6 or 7x7 submatrices. Alternatively, an architectural program executing on the processor 100 reads the resulting sparse (i.e., having holes) result rows from the weight RAM 124 and performs the merging function using other execution units 112, e.g., a media unit using architectural merge instructions, such as x86 Streaming SIMD Extensions (SSE) instructions, to perform the pooling function. In a concurrent manner similar to those described above, and exploiting the hybrid nature of the neural network unit 121, the architectural program executing on the processor 100 may read the status register 127 to monitor the most recently written row of the weight RAM 124 (e.g., field 2602 of Figure 26B) in order to read a sparse result row produced, merge it, and write it back to the same row of the weight RAM 124, such that it is ready to be used as an input data matrix for a next layer of the neural network, such as a convolution layer or a classic neural network layer (i.e., multiply-accumulate layer). Furthermore, although the embodiments described herein perform the pooling operation on 4x4 submatrices, the present invention is not limited thereto, and the neural network unit program of Figure 28 may be modified to perform the pooling operation on submatrices of other sizes, such as 3x3, 5x5, 6x6 or 7x7.
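For illustration, the host-side merge of a sparse result row can be sketched as a simple compaction, assuming the merge packs the valid words into adjacent positions (this compaction interpretation, and the function name, are assumptions for the sketch, not the patent's exact mechanism):

```python
# Hypothetical host-side compaction of a sparse pooled result row: only
# every (gap + 1)-th word is valid (gap = 3 for 4x4 pooling, per the
# hole-count range 2 to 6 for 3x3 through 7x7 pooling); squeeze the
# valid words into adjacent positions.
def compact_sparse_row(row, gap=3):
    return [row[i] for i in range(0, len(row), gap + 1)]
```

In hardware the analogous merge would be performed by the mux feeding the row buffer; in the alternative described above, an architectural program would do it with SIMD shuffle/merge instructions.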
As may also be observed, the number of result rows written to the weight RAM 124 is one-quarter the number of rows of the input data matrix. Finally, in the example, the data RAM 122 is not used. However, alternatively, the data RAM 122, rather than the weight RAM 124, may be used to perform the pooling operation.
In the embodiment of Figures 27 and 28, the pooling operation computes the maximum value of the subregion. However, the program of Figure 28 may be modified to compute the average value of the subregion, e.g., by replacing the maxwacc instructions with sumwacc instructions (which sum the weight word with the accumulator 202 value 217) and changing the activation function instruction at address 9 to divide the accumulated results by the number of elements of each subregion (preferably via a reciprocal multiply, as described below), which is sixteen in the example.
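The average-pooling variant just described can be sketched in software as follows (a functional model under the stated substitutions, not the hardware; the reciprocal multiply stands in for the modified address-9 activation function):

```python
# Sketch of the average-pooling variant: sumwacc accumulates the sixteen
# elements of each 4x4 submatrix, and the activation step divides by 16
# via a reciprocal multiply, as the text describes.
RECIP_16 = 1.0 / 16                         # user-specified reciprocal value

def pool_4x4_avg(D):
    rows, cols = len(D), len(D[0])
    out = []
    for r in range(0, rows, 4):
        result = [0.0] * cols
        for n in range(0, cols, 4):         # only every fourth NPU is used
            acc = sum(D[r + i][n + j] for i in range(4) for j in range(4))
            result[n] = acc * RECIP_16      # divide by the element count
        out.append(result)
    return out
```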
As may be observed from the operation of the neural network unit 121 according to Figures 27 and 28, each execution of the program of Figure 28 takes approximately 6000 clock cycles to perform a pooling operation on the entire 512x1600 data matrix of Figure 27, which is considerably fewer clock cycles than conventional methods require to perform a similar task.
Alternatively, rather than writing the results back to the weight RAM 124, the architectural program configures the neural network unit program to write the results of the pooling operation back to rows of the data RAM 122, and the architectural program reads the results from the data RAM 122 as the neural network unit 121 writes them (e.g., using the most recently written data RAM 122 row 2606 address of Figure 26B). This alternative may be advantageous in an embodiment with a single-ported weight RAM 124 and a dual-ported data RAM 122.
Fixed-point arithmetic with user-supplied binary points, full-precision fixed-point accumulation, user-specified reciprocal value, stochastic rounding of accumulator value, and selectable activation/output functions
Generally speaking, hardware units that perform arithmetic in digital computing systems may be divided into what are commonly termed "integer" units and "floating-point" units, according to whether the objects of the arithmetic are integers or floating-point numbers. A floating-point number has a magnitude (or mantissa) and an exponent, and typically a sign. The exponent is an indication of the location of the radix point (typically the binary point) relative to the magnitude. In contrast, an integer has no exponent, but only a magnitude, and typically a sign. A floating-point unit enables a programmer to obtain the numbers the program works with from a very large range of different values, and the hardware takes care of adjusting the exponent values of the numbers as needed, without the programmer having to do so. For example, assume the two floating-point numbers 0.111 × 10^29 and 0.81 × 10^31 are multiplied. (A decimal, or base-10, example is used here, although floating-point units most commonly work with base-2 floating-point numbers.) The floating-point unit automatically takes care of multiplying the mantissas, adding the exponents, and then normalizing the result back to the value .8991 × 10^59. For another example, assume the same two floating-point numbers are added. The floating-point unit automatically takes care of aligning the radix points of the mantissas before adding them to generate a resulting sum with the value .81111 × 10^31.
However, as is well known, such complex operations lead to increased floating-point unit size, increased power consumption, increased clocks per instruction and/or lengthened cycle times. For this reason, many devices (e.g., embedded processors, microcontrollers, and relatively low cost and/or low power microprocessors) do not include a floating-point unit. As may be observed from the examples above, the complex structure of a floating-point unit includes logic that performs the exponent calculations associated with floating-point addition and multiplication/division (i.e., adders that perform add/subtract operations on the exponents of the operands to produce the resulting exponent value of a floating-point multiplication/division, and subtractors that subtract the operand exponents to determine the binary point alignment shift amount for a floating-point addition), includes shifters that accomplish binary point alignment of the mantissas for floating-point addition, and includes shifters that normalize floating-point results. Additionally, logic is typically also required to perform rounding of floating-point results, conversions between integer and floating-point formats and between different floating-point precision formats (e.g., extended precision, double precision, single precision, half precision), leading zero and leading one detection, and handling of special floating-point numbers, such as denormal numbers, NaNs and infinity.
Furthermore, correctness verification of a floating-point unit greatly increases design complexity because of the increased numerical space that must be verified, which may lengthen the product development cycle and time to market. Still further, as described above, floating-point arithmetic implies the separate storage and use of the mantissa field and the exponent field of each floating-point number involved in the computation, which may increase the amount of storage required and/or reduce precision given an equal amount of storage used to store integers. Many of these disadvantages are avoided by performing arithmetic operations with integer units.
Frequently, programmers need to write programs that process fractional numbers, i.e., numbers that are not whole numbers. Such programs may need to run on processors that do not have a floating-point unit, or the processor may have one but it may be faster to execute integer instructions with the integer unit of the processor. To take advantage of the performance benefits of integer processors, programmers employ what is known as fixed-point arithmetic on fixed-point numbers. Such programs include instructions that execute on integer units to process integer, or integer data. The software is aware that the data is fractional and includes instructions that perform operations on the integer data to deal with the fact that the data is actually fractional, e.g., alignment shifts. Essentially, the fixed-point software manually performs some or all of the functionality that a floating-point unit performs.
As used herein, a "fixed-point" number (or value or operand or input or output) is a number whose bits of storage are understood to include bits that represent a fractional portion of the fixed-point number, referred to herein as "fractional bits." The bits of storage of the fixed-point number are held in a memory or register, e.g., an 8-bit or 16-bit word in a memory or register. Furthermore, the bits of storage of the fixed-point number are all used to express a magnitude, and in some cases one of the bits is used to express a sign, but none of the bits of storage of the fixed-point number is used to express an exponent of the number. Furthermore, the number of fractional bits, or binary point location, of the fixed-point number is specified in storage that is distinct from the bits of storage of the fixed-point number, and the storage holding the number of fractional bits, or binary point location, indicates it in a shared, or global, fashion for a set of fixed-point numbers to which the fixed-point number belongs, such as the set of input operands, accumulated values, or output results of an array of processing units.
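The convention just defined can be illustrated with a short sketch (for illustration only; function names are hypothetical): the stored bits are a plain integer, and a single fraction-bit count, held outside the numbers themselves, says where the binary point sits for every number in the set.

```python
# Illustration of the fixed-point convention described above: the stored
# bits are an ordinary integer; a shared fraction-bit count (kept in
# separate storage) determines where the binary point lies.
def to_real(stored_bits, frac_bits):
    """Interpret the stored integer bits given the shared fraction-bit count."""
    return stored_bits / (1 << frac_bits)

def to_fixed(value, frac_bits):
    """Encode a real value as integer bits under the shared fraction-bit count."""
    return round(value * (1 << frac_bits))
```

For example, with 3 fractional bits the real value 3.375 is stored as the integer 27, and the integer 27 is read back as 3.375; no exponent is stored with the number itself.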
In the embodiments described herein, the ALUs are integer units, but the activation function units include fixed-point arithmetic hardware assist, or acceleration. This enables the ALU portions to be smaller and faster, which facilitates having more ALUs within a given space on the chip. This implies more neurons per unit of chip space, which is particularly advantageous in a neural network unit.
Furthermore, whereas each floating-point number requires its own exponent storage bits, the fixed-point numbers of the embodiments described herein use an indicator that indicates the number of storage bits that are fractional bits for an entire set of numbers; however, the indicator is held in a single, shared, global storage location that globally indicates the number of fractional bits for all the numbers of the entire set, e.g., the set of inputs of a series of operations, the set of accumulated values of the series, the set of outputs. Preferably, the user of the neural network unit is enabled to specify the number of fractional storage bits for the set of numbers. Thus, it should be understood that although in many contexts (e.g., mathematics) the term "integer" refers to a signed whole number, i.e., a number without a fractional portion, in the present context the term "integer" may refer to numbers that have a fractional portion. Furthermore, in the present context, the term "integer" is intended to distinguish from floating-point numbers, for which a portion of the bits of each number's individual storage are used to express an exponent of the floating-point number. Similarly, an integer arithmetic operation, such as an integer multiply or add or compare performed by an integer unit, assumes the operands do not have an exponent; therefore, the integer elements of the integer unit, e.g., the integer multiplier, integer adder and integer comparator, do not include logic to deal with exponents, e.g., they do not shift mantissas to align binary points for addition or compare operations, and they do not add exponents for multiply operations.
Additionally, the embodiments described herein include a large hardware integer accumulator to accumulate a large series of integer operations (e.g., on the order of 1000 multiply-accumulates) without loss of accuracy. This enables the neural network unit to avoid dealing with floating-point numbers while at the same time retaining full precision in the accumulated values, without saturating them or producing inaccurate results due to overflows. Once the series of integer operations has accumulated a result into the full-precision accumulator, the fixed-point hardware assist performs the necessary scaling and saturating to convert the full-precision accumulated value to an output value, using the user-specified indication of the number of fractional bits of the accumulated value and the desired number of fractional bits of the output value, as described in more detail below.
When compressing the accumulated value from its full-precision form, for use as an input to an activation function or for being passed through, preferably the activation function units may selectively perform stochastic rounding on the accumulated value, as described in more detail below. Finally, the neural processing units may selectively accept indications to apply different activation functions and/or output a variety of different forms of the accumulated value, as dictated by the different needs of a given layer of a neural network.
Figure 29A is a block diagram illustrating an embodiment of the control register 127 of Figure 1. The control register 127 may comprise a plurality of control registers 127. As shown, the control register 127 includes the following fields: configuration 2902, signed data 2912, signed weights 2914, data binary point 2922, weight binary point 2924, ALU function 2926, round control 2932, activation function 2934, reciprocal 2942, shift amount 2944, output RAM 2952, output binary point 2954, and output command 2956. The control register 127 values may be written both by an MTNN instruction 1400 and by an instruction of an NNU program, such as an initialize instruction.
The configuration 2902 value specifies whether the neural network unit 121 is in a narrow configuration, a wide configuration or a funnel configuration, as described above. The configuration 2902 also implies the size of the input words received from the data RAM 122 and the weight RAM 124. In the narrow and funnel configurations, the size of the input words is narrow (e.g., 8 bits or 9 bits), whereas in the wide configuration, the size of the input words is wide (e.g., 12 bits or 16 bits). Furthermore, the configuration 2902 implies the size of the output result 133, which is the same as the input word size.
The signed data value 2912, if true, indicates that the data words received from the data RAM 122 are signed values, and if false, indicates that they are unsigned values. The signed weights value 2914, if true, indicates that the weight words received from the weight RAM 124 are signed values, and if false, indicates that they are unsigned values.
The data binary point 2922 value indicates the location of the binary point for the data words received from the data RAM 122. Preferably, the data binary point 2922 value indicates the number of bit positions from the right at which the binary point is located. Stated alternatively, the data binary point 2922 indicates how many of the least significant bits of the data word are fractional bits, i.e., to the right of the binary point. Similarly, the weight binary point 2924 value indicates the location of the binary point for the weight words received from the weight RAM 124. Preferably, when the ALU function 2926 is a multiply-accumulate or output accumulator, the neural processing unit 126 determines the number of bits to the right of the binary point for the value held in the accumulator 202 as the sum of the data binary point 2922 and the weight binary point 2924. Thus, for example, if the value of the data binary point 2922 is 5 and the value of the weight binary point 2924 is 3, the value in the accumulator 202 has 8 bits to the right of the binary point. When the ALU function 2926 is a sum/maximum of accumulator and data/weight word or a pass-through of data/weight word, the neural processing unit 126 determines the number of bits to the right of the binary point for the value held in the accumulator 202 as the data/weight binary point 2922/2924, respectively. In an alternate embodiment, a single accumulator binary point 2923 is specified, rather than specifying an individual data binary point 2922 and weight binary point 2924, as described in more detail below with respect to Figure 29B.
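The binary-point rule just described can be checked with a short worked example (a sketch for illustration; the helper name is hypothetical): multiplying a value with 5 fractional bits by a value with 3 fractional bits yields a product with 5 + 3 = 8 fractional bits, so a plain integer multiply suffices with no point alignment.

```python
# The multiply-accumulate binary-point rule from the text: the product of
# a value with 5 fraction bits and a value with 3 fraction bits has
# 5 + 3 = 8 fraction bits, so integer hardware needs no point alignment.
def fixed_mul(a_bits, a_frac, b_bits, b_frac):
    """Multiply two fixed-point values given as (stored integer bits, fraction-bit count)."""
    return a_bits * b_bits, a_frac + b_frac

# 1.5 with 5 fraction bits is stored as 48; 2.25 with 3 fraction bits as 18
product_bits, product_frac = fixed_mul(48, 5, 18, 3)
```

Here 48/2^5 = 1.5 and 18/2^3 = 2.25, and the integer product 864 read with 8 fractional bits is 864/2^8 = 3.375 = 1.5 × 2.25, matching the rule.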
The ALU function 2926 specifies the function performed by the ALU 204 of the neural processing unit 126. As described above, the ALU functions 2926 may include, but are not limited to: multiply data word 209 and weight word 203 and accumulate the product with accumulator 202; sum accumulator 202 and weight word 203; sum accumulator 202 and data word 209; maximum of accumulator 202 and data word 209; maximum of accumulator 202 and weight word 203; output accumulator 202; pass through data word 209; pass through weight word 203; output zero. In one embodiment, the ALU function 2926 is specified by an NNU initialize instruction and is used by the ALU 204 in response to an execute instruction (not shown). In one embodiment, the ALU function 2926 is specified by individual NNU instructions, such as the multiply-accumulate and maxwacc instructions described above.
The round control 2932 specifies which form of rounding the rounder 3004 (of Figure 30) uses. In one embodiment, the rounding modes that may be specified include, but are not limited to: no rounding, round to nearest, and stochastic rounding. Preferably, the processor 100 includes a random bit source 3003 (see Figure 30) that generates random bits 3005 that are sampled and used to perform the stochastic rounding in order to reduce the likelihood of a rounding bias. In one embodiment, when the round bit is a one and the sticky bit is zero, the neural processing unit 126 rounds up if the sampled random bit 3005 is true and does not round up if the random bit 3005 is false. In one embodiment, the random bit source 3003 generates the random bits 3005 based on a sampling of random electrical characteristics of the processor 100, such as thermal noise across a semiconductor diode or resistor, although the present invention is not limited thereto.
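Stochastic rounding can be illustrated with a simplified software model. Note the hedge: the text describes a round-bit/sticky-bit mechanism; the sketch below instead models the more general probability-proportional view (round up with probability equal to the discarded fraction), which conveys why the scheme avoids a systematic bias; the function name and `rng` parameter are assumptions of the sketch.

```python
import random

# Simplified model of stochastic rounding when dropping `shift` low-order
# bits of an accumulator value: round up with probability equal to the
# discarded fraction, so rounding errors average out to zero.
def stochastic_round_shift(value, shift, rng=random.random):
    truncated = value >> shift
    discarded = value & ((1 << shift) - 1)   # the bits being dropped
    if rng() < discarded / (1 << shift):     # probability = dropped fraction
        truncated += 1
    return truncated
```

With `shift = 2`, the value 13 (binary 1101) truncates to 3 and rounds up to 4 one quarter of the time, since the discarded fraction is 1/4.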
The activation function 2934 specifies the function applied to the accumulator 202 value 217 to generate the output 133 of the neural processing unit 126. As described herein, the activation functions 2934 include, but are not limited to: sigmoid; hyperbolic tangent; softplus; rectify; divide by a specified power of two; multiply by a user-specified reciprocal value to accomplish an effective division; pass through the full accumulator; and pass through the accumulator as a canonical size, as described in more detail below. In one embodiment, the activation function is specified by an NNU activation function instruction. Alternatively, the activation function is specified by the initialize instruction and applied in response to an output instruction, e.g., the activation function unit output instruction at address 4 of Figure 4, in which embodiment the activation function instruction at address 3 of Figure 4 is subsumed into the output instruction.
The reciprocal 2942 value specifies a value that is multiplied by the accumulator 202 value 217 to accomplish a division of the accumulator 202 value 217. That is, the user specifies the reciprocal 2942 value as the reciprocal of the divisor actually desired. This is useful, e.g., in conjunction with the convolution or pooling operations described herein. Preferably, the user specifies the reciprocal 2942 value in two parts, as described in more detail below with respect to Figure 29C. In one embodiment, the control register 127 includes a field (not shown) that enables the user to specify division by one of a plurality of built-in divisor values whose sizes correspond to the sizes of conventional convolution kernels, e.g., 9, 25, 36 or 49. In such an embodiment, the activation function unit 212 may store the reciprocals of the built-in divisors for multiplication by the accumulator 202 value 217.
The shift amount 2944 specifies the number of bits by which a shifter of the activation function unit 212 shifts the accumulator 202 value 217 right to accomplish a division by a power of two. This is useful in conjunction with convolution kernels whose size is a power of two.
The output RAM 2952 value specifies which one of the data RAM 122 and the weight RAM 124 is to receive the output result 133.
The output binary point 2954 value indicates the location of the binary point of the output result 133. Preferably, the output binary point 2954 value indicates the number of bit positions from the right for the location of the binary point of the output result 133. In other words, the output binary point 2954 indicates how many of the least significant bits of the output result 133 are fraction bits, i.e., lie to the right of the binary point. The activation function unit 212 performs rounding, compression, saturation and size conversion based on the value of the output binary point 2954 (as well as, in most cases, on the values of the data binary point 2922, the weight binary point 2924, the activation function 2934 and/or the configuration 2902).
The output command 2956 controls the output result 133 in several respects. In one embodiment, the activation function unit 212 employs the notion of a canonical size, which is twice the width (in bits) specified by the configuration 2902. Thus, for example, if the configuration 2902 specifies that the size of the input words received from the data RAM 122 and the weight RAM 124 is 8 bits, the canonical size is 16 bits; for another example, if the configuration 2902 specifies that the size of the input words received from the data RAM 122 and the weight RAM 124 is 16 bits, the canonical size is 32 bits. As described herein, the accumulator 202 is larger (e.g., the narrow accumulator 202B is 28 bits and the wide accumulator 202A is 41 bits) in order to preserve full precision of the intermediate computations, e.g., of 1024 and 512 neural-network-unit multiply-accumulate instructions. Consequently, the accumulator 202 value 217 is larger (in bits) than the canonical size, and for most values of the activation function 2934 (except pass-through of the full accumulator), the activation function unit 212 (e.g., the canonical size compressor 3008 described below with respect to Figure 30) compresses the accumulator 202 value 217 down to the canonical size. A first predetermined value of the output command 2956 instructs the activation function unit 212 to perform the specified activation function 2934 to generate an internal result and output it as the output result 133, the internal result being equal in size to the original input words, i.e., half the canonical size. A second predetermined value of the output command 2956 instructs the activation function unit 212 to perform the specified activation function 2934 to generate an internal result twice the size of the original input words, i.e., the canonical size, and output the lower half of the internal result as the output result 133; and a third predetermined value of the output command 2956 instructs the activation function unit 212 to output the upper half of the canonical-size internal result as the output result 133. A fourth predetermined value of the output command 2956 instructs the activation function unit 212 to output the raw least significant word of the accumulator 202 as the output result 133; a fifth predetermined value of the output command 2956 instructs the activation function unit 212 to output the raw middle significant word of the accumulator 202 as the output result 133; and a sixth predetermined value of the output command 2956 instructs the activation function unit 212 to output the raw most significant word of the accumulator 202 (whose width is specified by the configuration 2902) as the output result 133, as described in more detail above with respect to Figures 8 through 10. As described above, outputting the full accumulator 202 size or the canonical-size internal result is advantageous because it enables other execution units 112 of the processor 100 to perform activation functions, such as the softmax activation function.
Although the fields described with respect to Figure 29A (and Figures 29B and 29C) reside in the control register 127, the invention is not limited in this respect, and one or more of the fields may reside in other parts of the neural network unit 121. Preferably, many of the fields may instead be included within the neural network unit instructions themselves and decoded by the sequencer 128 to generate micro-instructions 3416 (see Figure 34) that control the ALU 204 and/or the activation function unit 212. Additionally, the fields may be included in micro-operations 3414 (see Figure 34) stored in the media registers 118 that control the ALU 204 and/or the activation function unit 212. Such embodiments can reduce the use of the initialize-neural-network-unit instruction, and in other embodiments the initialize-neural-network-unit instruction may be eliminated.
As described above, a neural network unit instruction can specify that an arithmetic/logic operation be performed on memory operands (e.g., a word from the data RAM 122 and/or the weight RAM 124) or on a rotated operand (e.g., from the mux-regs 208/705). In one embodiment, a neural network unit instruction may also specify an operand as the registered output of an activation function (e.g., the output of the register 3038 of Figure 30). Additionally, as described above, a neural network unit instruction can specify that a current row address of the data RAM 122 or of the weight RAM 124 be incremented. In one embodiment, the instruction may specify an immediate signed integer delta that is added to the current row, in order to accomplish incrementing or decrementing by a value other than one.
Figure 29B is a block diagram illustrating another embodiment of the control register 127 of Figure 1. The control register 127 of Figure 29B is similar to the control register 127 of Figure 29A; however, the control register 127 of Figure 29B also includes an accumulator binary point 2923. The accumulator binary point 2923 indicates the binary point location of the accumulator 202. Preferably, the accumulator binary point 2923 value indicates the number of bit positions from the right for this binary point location. In other words, the accumulator binary point 2923 indicates how many of the least significant bits of the accumulator 202 are fraction bits, i.e., lie to the right of the binary point. In this embodiment, the accumulator binary point 2923 is specified explicitly, rather than being determined implicitly as in the embodiment of Figure 29A.
Figure 29C is a block diagram illustrating an embodiment in which the reciprocal 2942 of Figure 29A is stored in two parts. The first part 2962 is a shift value that indicates the number 2962 of suppressed leading zeroes in the true reciprocal value that the user desires to multiply by the accumulator 202 value 217. The number of leading zeroes is the number of consecutive zeroes immediately to the right of the binary point. The second part 2964 is the leading-zero-suppressed reciprocal value, i.e., the true reciprocal value with all of its leading zeroes removed. In one embodiment, the suppressed-leading-zeroes number 2962 is stored as four bits and the leading-zero-suppressed reciprocal value 2964 is stored as an 8-bit unsigned value.

To illustrate by example, assume the user desires to multiply the accumulator 202 value 217 by the reciprocal of the value 49. The binary representation of the reciprocal of 49, expressed with 13 fraction bits, is 0.0000010100111, which has five leading zeroes. Accordingly, the user populates the suppressed-leading-zeroes number 2962 with the value 5 and populates the leading-zero-suppressed reciprocal value 2964 with the value 10100111. After the reciprocal multiplier ("divider A") 3014 (see Figure 30) multiplies the accumulator 202 value 217 by the leading-zero-suppressed reciprocal value 2964, it shifts the resulting product right by the suppressed-leading-zeroes number 2962. Such an embodiment advantageously achieves high precision while representing the reciprocal 2942 value with a relatively small number of bits.
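The 49 example can be reproduced with a short sketch; the helper names and the decode-side scaling are my own construction under the stated 4-bit/8-bit assumption, not the hardware's exact datapath:

```python
# Sketch of the two-part reciprocal encoding of Figure 29C (fields 2962
# and 2964): count leading zeroes of 1/divisor, keep 8 bits after them.

def encode_reciprocal(divisor: int, value_bits: int = 8):
    recip = 1.0 / divisor
    leading = 0                     # suppressed leading zeroes (field 2962)
    while recip < 0.5:
        recip *= 2.0
        leading += 1
    value = round(recip * (1 << value_bits))  # zero-suppressed value (2964)
    return leading, value

def multiply_by_reciprocal(acc: int, leading: int, value: int,
                           value_bits: int = 8) -> int:
    # Multiply, then shift right by the suppressed leading zeroes
    # (plus the fraction bits of the stored value).
    return (acc * value) >> (leading + value_bits)

leading, value = encode_reciprocal(49)
assert (leading, value) == (5, 0b10100111)   # matches the text's example
# 49000/49 = 1000 exactly; the 13-bit reciprocal truncates slightly:
assert multiply_by_reciprocal(49000, leading, value) == 998
```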
Figure 30 is a block diagram illustrating in more detail an embodiment of the activation function unit 212 of Figure 2. The activation function unit 212 includes the control register 127 of Figure 1; a positive form converter (PFC) and output binary point aligner (OBPA) 3002 that receives the accumulator 202 value 217; a rounder 3004 that receives the accumulator 202 value 217 and an indication of the number of bits shifted out by the output binary point aligner 3002; a random bit source 3003, as described above, that generates random bits 3005; a first mux 3006 that receives the outputs of the positive form converter and output binary point aligner 3002 and of the rounder 3004; a canonical size compressor (CSC) and saturator 3008 that receives the output of the first mux 3006; a bit selector and saturator 3012 that receives the output of the canonical size compressor and saturator 3008; a rectifier 3018 that receives the output of the canonical size compressor and saturator 3008; a reciprocal multiplier 3014 that receives the output of the canonical size compressor and saturator 3008; a right shifter 3016 that receives the output of the canonical size compressor and saturator 3008; a hyperbolic tangent (tanh) module 3022 that receives the output of the bit selector and saturator 3012; a sigmoid module 3024 that receives the output of the bit selector and saturator 3012; a softplus module 3026 that receives the output of the bit selector and saturator 3012; a second mux 3032 that receives the outputs of the tanh module 3022, the sigmoid module 3024, the softplus module 3026, the rectifier 3018, the reciprocal multiplier 3014 and the right shifter 3016, as well as the canonical-size passed-through output 3028 of the canonical size compressor and saturator 3008; a sign restorer 3034 that receives the output of the second mux 3032; a size converter and saturator 3036 that receives the output of the sign restorer 3034; a third mux 3037 that receives the output of the size converter and saturator 3036 and the accumulator output 217; and an output register 3038 that receives the output of the mux 3037, and whose output is the result 133 of Figure 1.
The positive form converter and output binary point aligner 3002 receives the accumulator 202 value 217. Preferably, as described above, the accumulator 202 value 217 is a full-precision value. That is, the accumulator 202 has a sufficient number of storage bits to hold an accumulation that is the sum, generated by the integer adder 244, of a series of products generated by the integer multiplier 242, without discarding any bits of the individual products of the multiplier 242 or of any of the sums of the adder 244, so that full accuracy is maintained. Preferably, the accumulator 202 has at least enough bits to hold the maximum number of product accumulations that the neural network unit 121 is programmable to perform. For example, referring to the program of Figure 4, when in the wide configuration, the maximum number of product accumulations the neural network unit 121 is programmable to perform is 512, and the accumulator 202 bit width is 41. For another example, referring to the program of Figure 20, when in the narrow configuration, the maximum number of product accumulations the neural network unit 121 is programmable to perform is 1024, and the accumulator 202 bit width is 28. To generalize, the full-precision accumulator 202 has at least Q bits, where Q is the sum of M and log2(P), where M is the bit width of the integer product of the multiplier 242 (e.g., 16 bits for the narrow multiplier 242, or 32 bits for the wide multiplier 242) and P is the maximum allowable number of products that may be accumulated into the accumulator 202. Preferably, the maximum number of product accumulations is specified by programming convention to the programmer of the neural network unit 121. In one embodiment, the sequencer 128 enforces a maximum value for the count of a multiply-accumulate neural network unit instruction (e.g., the instruction at address 2 of Figure 4) of, for example, 511, on the assumption of one previous multiply-accumulate instruction that loads the row of data/weight words 206/207 from the data/weight RAM 122/124 (e.g., the instruction at address 1 of Figure 4).
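The width rule Q = M + log2(P) can be checked against the two configurations above (a small sketch; the function name is illustrative):

```python
# Sketch of the full-precision accumulator sizing rule: at least
# Q = M + log2(P) bits, where M is the product width and P the maximum
# number of products accumulated.
import math

def min_accumulator_bits(product_bits: int, max_products: int) -> int:
    return product_bits + math.ceil(math.log2(max_products))

assert min_accumulator_bits(32, 512) == 41    # wide: matches the 41-bit accumulator
assert min_accumulator_bits(16, 1024) == 26   # narrow: the 28-bit accumulator suffices
```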
Advantageously, by employing an accumulator 202 that has a bit width large enough to accumulate at full precision the maximum allowable number of accumulations, the design of the ALU 204 of the neural processing unit 126 is simplified. In particular, it alleviates the need for logic to saturate the sums generated by the integer adder 244, which would overflow a smaller accumulator and which would require keeping track of the binary point location of the accumulator to determine whether an overflow had occurred in order to know whether saturation was needed. To illustrate by example a problem with a design that includes a non-full-precision accumulator and instead includes saturating logic to handle overflows of the non-full-precision accumulator, assume the following.

(1) The range of the data word values is between 0 and 1 and all of the storage bits are used to store fraction bits. The range of the weight word values is between -8 and +8 and all but three of the storage bits are used to store fraction bits. And the range of the accumulated values that are input to a hyperbolic tangent activation function is between -8 and +8 and all but three of the storage bits are used to store fraction bits.

(2) The bit width of the accumulator is non-full-precision (e.g., only the bit width of the products).

(3) The final accumulated value would be somewhere between -8 and +8 (e.g., +4.2), assuming the accumulator were full precision; however, the products before a "point A" in the series tend relatively frequently to be positive, whereas the products after point A tend relatively frequently to be negative.

In such a situation, an inaccurate result (i.e., a result other than +4.2) might be obtained. This is because, at some point before point A, the accumulator would need to reach a value that exceeds its saturation maximum of +8, e.g., +8.2, and the extra 0.2 would be lost. The accumulator could even remain at the saturated value for the remaining product accumulations, thereby losing even more positive value. Thus, the final value of the accumulator could be a smaller number (i.e., less than +4.2) than it would be if the accumulator had a full-precision bit width.
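The hazard described in (3) can be demonstrated numerically (a hedged sketch: floating point stands in for fixed-point, and the ±8 range is the one assumed above):

```python
# Sketch of the saturation hazard: positive products first push a
# saturating accumulator past its +8 ceiling, the excess is lost, and
# later negative products then leave a too-small total.

def saturating_add(acc: float, x: float, lo: float = -8.0, hi: float = 8.0) -> float:
    return max(lo, min(hi, acc + x))

products = [4.0, 4.2, -4.0]   # exact running sums: 4.0, 8.2, 4.2
sat_acc = 0.0
for p in products:
    sat_acc = saturating_add(sat_acc, p)

assert abs(sum(products) - 4.2) < 1e-9   # full-precision result: +4.2
assert sat_acc == 4.0                     # saturating accumulator lost 0.2
```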
The positive form converter 3002 converts the accumulator 202 value 217 to a positive form when it is negative, and generates an additional bit indicating whether the original value was positive or negative, which bit is passed down the pipeline of the activation function unit 212 along with the value. Converting negative values to a positive form simplifies subsequent operations of the activation function unit 212. For example, after the conversion, only positive values enter the tanh module 3022 and the sigmoid module 3024, which simplifies the design of those modules. Additionally, the rounder 3004 and the saturator 3008 are simplified.
The output binary point aligner 3002 shifts, i.e., scales, the positive-form value right so as to align it with the output binary point 2954 specified in the control register 127. Preferably, the output binary point aligner 3002 computes as the shift amount the difference between the number of fraction bits of the accumulator 202 value 217 (e.g., as specified by the accumulator binary point 2923, or by the sum of the data binary point 2922 and the weight binary point 2924) and the number of fraction bits of the output (as specified by the output binary point 2954). Thus, for example, if the accumulator 202 binary point 2923 is 8 (as in the embodiment above) and the output binary point 2954 is 3, the output binary point aligner 3002 shifts the positive-form value right 5 bits to generate the result provided to the mux 3006 and to the rounder 3004.
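The alignment arithmetic can be sketched in a few lines (an illustrative model, not the hardware pipeline):

```python
# Sketch of the output binary point aligner 3002: shift right by
# (accumulator fraction bits - output fraction bits).

def align_binary_point(acc: int, acc_frac_bits: int, out_frac_bits: int) -> int:
    return acc >> (acc_frac_bits - out_frac_bits)

# 110.10100000 with 8 fraction bits, realigned to 3 fraction bits -> 110.101
acc = 0b11010100000
assert align_binary_point(acc, 8, 3) == 0b110101
assert (acc / 2**8) == (0b110101 / 2**3) == 6.625   # same real value
```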
The rounder 3004 rounds the accumulator 202 value 217. Preferably, the rounder 3004 generates a rounded version of the positive-form value generated by the positive form converter and output binary point aligner 3002 and provides the rounded version to the mux 3006. The rounder 3004 rounds according to the rounding control 2932 described above, which may include stochastic rounding using the random bits 3005, as described herein. The mux 3006 selects one of its inputs, namely either the positive-form value from the positive form converter and output binary point aligner 3002 or the rounded version of it from the rounder 3004, according to the rounding control 2932 (which, as described herein, may specify stochastic rounding), and provides the selected value to the canonical size compressor and saturator 3008. Preferably, if the rounding control 2932 specifies no rounding, the mux 3006 selects the output of the positive form converter and output binary point aligner 3002, and otherwise selects the output of the rounder 3004. Other embodiments are contemplated in which the activation function unit 212 performs additional rounding. For example, in one embodiment, the bit selector 3012 rounds based on the lost low-order bits when it compresses the output of the canonical size compressor and saturator 3008 (described below). For another example, in one embodiment, the product of the reciprocal multiplier 3014 (described below) is rounded. For another example, in one embodiment, the size converter 3036 rounds when it converts to the proper output size (described below), which conversion may involve losing low-order bits that are used in the rounding determination.
The canonical size compressor 3008 compresses the mux 3006 output value to the canonical size. Thus, for example, if the neural processing unit 126 is in the narrow or funnel configuration 2902, the canonical size compressor 3008 compresses the 28-bit mux 3006 output value to 16 bits; and if the neural processing unit 126 is in the wide configuration 2902, the canonical size compressor 3008 compresses the 41-bit mux 3006 output value to 32 bits. However, before compressing to the canonical size, if the pre-compressed value is larger than the maximum value expressible in the canonical form, the saturator 3008 saturates the pre-compressed value to the maximum value expressible in the canonical form. For example, if any of the bits of the pre-compressed value to the left of the most significant canonical form bit has a value of 1, the saturator 3008 saturates to the maximum value (e.g., to all 1s).
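A simplified model of this compress-and-saturate step for the positive-form values it receives (a sketch only; the real unit also tracks the binary point):

```python
# Sketch of the canonical size compressor and saturator 3008: clip a
# non-negative value to the canonical maximum so it fits in the
# canonical bit width.

def compress_saturate(value: int, canonical_bits: int) -> int:
    max_val = (1 << canonical_bits) - 1
    return min(value, max_val)      # saturate to all 1s on overflow

assert compress_saturate(0x1ABCD, 16) == 0xFFFF   # oversized value saturates
assert compress_saturate(0x0ABC, 16) == 0x0ABC    # in-range value passes through
```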
Preferably, the tanh module 3022, the sigmoid module 3024 and the softplus module 3026 each comprise lookup tables, e.g., programmable logic arrays (PLA), read-only memories (ROM), combinational logic gates, and so forth. In one embodiment, in order to simplify and reduce the size of the modules 3022/3024/3026, the input value provided to them has the form 3.4, i.e., three whole bits and four fraction bits, i.e., the input value has four bits to the right of the binary point and three bits to the left of the binary point. These values are chosen because at the extremes of the input value range (-8, +8) of the 3.4 form, the output values asymptotically approach their minimum/maximum values. However, the invention is not limited in this respect, and other embodiments are contemplated that place the binary point at a different location, e.g., in a 4.3 form or a 2.5 form. The bit selector 3012 selects the bits of the canonical size compressor and saturator 3008 output that satisfy the 3.4 form criteria, which involves compression, i.e., some bits are lost, since the canonical form has a larger number of bits. However, before selecting/compressing the canonical size compressor and saturator 3008 output value, if the pre-compressed value is greater than the maximum value expressible in the 3.4 form, the saturator 3012 saturates the pre-compressed value to the maximum value expressible in the 3.4 form. For example, if any of the bits of the pre-compressed value to the left of the most significant 3.4 form bit has a value of 1, the saturator 3012 saturates to the maximum value (e.g., to all 1s).
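The 3.4-form selection can be sketched for non-negative values (illustrative only; the choice of which fraction bits to drop follows the Figure 31 example below):

```python
# Sketch of the bit selector and saturator 3012: reduce a non-negative
# value with `frac_bits` fraction bits to 3.4 form (3 whole bits,
# 4 fraction bits), saturating if it does not fit in 7 bits.

def select_3_4(value: int, frac_bits: int) -> int:
    v = value >> (frac_bits - 4)    # keep only 4 fraction bits
    return min(v, (1 << 7) - 1)     # saturate to 111.1111 on overflow

# A canonical value 1.1101011 (7 fraction bits) becomes 001.1101:
assert select_3_4(0b11101011, 7) == 0b0011101
# An out-of-range value saturates to 111.1111:
assert select_3_4(0b10100000000, 7) == 0b1111111
```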
The tanh module 3022, the sigmoid module 3024 and the softplus module 3026 perform their respective activation functions (described above) on the 3.4 form value output by the canonical size compressor and saturator 3008 to generate a result. Preferably, the results of the tanh module 3022 and of the sigmoid module 3024 are 7-bit results in a 0.7 form, i.e., zero whole bits and seven fraction bits, i.e., the output value has seven bits to the right of the binary point. Preferably, the result of the softplus module 3026 is a 7-bit result in a 3.4 form, i.e., in the same form as the input to the module 3026. Preferably, the outputs of the tanh module 3022, the sigmoid module 3024 and the softplus module 3026 are extended to canonical form (e.g., with leading zeroes added as necessary) and aligned so as to have the binary point specified by the output binary point 2954 value.
The rectifier 3018 generates a rectified version of the output value of the canonical size compressor and saturator 3008. That is, if the output value of the canonical size compressor and saturator 3008 is negative (as indicated by its sign passed down the pipeline, as described above), the rectifier 3018 outputs a value of zero; otherwise, the rectifier 3018 passes its input value through to its output. Preferably, the output of the rectifier 3018 is in canonical form and has the binary point specified by the output binary point 2954 value.
The reciprocal multiplier 3014 multiplies the output of the canonical size compressor and saturator 3008 by the user-specified reciprocal value specified in the reciprocal 2942 to generate a canonical-size product, which is effectively the quotient of the output of the canonical size compressor and saturator 3008 and a divisor that is the reciprocal of the reciprocal 2942 value. Preferably, the output of the reciprocal multiplier 3014 is in canonical form and has the binary point specified by the output binary point 2954 value.
The right shifter 3016 shifts the output of the canonical size compressor and saturator 3008 right by the user-specified number of bits specified in the shift amount 2944 to generate a canonical-size quotient. Preferably, the output of the right shifter 3016 is in canonical form and has the binary point specified by the output binary point 2954 value.
The mux 3032 selects the appropriate input specified by the activation function 2934 value and provides the selection to the sign restorer 3034, which converts the positive-form output of the mux 3032 back to a negative form, e.g., to two's complement form, if the original accumulator 202 value 217 was a negative value.
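The sign restoration step can be sketched as a two's-complement negate (an illustrative model; the helper name is invented):

```python
# Sketch of the sign restorer 3034: reattach the sign that was stripped
# by the positive form converter, yielding a two's-complement word.

def restore_sign(magnitude: int, was_negative: bool, bits: int) -> int:
    if not was_negative:
        return magnitude
    return ((1 << bits) - magnitude) & ((1 << bits) - 1)  # two's complement negate

assert restore_sign(0b00000011, False, 8) == 0b00000011   # +3 stays +3
assert restore_sign(0b00000011, True, 8) == 0b11111101    # -3 in 8 bits
```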
The size converter 3036 converts the output of the sign restorer 3034 to the proper size based on the value of the output command 2956, which values are described above with respect to Figure 29A. Preferably, the output of the sign restorer 3034 has a binary point specified by the output binary point 2954 value. Preferably, for the first predetermined value of the output command, the size converter 3036 discards the bits of the upper half of the sign restorer 3034 output. Furthermore, if the output of the sign restorer 3034 is positive and exceeds the maximum value expressible in the word size specified by the configuration 2902, or is negative and is less than the minimum value expressible in the word size, the saturator 3036 saturates its output to the respective maximum/minimum value expressible in the word size. For the second and third predetermined values, the size converter 3036 passes through the sign restorer 3034 output.
The mux 3037 selects either the size converter and saturator 3036 output or the accumulator 202 output 217, based on the output command 2956, for provision to the output register 3038. More specifically, for the first and second predetermined values of the output command 2956, the mux 3037 selects the lower word (whose size is specified by the configuration 2902) of the size converter and saturator 3036 output. For the third predetermined value, the mux 3037 selects the upper word of the size converter and saturator 3036 output. For the fourth predetermined value, the mux 3037 selects the lower word of the raw accumulator 202 value 217; for the fifth predetermined value, the mux 3037 selects the middle word of the raw accumulator 202 value 217; and for the sixth predetermined value, the mux 3037 selects the upper word of the raw accumulator 202 value 217. As described above, preferably the activation function unit 212 pads the upper bits of the upper word of the raw accumulator 202 value 217 with zeroes.
Figure 31 is an example of operation of the activation function unit 212 of Figure 30. As shown in the example, the configuration 2902 of the neural processing unit 126 is set to the narrow configuration. Additionally, the signed data 2912 and signed weight 2914 values are true. Additionally, the data binary point 2922 value indicates that, for the data RAM 122 words, the binary point is located such that there are 7 bits to the right of the binary point, and an example value of the first data word received by the neural processing unit 126 is shown as 0.1001110. Additionally, the weight binary point 2924 value indicates that, for the weight RAM 124 words, the binary point is located such that there are 3 bits to the right of the binary point, and an example value of the first weight word received by the neural processing unit 126 is shown as 00001.010.

The 16-bit product of the first data and weight words (which product is accumulated with the initial zero value of the accumulator 202) is shown as 000000.1100001100. Because the data binary point 2922 is 7 and the weight binary point 2924 is 3, the implied binary point of the accumulator 202 is located such that there are 10 bits to its right. In the case of the narrow configuration, as in the present embodiment, the accumulator 202 is 28 bits wide. In the example, after all the arithmetic logic operations are performed (e.g., all 1024 multiply-accumulates of Figure 20), the accumulator 202 value 217 is 000000000000000001.1101010100.
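The first multiply of this example can be reproduced with integer arithmetic, confirming that the product's fraction-bit count is the sum of the operands' fraction bits:

```python
# Reproducing Figure 31's first multiply: 0.1001110 (7 fraction bits)
# times 00001.010 (3 fraction bits) gives a product with 10 fraction bits.

data   = 0b1001110     # 0.1001110  = 78 / 2**7
weight = 0b00001010    # 00001.010  = 10 / 2**3
product = data * weight

assert product == 0b1100001100                       # 000000.1100001100
assert product / 2**10 == (78 / 2**7) * (10 / 2**3)  # same real value, 0.76171875
```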
The output binary point 2954 value indicates that there are 7 bits to the right of the binary point of the output. Therefore, after passing through the output binary point aligner 3002 and the canonical size compressor 3008, the accumulator 202 value 217 is scaled, rounded and compressed to the canonical form value, namely 000000001.1101011. In the example, the output binary point location indicates 7 fraction bits and the accumulator 202 binary point location indicates 10 fraction bits. Therefore, the output binary point aligner 3002 computes a difference of 3 and scales the accumulator 202 value 217 by shifting it right 3 bits. This is indicated in Figure 31 by the loss of the 3 least significant bits (binary 100) of the accumulator 202 value 217. Further in the example, the rounding control 2932 value indicates stochastic rounding, and in the example it is assumed that the sampled random bit 3005 is true. Consequently, as described above, the least significant bit is rounded up, because the round bit of the accumulator 202 value 217 (the most significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) is one and the sticky bit (the Boolean OR of the 2 least significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) is zero.
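The scaling and stochastic-rounding step of this example can be sketched as follows (a hedged model of the scheme just described: round up when the round bit is 1 and either the sticky bit or the sampled random bit is set):

```python
# Sketch of Figure 31's shift-then-round: the round bit is the most
# significant dropped bit; the sticky bit is the OR of the rest.

def shift_and_round(value: int, shift: int, random_bit: bool) -> int:
    kept = value >> shift
    dropped = value & ((1 << shift) - 1)
    round_bit = (dropped >> (shift - 1)) & 1
    sticky = dropped & ((1 << (shift - 1)) - 1)
    return kept + 1 if round_bit and (sticky or random_bit) else kept

acc = 0b11101010100                  # 1.1101010100 (10 fraction bits)
assert shift_and_round(acc, 3, True) == 0b11101011    # tie broken upward
assert shift_and_round(acc, 3, False) == 0b11101010   # tie broken downward
```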
Further in the example, the activation function 2934 indicates that a sigmoid function is to be used. Consequently, the bit selector 3012 selects the bits of the canonical form value such that the input to the sigmoid module 3024 has three whole bits and four fraction bits, as described above, i.e., the value 001.1101 as shown. The sigmoid module 3024 output value is placed into the canonical form, i.e., the value 000000000.1101110 as shown.
The output command 2956 of the example specifies the first predetermined value, i.e., to output the word size indicated by the configuration 2902, which in this case is a narrow word (8 bits). Consequently, the size converter 3036 converts the canonical sigmoid output value to an 8-bit quantity having an implied binary point located such that there are 7 bits to the right of the binary point, yielding an output value of 01101110, as shown.
Figure 32 is a second example of operation of the activation function unit 212 of Figure 30. The example of Figure 32 illustrates operation of the activation function unit 212 when the activation function 2934 indicates to pass through the accumulator 202 value 217 in the canonical size. As shown in the example, the configuration 2902 is set to the narrow configuration of the neural processing units 126.

In the example, the accumulator 202 is 28 bits wide, and the accumulator 202 binary point is located such that there are 10 bits to its right (either because the sum of the data binary point 2922 and the weight binary point 2924 is 10 according to one embodiment, or because the accumulator binary point 2923 is explicitly specified as having a value of 10 according to another embodiment, as described above). In the example, after all the arithmetic logic operations are performed, the accumulator 202 value 217 shown in Figure 32 is 000001100000011011.1101111010.
In this example, the output binary point 2954 value indicates 4 bits to the right of the binary point of the output. Therefore, after passing through the output binary point aligner 3002 and the canonical size compressor 3008, the accumulator 202 value 217 saturates and is compressed to the canonical form value 111111111111.1111 shown, which is received by the multiplexer 3032 as the canonical size passed value 3028.
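The align-then-saturate behavior just described can be sketched as follows; this is a minimal model of the output binary point aligner 3002 and canonical size compressor 3008 for an unsigned value, with illustrative function and parameter names:

```python
def to_canonical(acc: int, acc_frac: int, out_frac: int, canon_bits: int) -> int:
    """Align the accumulator binary point to the output binary point,
    then saturate to the canonical (unsigned) size."""
    shifted = acc >> (acc_frac - out_frac)   # drop excess fractional bits
    max_val = (1 << canon_bits) - 1
    return min(shifted, max_val)             # saturate on overflow

# 28-bit accumulator 000001100000011011.1101111010 (10 fractional bits),
# output binary point of 4, canonical size of 16 bits:
acc = 0b0000011000000110111101111010
canon = to_canonical(acc, acc_frac=10, out_frac=4, canon_bits=16)
print(format(canon, '016b'))                 # 1111111111111111, i.e. 111111111111.1111
```

The aligned value (98749) exceeds the 16-bit maximum of 65535, so the result saturates to all ones, matching the 111111111111.1111 of the example; consequently both the lower and upper 8-bit words selected by the size converter 3036 are 11111111.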
Two output commands 2956 are shown in this example. The first output command 2956 specifies the second predetermined value, i.e., to output the lower word of the canonical form size. Since the size indicated by the configuration 2902 is a narrow word (8 bits), the canonical size is 16 bits, and the size converter 3036 selects the lower 8 bits of the canonical size passed value 3028 to produce the 8-bit value 11111111 as shown in the figure. The second output command 2956 specifies the third predetermined value, i.e., to output the upper word of the canonical form size. Accordingly, the size converter 3036 selects the upper 8 bits of the canonical size passed value 3028 to produce the 8-bit value 11111111 as shown in the figure.
Figure 33 is a third example of the operation of the activation function unit 212 of Figure 30. The example of Figure 33 illustrates the operation of the activation function unit 212 when the activation function 2934 indicates that the entire raw accumulator 202 value 217 is to be passed through. As shown in the figure, the configuration 2902 is set to the wide configuration (e.g., 16-bit input words) of the neural processing units 126.
In this example, the accumulator 202 is 41 bits wide, with 8 bits to the right of the binary point (either because the sum of the data binary point 2912 and the weight binary point 2914 is 8 in one embodiment, or because the accumulator binary point 2923 is explicitly specified with a value of 8 in another embodiment). For example, after all the arithmetic logic operations are performed, the accumulator 202 value 217 shown in Figure 33 is 001000000000000000001100000011011.11011110.
Three output commands 2956 are shown in this example. The first output command specifies the fourth predetermined value, i.e., to output the lower word of the raw accumulator 202 value; the second output command specifies the fifth predetermined value, i.e., to output the middle word of the raw accumulator 202 value; and the third output command specifies the sixth predetermined value, i.e., to output the upper word of the raw accumulator 202 value. Since the size indicated by the configuration 2902 is a wide word (16 bits), as shown in Figure 33, in response to the first output command 2956 the multiplexer 3037 selects the 16-bit value 0001101111011110; in response to the second output command 2956 the multiplexer 3037 selects the 16-bit value 0000000000011000; and in response to the third output command 2956 the multiplexer 3037 selects the 16-bit value 0000000001000000.
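The word selection performed by the multiplexer 3037 in this example can be sketched as simple shifting and masking; the function name and index convention below are illustrative assumptions:

```python
def select_word(acc: int, word_index: int, word_bits: int = 16) -> int:
    """Select the lower (0), middle (1), or upper (2) 16-bit word of the
    raw accumulator value; the upper word is effectively zero-extended,
    since the 41-bit accumulator is narrower than three 16-bit words."""
    return (acc >> (word_index * word_bits)) & ((1 << word_bits) - 1)

# 41-bit accumulator 001000000000000000001100000011011.11011110 (8 fractional bits)
acc = 0b00100000000000000000110000001101111011110
for i in range(3):
    print(format(select_word(acc, i), '016b'))
# 0001101111011110, 0000000000011000, 0000000001000000
```

The three printed words match the three values selected in response to the three output commands 2956 of Figure 33.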
As described above, the neural network unit 121 may operate on integer data rather than floating-point data. This helps to simplify each neural processing unit 126, or at least the arithmetic logic unit 204 portion thereof. For example, the arithmetic logic unit 204 need not include the adder that a floating-point implementation would require in the multiplier 242 to add the exponents of the multiplicands. Similarly, the arithmetic logic unit 204 need not include the shifter that a floating-point implementation would require in the adder 234 to align the binary points of the addends. As those skilled in the art will appreciate, floating-point units are often very complex; thus, the examples given here are simplifications only of the arithmetic logic unit 204, and the integer embodiments with hardware fixed-point assist in which the user may specify the relevant binary points may also be used to simplify other portions. The use of integer units as the arithmetic logic units 204 may produce smaller (and faster) neural processing units 126 than floating-point embodiments, which facilitates the integration of a large array of neural processing units 126 into the neural network unit 121. The activation function unit 212 portion handles the scaling and saturation of the accumulator 202 value 217 based on the user-specified, preferably programmable, number of fractional bits required in the accumulated values and the number of fractional bits required in the output values. Any additional complexity, accompanying increase in size, and energy and/or time consumption in the fixed-point hardware assist of the activation function units 212 may be amortized by sharing the activation function units 212 among the arithmetic logic units 204, since, as shown in the embodiment of Figure 11, a shared embodiment may reduce the number of activation function units 1112.
The embodiments described herein enjoy many of the advantages of using integer arithmetic units to reduce hardware complexity (relative to using floating-point arithmetic units), while still operating on fractional numbers, i.e., numbers with a binary point. An advantage of floating-point arithmetic is that it accommodates data whose individual values may fall anywhere within a very wide range (limited in practice only by the size of the exponent range, which may be very large). That is, each floating-point number has its own potentially unique exponent value. However, the embodiments described herein recognize and exploit the fact that, in certain applications, the input data are highly parallel and fall within a relatively narrow range, such that all the parallel values may share the same "exponent". Accordingly, these embodiments allow the user to specify the binary point position once for all the input values and/or accumulated values. Similarly, recognizing and exploiting that the parallel outputs have a similarly narrow range, these embodiments allow the user to specify the binary point position once for all the output values. An artificial neural network is one example of such an application, although the embodiments of the present invention may also be employed to perform computations for other applications. By specifying the binary point position once for multiple inputs rather than for each individual input number, the embodiments of the invention may use memory more efficiently (i.e., require less memory) than a floating-point implementation and/or gain precision for a similar amount of memory, since the bits that would serve as exponents in a floating-point implementation can be used instead to increase numerical precision.
Furthermore, the embodiments of the invention recognize that precision may be lost when accumulating a large series of integer operations (e.g., through overflow or loss of the less significant fractional bits), and provide a solution, primarily in the form of an accumulator large enough to avoid the loss of precision.
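The accumulation precision problem that a sufficiently wide integer accumulator avoids can be demonstrated numerically; the sketch below simulates a 32-bit floating-point accumulator via rounding (an illustrative comparison, not part of the described hardware) against a wide integer accumulator like the 41-bit accumulator 202:

```python
import struct

def f32(x: float) -> float:
    """Round a Python float to IEEE-754 single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Accumulate 1000 copies of 1.0 on top of 2^24 in 32-bit floating point:
# each addition is rounded away, so the small addends are lost entirely.
fp_acc = f32(2.0 ** 24)
for _ in range(1000):
    fp_acc = f32(fp_acc + 1.0)
print(fp_acc == 2.0 ** 24)        # True: the accumulator never advanced

# A wide integer accumulator loses nothing.
int_acc = 1 << 24
for _ in range(1000):
    int_acc += 1
print(int_acc - (1 << 24))        # 1000
```

A 24-bit significand cannot represent 2^24 + 1, so every floating-point addition rounds back to 2^24, whereas the wide integer accumulator captures every addend exactly.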
Direct execution of neural network unit micro-operations
Figure 34 is a block diagram showing partial details of the processor 100 and the neural network unit 121 of Figure 1. The neural network unit 121 includes the pipeline stages 3401 of the neural processing units 126. The pipeline stages 3401, separated by stage registers, include combinational logic that accomplishes the operation of the neural processing units 126 as described herein, such as Boolean logic gates, multiplexers, adders, multipliers, comparators, and so forth. The pipeline stages 3401 receive a micro-operation 3418 from a multiplexer 3402. The micro-operation 3418 flows down the pipeline stages 3401 and controls their combinational logic. The micro-operation 3418 is a collection of bits. For a preferred embodiment, the micro-operation 3418 includes the bits of the memory address 123 of the data random access memory 122, the bits of the memory address 125 of the weight random access memory 124, the bits of the memory address 131 of the program memory 129, the control signals 213/713 of the mux-regs 208/705, and the fields of the control register 127 (e.g., the control registers of Figures 29A through 29C). In one embodiment, the micro-operation 3418 includes approximately 120 bits. The multiplexer 3402 receives micro-operations from three different sources and selects one of them as the micro-operation 3418 to provide to the pipeline stages 3401.
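Since the micro-operation 3418 is simply a set of control bits, packing its fields into one word can be sketched in software. The field names and widths below are illustrative assumptions (the text states only that one embodiment uses approximately 120 bits, and names the kinds of fields involved):

```python
# Hypothetical micro-op field layout: (name, width in bits).
FIELDS = [("data_ram_addr", 16), ("weight_ram_addr", 16),
          ("program_mem_addr", 12), ("muxreg_ctrl", 8), ("ctrl_reg_fields", 68)]

def pack_uop(values: dict) -> int:
    """Pack named fields into a single micro-op bit set, lowest field first."""
    uop, shift = 0, 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        assert value < (1 << width), f"{name} overflows {width} bits"
        uop |= value << shift
        shift += width
    return uop

total_bits = sum(w for _, w in FIELDS)
print(total_bits)                        # 120, matching the stated approximate size
uop = pack_uop({"data_ram_addr": 123, "weight_ram_addr": 125})
print(uop & 0xFFFF)                      # 123: the data RAM address field
```

Whatever the real widths, the point is the same: one flat bit vector flows down the pipeline stages 3401 and drives all their combinational logic at once.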
One source of micro-operations for the multiplexer 3402 is the sequencer 128 of Figure 1. The sequencer 128 decodes the neural network unit instructions received from the program memory 129 and in response generates a micro-operation 3416 that is provided to the first input of the multiplexer 3402.
A second source of micro-operations for the multiplexer 3402 is a decoder 3404 that receives microinstructions 105 from the reservation stations 108 of Figure 1, along with operands from the general purpose registers 116 and the media registers 118. For a preferred embodiment, as described above, the microinstructions 105 are produced by the instruction translator 104 in response to the translation of MTNN instructions 1400 and MFNN instructions 1500. A microinstruction 105 may include an immediate field that specifies a particular function (as specified by an MTNN instruction 1400 or an MFNN instruction 1500), such as starting or stopping the execution of a program in the program memory 129, directly executing a micro-operation from the media registers 118, or reading/writing a memory of the neural network unit as described above. The decoder 3404 decodes the microinstruction 105 and in response generates a micro-operation 3412 that is provided to the second input of the multiplexer. For a preferred embodiment, for some functions 1432/1532 of the MTNN instructions 1400/MFNN instructions 1500, the decoder 3404 need not generate a micro-operation 3412 to send down the pipeline 3401, e.g., writing the control register 127, starting the execution of a program in the program memory 129, pausing the execution of a program in the program memory 129, waiting for a program in the program memory 129 to complete execution, reading from the status register 127, and resetting the neural network unit 121.
A third source of micro-operations for the multiplexer 3402 is the media registers 118 themselves. For a preferred embodiment, as described above with respect to Figure 14, an MTNN instruction 1400 may specify a function that instructs the neural network unit 121 to directly execute a micro-operation 3414 provided by the media registers 118 to the third input of the multiplexer 3402. The direct execution of a micro-operation 3414 provided by the architectural media registers 118 is beneficial for testing the neural network unit 121, e.g., built-in self-test (BIST), and for debugging.
For a preferred embodiment, the decoder 3404 generates a mode indicator 3422 that controls the selection of the multiplexer 3402. When an MTNN instruction 1400 specifies a function to begin executing a program from the program memory 129, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3416 from the sequencer 128, until either an error occurs or the decoder 3404 encounters an MTNN instruction 1400 that specifies a function to stop executing the program from the program memory 129. When an MTNN instruction 1400 specifies a function that instructs the neural network unit 121 to directly execute a micro-operation 3414 provided by a media register 118, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3414 from the specified media register 118. Otherwise, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3412 from the decoder 3404.
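The three-way selection made through the mode indicator 3422 can be sketched as a simple dispatch; the mode names and encodings below are illustrative assumptions, not the patent's own:

```python
# Hypothetical encodings of the mode indicator 3422.
FROM_SEQUENCER, FROM_MEDIA_REG, FROM_DECODER = range(3)

def select_uop(mode: int, uop_3416: int, uop_3414: int, uop_3412: int) -> int:
    """Sketch of multiplexer 3402: pick the micro-op sent to the pipeline 3401."""
    if mode == FROM_SEQUENCER:    # program running from the program memory 129
        return uop_3416
    if mode == FROM_MEDIA_REG:    # direct execution from a media register 118
        return uop_3414
    return uop_3412               # otherwise: micro-op from the decoder 3404

print(select_uop(FROM_SEQUENCER, 0xA, 0xB, 0xC))  # 10, i.e. micro-op 3416
```

The default arm mirrors the text: unless a program is running or direct execution was requested, the decoder's own micro-operation 3412 is selected.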
Variable rate neural network unit
In many situations, after executing a program the neural network unit 121 enters an idle state, waiting for the processor 100 to process something it needs before the next program can run. For example, assume a situation similar to that described with respect to Figures 3 through 6A, in which the neural network unit 121 executes a multiply-accumulate-activation-function program (which may also be referred to as a feed forward neural network layer program) two or more consecutive times. The processor 100 may take significantly longer to write 512KB of weight values into the weight random access memory 124 for use by the next run of the neural network unit program than the neural network unit 121 takes to execute its program. In other words, the neural network unit 121 executes its program in a short time and then sits idle until the processor 100 has written the next weight values into the weight random access memory 124 for the next program run. This situation is illustrated in Figure 36A, described in more detail below. In such a situation, the neural network unit 121 may be run at a lower clock rate to stretch the program execution over a longer time, thereby spreading the energy consumed by the program execution over a longer period, which tends to keep the neural network unit 121, and perhaps the entire processor 100, at a lower temperature. This situation is referred to as relaxed mode and is illustrated in Figure 36B, described in more detail below.
Figure 35 is a block diagram showing a processor 100 with a variable rate neural network unit 121. The processor 100 is similar to the processor 100 of Figure 1, and like-numbered elements are similar. The processor 100 of Figure 35 also includes clock generation logic 3502 coupled to the functional units of the processor 100, namely the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106, the reservation stations 108, the neural network unit 121, the other execution units 112, the memory subsystem 114, the general purpose registers 116, and the media registers 118. The clock generation logic 3502 includes a clock generator, such as a phase-locked loop (PLL), that generates a clock signal having a primary clock rate, or clock frequency. For example, the primary clock rate may be 1 GHz, 1.5 GHz, 2 GHz, and so forth. The clock rate indicates the number of cycles per second, e.g., oscillations of the clock signal between a high state and a low state. Preferably, the clock signal has a balanced duty cycle, i.e., is in the high state for half the cycle and in the low state for the other half; alternatively, the clock signal may have an unbalanced duty cycle, in which the clock signal is in the high state longer than it is in the low state, or vice versa. Preferably, the PLL can be programmed to generate the primary clock signal at multiple clock rates. Preferably, the processor 100 includes a power management module that automatically adjusts the primary clock rate based on various factors, including the dynamically sensed operating temperature of the processor 100, its utilization, and commands from system software (e.g., the operating system, the basic input output system (BIOS)) indicating desired performance and/or power-saving targets. In one embodiment, the power management module includes microcode of the processor 100.
The clock generation logic 3502 also includes a clock distribution network, or clock tree. The clock tree distributes the primary clock signal to the functional units of the processor 100; as shown in Figure 35, this distribution conveys clock signal 3506-1 to the instruction fetch unit 101, clock signal 3506-2 to the instruction cache 102, clock signal 3506-10 to the instruction translator 104, clock signal 3506-9 to the rename unit 106, clock signal 3506-8 to the reservation stations 108, clock signal 3506-7 to the neural network unit 121, clock signal 3506-4 to the other execution units 112, clock signal 3506-3 to the memory subsystem 114, clock signal 3506-5 to the general purpose registers 116, and clock signal 3506-6 to the media registers 118; these signals are referred to collectively as the clock signals 3506. The clock tree has nodes, or wires, that convey the primary clock signals 3506 to their corresponding functional units. Additionally, the clock generation logic 3502 preferably includes clock buffers that regenerate the primary clock signals where needed to provide cleaner clock signals and/or to boost the voltage levels of the primary clock signals, particularly for distant nodes. Furthermore, each functional unit may have its own sub-clock tree that regenerates and/or boosts, as needed, the corresponding primary clock signal 3506 it receives.
The neural network unit 121 includes clock reduction logic 3504 that receives a relax indicator 3512 and the primary clock signal 3506-7 and in response generates a secondary clock signal. The secondary clock signal has a clock rate that is either the same as the primary clock rate or, when in relaxed mode, reduced from the primary clock rate by an amount programmed into the relax indicator 3512, in order to reduce thermal energy generation. The clock reduction logic 3504 is similar to the clock generation logic 3502 in that it has a clock distribution network, or clock tree, that distributes the secondary clock signal to the various functional blocks of the neural network unit 121; this distribution conveys clock signal 3508-1 to the array of neural processing units 126, clock signal 3508-2 to the sequencer 128, and clock signal 3508-3 to the interface logic 3514; these are referred to collectively as the secondary clock signals 3508. Preferably, the neural processing units 126 include a plurality of pipeline stages 3401, as shown in Figure 34, that include pipeline stage registers that receive the secondary clock signal 3508-1 from the clock reduction logic 3504.
The neural network unit 121 also includes interface logic 3514 that receives the primary clock signal 3506-7 and the secondary clock signal 3508-3. The interface logic 3514 is coupled between the lower portions of the front end of the processor 100 (e.g., the reservation stations 108, the media registers 118, and the general purpose registers 116) and the various functional blocks of the neural network unit 121, namely the clock reduction logic 3504, the data random access memory 122, the weight random access memory 124, the program memory 129, and the sequencer 128. The interface logic 3514 includes a data random access memory buffer 3522, a weight random access memory buffer 3524, the decoder 3404 of Figure 34, and the relax indicator 3512. The relax indicator 3512 holds a value that specifies how slowly the array of neural processing units 126 will execute the neural network unit program instructions. Preferably, the relax indicator 3512 specifies a divisor value N, by which the clock reduction logic 3504 divides the primary clock signal 3506-7 to generate the secondary clock signals 3508, such that the secondary clock rate is 1/N of the primary rate. Preferably, the value of N is programmable to any one of a plurality of different predetermined values that cause the clock reduction logic 3504 to generate the secondary clock signals 3508 at corresponding multiple different rates, each less than the primary clock rate.
In one embodiment, the clock reduction logic 3504 includes a clock divider circuit that divides the primary clock signal 3506-7 by the relax indicator 3512 value. In one embodiment, the clock reduction logic 3504 includes a clock gate (e.g., an AND gate) that gates the primary clock signal 3506-7 with an enable signal that is true only once every N cycles of the primary clock signal. By way of example, a circuit that includes a counter that counts up to N may be used to generate the enable signal. When the accompanying logic detects that the counter output matches N, the logic generates a true pulse on the secondary clock signal 3508 and resets the counter. Preferably, the relax indicator 3512 value is programmable by an architectural instruction, such as the MTNN instruction 1400 of Figure 14. Preferably, the architectural program running on the processor 100 programs the relax value into the relax indicator 3512 before instructing the neural network unit 121 to begin executing the neural network unit program, as described in more detail below with respect to Figure 37.
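The counter-based clock-gate embodiment just described can be sketched behaviorally; the function below is a cycle-level model under the assumption of an ideal counter and comparator, not a hardware description:

```python
def gated_clock(n: int, main_cycles: int):
    """Sketch of the clock-gate embodiment: a counter counts primary-clock
    cycles; when it matches N, a true pulse is produced on the secondary
    clock and the counter is reset, dividing the primary rate by N."""
    pulses, counter = [], 0
    for cycle in range(main_cycles):
        counter += 1
        if counter == n:          # comparator detects the counter matches N
            pulses.append(cycle)  # one true pulse on the secondary clock
            counter = 0           # reset the counter
    return pulses

# With N = 4, 1000 primary cycles yield 250 secondary cycles (primary rate / 4).
print(len(gated_clock(4, 1000)))   # 250
```

With N programmed to 1, every primary cycle produces a secondary pulse, which corresponds to normal mode.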
The weight random access memory buffer 3524 is coupled between the weight random access memory 124 and the media registers 118 to buffer data transfers between them. Preferably, the weight random access memory buffer 3524 is similar to one or more of the embodiments of the buffer 1704 of Figure 17. Preferably, the portion of the weight random access memory buffer 3524 that receives data from the media registers 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion of the weight random access memory buffer 3524 that receives data from the weight random access memory 124 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced from the primary clock rate according to the value programmed into the relax indicator 3512, i.e., according to whether the neural network unit 121 is operating in relaxed or normal mode. In one embodiment, the weight random access memory 124 is single-ported, as described above with respect to Figure 17, and is accessed in an arbitrated fashion both by the media registers 118 via the weight random access memory buffer 3524 and by the neural processing units 126 or the row buffer 1104 of Figure 11. In another embodiment, the weight random access memory 124 is dual-ported, as described above with respect to Figure 16, and each port is accessed in a concurrent fashion both by the media registers 118 via the weight random access memory buffer 3524 and by the neural processing units 126 or the row buffer 1104.
Similarly, the data random access memory buffer 3522 is coupled between the data random access memory 122 and the media registers 118 to buffer data transfers between them. Preferably, the data random access memory buffer 3522 is similar to one or more of the embodiments of the buffer 1704 of Figure 17. Preferably, the portion of the data random access memory buffer 3522 that receives data from the media registers 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion of the data random access memory buffer 3522 that receives data from the data random access memory 122 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced from the primary clock rate according to the value programmed into the relax indicator 3512, i.e., according to whether the neural network unit 121 is operating in relaxed or normal mode. In one embodiment, the data random access memory 122 is single-ported, as described above with respect to Figure 17, and is accessed in an arbitrated fashion both by the media registers 118 via the data random access memory buffer 3522 and by the neural processing units 126 or the row buffer 1104 of Figure 11. In another embodiment, the data random access memory 122 is dual-ported, as described above with respect to Figure 16, and each port is accessed in a concurrent fashion both by the media registers 118 via the data random access memory buffer 3522 and by the neural processing units 126 or the row buffer 1104.
Preferably, regardless of whether the data random access memory 122 and/or the weight random access memory 124 are single-ported or dual-ported, the interface logic 3514 includes the data random access memory buffer 3522 and the weight random access memory buffer 3524 in order to synchronize the primary clock domain and the secondary clock domain. Preferably, the data random access memory 122, the weight random access memory 124, and the program memory 129 each comprise a static random access memory (SRAM) that includes respective read enable, write enable, and memory select signals.
As described above, the neural network unit 121 is an execution unit of the processor 100. An execution unit is a functional unit of a processor that executes the microinstructions into which architectural instructions are translated, or that executes the architectural instructions themselves, e.g., the microinstructions 105 into which the architectural instructions 103 of Figure 1 are translated, or the architectural instructions 103 themselves. An execution unit receives operands from the general purpose registers of the processor, e.g., from the general purpose registers 116 and the media registers 118. An execution unit, in response to executing a microinstruction or an architectural instruction, may produce a result that is written to a general purpose register. The MTNN instruction 1400 and the MFNN instruction 1500 described with respect to Figures 14 and 15 are examples of the architectural instructions 103. The microinstructions implement the architectural instructions. More precisely, the collective execution by the execution units of the one or more microinstructions into which an architectural instruction is translated performs the operation specified by the architectural instruction on the inputs specified by the architectural instruction, to produce the result defined by the architectural instruction.
Figure 36A is a timing diagram illustrating an example of the processor 100 with the neural network unit 121 operating in normal mode, i.e., at the primary clock rate. In the timing diagram, time progresses from left to right. The processor 100 executes the architectural program at the primary clock rate. More precisely, the front end of the processor 100 (e.g., the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106, and the reservation stations 108) fetches, decodes, and issues architectural instructions to the neural network unit 121 and the other execution units 112 at the primary clock rate.
Initially, the architectural program executes an architectural instruction (e.g., an MTNN instruction 1400) that the front end of the processor 100 issues to the neural network unit 121, instructing the neural network unit 121 to begin executing the neural network unit program in its program memory 129. Beforehand, the architectural program executed an architectural instruction to write into the relax indicator 3512 a value that specifies the primary clock rate, i.e., to place the neural network unit in normal mode. More precisely, the value programmed into the relax indicator 3512 causes the clock reduction logic 3504 to generate the secondary clock signals 3508 at the primary clock rate of the primary clock signals 3506. Preferably, in this case, the clock buffers of the clock reduction logic 3504 simply boost the voltage levels of the primary clock signals 3506. Also beforehand, the architectural program executed architectural instructions to write the data random access memory 122 and the weight random access memory 124, and to write the neural network unit program into the program memory 129. In response to the neural network unit program MTNN instruction 1400, the neural network unit 121 begins executing the neural network unit program at the primary clock rate, since the relax indicator 3512 was programmed with the primary rate value. After the neural network unit 121 begins executing, the architectural program continues to execute architectural instructions at the primary clock rate, chiefly MTNN instructions 1400 to write and/or read the data random access memory 122 and the weight random access memory 124 in preparation for the next instance, or invocation, or run, of the neural network unit program.
In the example of Figure 36A, the neural network unit 121 completes the execution of the neural network unit program in significantly less time (e.g., one fourth the time) than it takes the architectural program to complete its writes/reads of the data random access memory 122 and the weight random access memory 124. For example, running at the primary clock rate, the neural network unit 121 may take approximately 1000 clock cycles to execute the neural network unit program, whereas the architectural program takes approximately 4000 clock cycles. Consequently, the neural network unit 121 sits idle for the remainder of the time, which in this example is a significant amount of time, e.g., approximately 3000 primary clock cycles. As shown in the example of Figure 36A, this pattern then repeats, possibly many consecutive times, depending upon the size and configuration of the neural network. Because the neural network unit 121 may be a relatively large and transistor-dense functional unit of the processor 100, its operation may generate a significant amount of thermal energy, particularly when running at the primary clock rate.
Figure 36B is a timing diagram illustrating an example of the processor 100 operating the neural network unit 121 in relaxed mode, in which the neural network unit 121 runs at a clock rate that is less than the primary clock rate. The timing diagram of Figure 36B is similar to that of Figure 36A, in which the processor 100 executes an architectural program at the primary clock rate. The example of Figure 36B assumes the same architectural program and neural network unit program as those of Figure 36A. However, before starting the neural network unit program, the architectural program executes an MTNN instruction 1400 that programs the relax indicator 3512 with a value that causes the clock reduction logic 3504 to generate a secondary clock signal 3508 at a secondary clock rate that is less than the primary clock rate. That is, the architectural program places the neural network unit 121 in the relaxed mode of Figure 36B rather than the normal mode of Figure 36A. Consequently, the neural processing units 126 execute the neural network unit program at the secondary clock rate, which in the relaxed mode is less than the primary clock rate. In this example, it is assumed that the relax indicator 3512 is programmed with a value that specifies the secondary clock rate as one quarter of the primary clock rate. Consequently, the time the neural network unit 121 takes to execute the neural network unit program in relaxed mode is four times as long as in normal mode, as may be seen by comparing Figures 36A and 36B, which show that the amount of time the neural network unit 121 spends idle is significantly shorter. Accordingly, the duration over which the neural network unit 121 consumes energy executing the neural network unit program in Figure 36B is roughly four times as long as when the program runs in normal mode in Figure 36A. Therefore, the neural network unit 121 executing the neural network unit program in Figure 36B generates roughly one quarter of the heat energy per unit of time as in Figure 36A, which may have the advantages described herein.
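The time/power trade-off just described can be sketched with simple arithmetic. This is an illustrative model only, not part of the patent; the cycle count (3000) and divisor (4) are taken from the examples above.

```python
# Toy arithmetic for relaxed mode: the same number of clock cycles is
# executed, but at a divided clock rate, so duration scales up by the
# divisor while heat generated per unit of time scales down by it.
def relaxed_mode_scaling(normal_cycles: int, clock_divisor: int):
    """Return (duration in primary-clock periods, relative power)."""
    duration = normal_cycles * clock_divisor   # 4x longer at 1/4 clock
    relative_power = 1.0 / clock_divisor       # 1/4 the heat per unit time
    return duration, relative_power

dur, power = relaxed_mode_scaling(normal_cycles=3000, clock_divisor=4)
```

Total energy for the program is unchanged in this first-order model; only its dissipation is spread over a longer interval, which is what allows the package and cooling mechanisms to shed it more gracefully.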
Figure 37 is a flowchart illustrating operation of the processor 100 of Figure 35. The operation described by the flowchart is similar to the operation described above with respect to Figures 35, 36A and 36B. Flow begins at step 3702.
At step 3702, the processor 100 executes MTNN instructions 1400 to write the weights into the weight RAM 124 and to write the data into the data RAM 122. Flow proceeds to step 3704.
At step 3704, the processor 100 executes an MTNN instruction 1400 that programs the relax indicator 3512 with a value that specifies a clock rate lower than the primary clock rate, that is, it places the neural network unit 121 in relaxed mode. Flow proceeds to step 3706.
At step 3706, the processor 100 executes an MTNN instruction 1400 that instructs the neural network unit 121 to begin executing the neural network unit program, in a manner similar to that shown in Figure 36B. Flow proceeds to step 3708.
At step 3708, the neural network unit 121 begins executing the neural network unit program. Meanwhile, the processor 100 may execute MTNN instructions 1400 to write new weights into the weight RAM 124 (and possibly new data into the data RAM 122), and/or execute MFNN instructions 1500 to read results from the data RAM 122 (and possibly results from the weight RAM 124). Flow proceeds to step 3712.
At step 3712, the processor 100 executes an MFNN instruction 1500 (e.g., reading the status register 127) to detect that the neural network unit 121 has finished executing its program. Assuming the architectural program selected a good value for the relax indicator 3512, the time the neural network unit 121 takes to execute the neural network unit program will be approximately the same as the time the processor 100 takes to execute the portion of the architectural program that accesses the weight RAM 124 and/or data RAM 122, as shown in Figure 36B. Flow proceeds to step 3714.
At step 3714, the processor 100 executes an MTNN instruction 1400 that programs the relax indicator 3512 with a value that specifies the primary clock rate, that is, it places the neural network unit 121 in normal mode. Flow proceeds to step 3716.
At step 3716, the processor 100 executes an MTNN instruction 1400 that instructs the neural network unit 121 to begin executing the neural network unit program, in a manner similar to that shown in Figure 36A. Flow proceeds to step 3718.
At step 3718, the neural network unit 121 begins executing the neural network unit program in normal mode. Flow ends at step 3718.
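The flow of Figure 37 can be sketched as a short driver loop. This is a toy model only: the class name, method names and the single-poll completion are illustrative assumptions, not the patent's MTNN/MFNN semantics.

```python
# Toy model of the Figure 37 flow: program the relax indicator, start the
# program, poll for completion, then restore normal mode.
class ToyNNU:
    def __init__(self):
        self.clock_divisor = 1     # 1 = primary clock rate (normal mode)
        self.running = False

    def mtnn_relax(self, divisor):  # steps 3704 / 3714: program relax indicator
        self.clock_divisor = divisor

    def mtnn_start(self):           # steps 3706 / 3716: start the NNU program
        self.running = True

    def mfnn_status(self):          # step 3712: read the status register
        self.running = False        # toy model: program finishes on first poll
        return "done"

nnu = ToyNNU()
nnu.mtnn_relax(4)            # enter relaxed mode at one-quarter clock
nnu.mtnn_start()
status = nnu.mfnn_status()   # detect that the program has finished
nnu.mtnn_relax(1)            # return to normal mode
```

In a real driver the status poll would repeat until completion; here it is collapsed to a single read for brevity.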
As described above, relative to executing the neural network unit program in normal mode (i.e., at the processor's primary clock rate), executing it in relaxed mode spreads the execution over time and may avoid high temperatures. More specifically, when the neural network unit executes a program in relaxed mode, it generates heat at the lower clock rate, and that heat may be more gracefully dissipated through the package surrounding the neural network unit (e.g., the semiconductor device substrate, metal layers and underlying substrate) and the surrounding cooling mechanisms (e.g., heat sink, fans). Consequently, the devices in the neural network unit (e.g., transistors, capacitors, wires) are more likely to operate at lower temperatures. Generally, operating in relaxed mode also helps to reduce device temperatures in other portions of the processor die. Lower operating temperatures, particularly with respect to the junction temperatures of the devices, mitigate the generation of leakage current. Furthermore, because the amount of current flowing per unit of time is smaller, inductive noise and IR-drop noise are also reduced. Additionally, lower temperatures have a positive effect on the negative-bias temperature instability (NBTI) and positive-bias temperature instability (PBTI) of the MOSFETs in the processor, which may increase reliability and/or the lifetime of the devices and of the processor portions. Lower temperatures also mitigate Joule heating and electromigration in the processor's metal layers.
Communication mechanism between architectural programs and non-architectural programs that share neural network unit resources
As described above, in the examples of Figures 24 through 28 and Figures 35 through 37, the resources of the data RAM 122 and weight RAM 124 are shared. The neural processing units 126 share the data RAM 122 and weight RAM 124 with the front end of the processor 100. More specifically, both the neural processing units 126 and the front end of the processor 100, e.g., the media registers 118, read from and write to the data RAM 122 and weight RAM 124. Stated alternatively, an architectural program executing on the processor 100 and a neural network unit program executing on the neural network unit 121 share the data RAM 122 and weight RAM 124, and in some situations, as described above, this requires controlling the flow between the architectural program and the neural network unit program. The resources of the program memory 129 are also shared to some degree, since the architectural program writes to it and the sequencer 128 reads from it. The embodiments described herein provide a high-performance solution for controlling the flow of access to the shared resources between the architectural programs and the neural network unit programs.
In the embodiments described herein, neural network unit programs are also referred to as non-architectural programs, neural network unit instructions are also referred to as non-architectural instructions, and the neural network unit instruction set (also referred to above as the NPU instruction set) is also referred to as the non-architectural instruction set. The non-architectural instruction set is distinct from the architectural instruction set. In embodiments in which the processor 100 includes an instruction translator 104 that translates architectural instructions into microinstructions, the non-architectural instruction set is also distinct from the microinstruction set.
Figure 38 is a block diagram illustrating the sequencer 128 of the neural network unit 121 in more detail. The sequencer 128 provides a memory address to the program memory 129 to select a non-architectural instruction that is provided to the sequencer 128, as described above. As shown in Figure 38, the memory address is held in a program counter 3802 of the sequencer 128. The sequencer 128 generally increments sequentially through the addresses of the program memory 129 unless it encounters a non-architectural control instruction, such as a loop or branch instruction, in which case the sequencer 128 updates the program counter 3802 to the target address of the control instruction, i.e., to the address of the non-architectural instruction at the target of the control instruction. Therefore, the address 131 held in the program counter 3802 specifies the address in the program memory 129 of the non-architectural instruction of the non-architectural program currently being fetched for execution by the neural processing units 126. The value of the program counter 3802 may be obtained by the architectural program via the neural network unit program counter field 3912 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to decide where to read/write data from/to the data RAM 122 and/or weight RAM 124 based on the progress of the non-architectural program.
The sequencer 128 also includes a loop counter 3804 that operates in conjunction with a non-architectural loop instruction, such as the loop-to-1 instruction at address 10 of Figure 26A and the loop-to-1 instruction of Figure 28. In the examples of Figures 26A and 28, the loop counter 3804 is loaded with the value specified by the non-architectural initialize instruction at address 0, e.g., with the value 400. Each time the sequencer 128 encounters the loop instruction and jumps to the target instruction (e.g., the multiply-accumulate instruction at address 1 of Figure 26A or the maxwacc instruction at address 1 of Figure 28), the sequencer 128 decrements the loop counter 3804. Once the loop counter 3804 reaches zero, the sequencer 128 proceeds to the next sequential non-architectural instruction. In an alternative embodiment, the loop counter is loaded with a loop count specified in the loop instruction itself the first time the loop instruction is encountered, which obviates the need for a non-architectural initialize instruction to initialize the loop counter 3804. Thus, the value of the loop counter 3804 indicates the number of times the loop body of the non-architectural program has yet to be executed. The value of the loop counter 3804 may be obtained by the architectural program via the loop count field 3914 of the status register 127, as shown in Figure 39 below. This enables the architectural program to decide where to read/write data from/to the data RAM 122 and/or weight RAM 124 based on the progress of the non-architectural program. In one embodiment, the sequencer includes three additional loop counters to accommodate nested loops in the non-architectural program, and the values of the three additional loop counters are also readable via the status register 127. A bit in the loop instruction indicates which of the four loop counters is used by the current loop instruction.
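The loop counter's decrement-and-fall-through behavior can be sketched as follows. This is an illustrative software model under the assumption that the counter is pre-loaded by the initialize instruction and that the loop body executes once per counter value; it is not the hardware implementation.

```python
# Toy model of the sequencer's loop counter 3804: the counter is loaded
# by an initialize instruction, decremented each time the loop
# instruction jumps back to its target, and the sequencer falls through
# to the next sequential instruction when the counter reaches zero.
def run_loop(loop_count: int, body):
    counter = loop_count          # loaded by the initialize instruction
    iterations = 0
    while True:
        body()                    # instructions from the target to the loop
        iterations += 1
        counter -= 1              # loop instruction decrements the counter
        if counter == 0:
            break                 # proceed to the next sequential instruction
    return iterations
```

With a loaded value of 400, as in the Figure 26A example, the body executes 400 times in this model, once per row of the data array being processed.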
The sequencer 128 also includes an iteration counter 3806. The iteration counter 3806 operates in conjunction with non-architectural instructions, such as the multiply-accumulate instructions at address 2 of Figures 4, 9, 20 and 26A, and the maxwacc instruction at address 2 of Figure 28, which will be referred to hereafter as "execute" instructions. In the examples above, each execute instruction specifies an iteration count of 511, 511, 1023, 2 and 3, respectively. When the sequencer 128 encounters an execute instruction that specifies a non-zero iteration count, the sequencer 128 loads the iteration counter 3806 with the specified value. Additionally, the sequencer 128 generates the appropriate micro-operations 3418 to control the logic in the pipeline stages 3401 of the neural processing units 126 of Figure 34 for execution, and decrements the iteration counter 3806. If the iteration counter 3806 is greater than zero, the sequencer 128 again generates the appropriate micro-operations 3418 to control the logic in the neural processing units 126 and decrements the iteration counter 3806. The sequencer 128 continues in this fashion until the iteration counter 3806 reaches zero. Thus, the value of the iteration counter 3806 indicates the number of operations specified in the non-architectural execute instruction (e.g., multiply-accumulate, maximum, and the like, on the accumulator value and a data/weight word) that remain to be performed. The value of the iteration counter 3806 may be obtained by the architectural program via the iteration count field 3916 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to decide where to read/write data from/to the data RAM 122 and/or weight RAM 124 based on the progress of the non-architectural program.
Figure 39 is a block diagram illustrating certain fields of the control and status register 127 of the neural network unit 121. The fields include the address 2602 of the weight RAM row most recently written by the neural processing units 126 executing the non-architectural program, the address 2604 of the weight RAM row most recently read by the non-architectural program executed by the neural processing units 126, the address 2606 of the data RAM row most recently written by the neural processing units 126 executing the non-architectural program, and the address 2608 of the data RAM row most recently read by the non-architectural program executed by the neural processing units 126, as shown in Figure 26B above. Additionally, the fields include a neural network unit program counter 3912 field, a loop counter 3914 field, and an iteration counter 3916 field. As described above, the architectural program may read the status register 127 into the media registers 118 and/or general purpose registers 116, e.g., via an MFNN instruction 1500 that reads the values of the neural network unit program counter 3912, loop counter 3914 and iteration counter 3916 fields. The value of the program counter field 3912 reflects the value of the program counter 3802 of Figure 38. The value of the loop counter field 3914 reflects the value of the loop counter 3804. The value of the iteration counter field 3916 reflects the value of the iteration counter 3806. In one embodiment, the sequencer 128 updates the program counter field 3912, loop counter field 3914 and iteration counter field 3916 each time it adjusts the program counter 3802, loop counter 3804 or iteration counter 3806, so that the field values are current when the architectural program reads them. In another embodiment, when the neural network unit 121 executes an architectural instruction that reads the status register 127, the neural network unit 121 simply obtains the current values of the program counter 3802, loop counter 3804 and iteration counter 3806 and provides them back to the architectural instruction (e.g., into a media register 118 or general purpose register 116).
It may thus be observed that the values of the fields of the status register 127 of Figure 39 may be characterized as information about the progress of a non-architectural program during its execution by the neural network unit. Certain specific aspects of the non-architectural program's progress, such as the program counter 3802 value, the loop counter 3804 value, the iteration counter 3806 value, the field 2602/2604 holding the address 125 of the most recently read/written weight RAM 124 row, and the field 2606/2608 holding the address 123 of the most recently read/written data RAM 122 row, have been described in previous sections. The architectural program executing on the processor 100 may read the non-architectural program progress values of Figure 39 from the status register 127 and use the information to make decisions, e.g., via architectural instructions such as compare and branch instructions. For example, the architectural program decides which rows of the data RAM 122 and/or weight RAM 124 to read/write data/weights from/to, in order to control the flow of data into and out of the data RAM 122 or weight RAM 124, particularly for large data sets and/or for overlapped execution of different non-architectural instructions. Examples of such decision-making by the architectural program are described in the sections before and after this one.
For example, as described above with respect to Figure 26A, the architectural program configures the non-architectural program to write the results of the convolutions back to rows of the data RAM 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program reads the results from the data RAM 122 as the neural network unit 121 writes them, using the address of the most recently written data RAM 122 row 2606.
For another example, as described above with respect to Figure 26B, the architectural program uses the information from the status register 127 fields of Figure 38 to determine the progress of the non-architectural program in performing the convolution of the data array 2404 of Figure 24 in five 512 x 1600 chunks. The architectural program writes the first 512 x 1600 chunk of the 2560 x 1600 data array into the weight RAM 124 and starts the non-architectural program, with a loop count of 1600 and an initialized weight RAM 124 output row of 0. As the neural network unit 121 executes the non-architectural program, the architectural program reads the status register 127 to determine the most recently written row 2602 of the weight RAM 124 so that it may read the valid convolution results written by the non-architectural program and overwrite them with the next 512 x 1600 chunk once they have been read. In this manner, as soon as the neural network unit 121 has completed execution of the non-architectural program on the first 512 x 1600 chunk, the processor 100 can immediately update the non-architectural program as necessary and start it again to process the next 512 x 1600 chunk.
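The overlap described above amounts to a producer/consumer pattern driven by the most-recently-written-row field. The sketch below is an illustrative model, not the patent's mechanism: the row numbers fed in stand for successive reads of the status register 127.

```python
# Toy model of the architectural program consuming result rows while the
# non-architectural program is still producing them: each element of
# `written_rows_feed` stands for one status-register read returning the
# most recently written row.
def consume_as_produced(total_rows: int, written_rows_feed):
    consumed = 0
    for most_recent in written_rows_feed:    # successive status polls
        while consumed <= most_recent:
            consumed += 1                    # read one finished result row
        if consumed == total_rows:
            break
    return consumed

# Three polls observe rows 0-2, then 3-5, then 6-7 completed.
rows_done = consume_as_produced(8, [2, 5, 7])
```

Because rows are consumed as soon as they are observed, the freed rows can be overwritten with the next chunk without waiting for the whole program to finish.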
For another example, assume the architectural program has the neural network unit 121 perform a series of classic neural network multiply-accumulate-activation-function computations in which the weights are stored in the weight RAM 124 and the results are written back to the data RAM 122. In this case, once the non-architectural program has read a row of the weight RAM 124, it will not read it again. Thus, once the current weights have been read/used by the non-architectural program, the architectural program may begin overwriting the weights in the weight RAM 124 with new weights for the next instance of the non-architectural program (e.g., the next neural network layer). In this case, the architectural program reads the status register 127 to obtain the address of the most recently read weight RAM row 2604 in order to decide where in the weight RAM 124 it may write the new set of weights.
For another example, assume the architectural program knows that the non-architectural program includes an execute instruction with a large iteration count, such as the non-architectural multiply-accumulate instruction at address 2 of Figure 20. In this case, the architectural program may need to know the iteration count 3916 in order to know approximately how many more clock cycles will be required to complete the non-architectural instruction, so that the architectural program can decide which of two or more courses of action to take. For example, if the time to completion is long, the architectural program may relinquish control to another architectural program, such as the operating system. Similarly, assume the architectural program knows that the non-architectural program includes a loop body with a relatively large loop count, such as the non-architectural program of Figure 28. In this case, the architectural program may need to know the loop count 3914 in order to know approximately how many more clock cycles will be required to complete the non-architectural program, so that it can decide which of two or more courses of action to take.
For another example, assume the architectural program has the neural network unit 121 perform a pooling operation similar to that described with respect to Figures 27 and 28, in which the data to be pooled is stored in the weight RAM 124 and the results are written back to the weight RAM 124. However, unlike the examples of Figures 27 and 28, assume the results of this example are written back to the top 400 rows of the weight RAM 124, e.g., rows 1600 through 1999. In this case, once the non-architectural program has read the four rows of weight RAM 124 data that it pools, it will not read them again. Thus, once the current four rows of data have been read/used by the non-architectural program, the architectural program may begin overwriting the data in the weight RAM 124 with new data (e.g., the weights for the next instance of the non-architectural program, for example a non-architectural program that performs classic multiply-accumulate-activation-function operations on the pooled data). In this case, the architectural program reads the status register 127 to obtain the address of the most recently read weight RAM row 2604 in order to decide where in the weight RAM 124 to write the new set of weights.
Recurrent neural network acceleration
Classic feed-forward neural networks include no memory of previous inputs to the network. Feed-forward neural networks are typically used to perform tasks in which the multiple inputs presented to the network over time are independent of one another, as are the multiple outputs. In contrast, recurrent neural networks (RNNs) are typically useful for tasks in which the sequence of inputs presented to the network over time has significance. (The items of the sequence are commonly referred to as time steps.) Consequently, an RNN includes a notion of memory, or internal state, that holds information based on calculations the network performed in response to previous inputs of the sequence, and the output of the RNN is a function of the internal state as well as of the input of the next time step. Tasks such as speech recognition, language modeling, text generation, language translation, image caption generation and certain forms of handwriting recognition are examples of tasks that RNNs tend to perform well.
Three well-known examples of recurrent neural networks are the Elman RNN, the Jordan RNN, and the long short-term memory (LSTM) network. An Elman RNN includes content nodes that remember the state of the RNN's hidden layer for the current time step, which is provided as an input to the hidden layer for the next time step. A Jordan RNN is similar to an Elman RNN, except that its content nodes remember the state of the RNN's output layer rather than its hidden layer. An LSTM network includes an LSTM layer of LSTM cells. Each LSTM cell has a current state and a current output for the current time step and a new state and a new output for a new, or subsequent, time step. An LSTM cell includes an input gate and an output gate, as well as a forget gate, which causes the cell to forget its remembered state. These three types of RNN are described in more detail below.
In the context of the present disclosure, for a recurrent neural network such as an Elman or Jordan RNN, each execution by the neural network unit uses a time step, takes a set of input layer node values, and performs the calculations necessary to propagate them through the RNN, producing the output layer node values as well as the hidden layer and content layer node values. Thus, input layer node values are associated with the time step in which the hidden, output and content layer node values are computed; and the hidden, output and content layer node values are associated with the time step in which they are produced. Input layer node values are sampled values of the system being modeled by the RNN, e.g., images, speech samples, snapshots of financial market data. For an LSTM network, each execution by the neural network unit uses a time step, takes a set of cell input values, and performs the calculations necessary to produce the cell output values (as well as the cell state and the input gate, forget gate and output gate values), which may also be described as propagating the cell input values through the LSTM layer cells. Thus, cell input values are associated with the time step in which the cell state and the input gate, forget gate and output gate values are computed; and the cell state and the input gate, forget gate and output gate values are associated with the time step in which they are produced.
A content layer node value, also referred to as a state node, is a state value of the neural network, namely a state value based on the input layer node values associated with previous time steps, not merely the input layer node values associated with the current time step. The calculations performed by the neural network unit for a time step (e.g., the hidden layer node value calculations of an Elman or Jordan RNN) are a function of the content layer node values produced in the previous time step. Therefore, the network state value (the content node values) at the beginning of a time step influences the output layer node values produced during that time step. Furthermore, the network state value at the end of the time step is affected by both the input node values of that time step and the network state value at the beginning of the time step. Similarly, for an LSTM cell, the cell state value is based on the cell input values associated with previous time steps, not merely the cell input values associated with the current time step. Because the calculations performed by the neural network unit for a time step (e.g., the calculation of the next cell state) are a function of the cell state values produced in the previous time step, the network state value (the cell state values) at the beginning of the time step influences the cell output values produced during that time step, and the network state value at the end of the time step is affected by the cell input values of that time step and the previous network state value.
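For the Elman case, the dependence just described can be written compactly as one time step of recurrence. The notation below (weight matrices $W$, bias vectors $b$, activation functions $f$ and $g$) is ours, not the patent's, and it assumes the fed-back content value is the post-activation hidden value; as noted below, it may instead be the pre-activation accumulation.

```latex
z_t = f\!\left(W_{dz}\, d_t + W_{cz}\, c_{t-1} + b_z\right), \qquad
c_t = z_t, \qquad
y_t = g\!\left(W_{zy}\, z_t + b_y\right)
```

Here $d_t$, $z_t$, $c_t$ and $y_t$ are the input, hidden, content and output node values at time step $t$, matching the node names D, Z, C and Y used in Figure 40 below; the presence of $c_{t-1}$ in the first equation is precisely the network state carried across time steps.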
Figure 40 is a block diagram illustrating an example of an Elman RNN. The Elman RNN of Figure 40 includes input layer nodes, or neurons, denoted D0, D1 through Dn, referred to collectively as the input layer nodes D and individually, generically, as an input layer node D; hidden layer nodes/neurons, denoted Z0, Z1 through Zn, referred to collectively as the hidden layer nodes Z and individually, generically, as a hidden layer node Z; output layer nodes/neurons, denoted Y0, Y1 through Yn, referred to collectively as the output layer nodes Y and individually, generically, as an output layer node Y; and content layer nodes/neurons, denoted C0, C1 through Cn, referred to collectively as the content layer nodes C and individually, generically, as a content layer node C. In the example Elman RNN of Figure 40, each hidden layer node Z has an input connected to the output of each input layer node D, and an input connected to the output of each content layer node C; each output layer node Y has an input connected to the output of each hidden layer node Z; and each content layer node C has an input connected to the output of a corresponding hidden layer node Z.
In many respects, an Elman RNN operates similarly to a conventional feed-forward artificial neural network. That is, for a given node, each input connection of the node has an associated weight; the value the node receives on an input connection is multiplied by the associated weight to produce a product; the node adds the products associated with all of its input connections to produce a sum (a bias term may also be included in the sum); typically, an activation function is also performed on the sum to produce the node's output value, which is sometimes referred to as the node's activation. For a conventional feed-forward network, data always flows in the direction from the input layer to the output layer. That is, the input layer provides values to a hidden layer (there are typically multiple hidden layers), the hidden layers produce their output values, which are provided to the output layer, and the output layer produces the output that may be consumed.
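The per-node computation just described (weighted products, accumulation with an optional bias term, then an activation function) can be sketched directly. The choice of sigmoid as the activation function below is purely illustrative.

```python
import math

def node_output(inputs, weights, bias=0.0):
    """One node's computation: multiply each input by its connection
    weight, accumulate the products (plus an optional bias term), and
    apply an activation function (sigmoid, for illustration)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))   # the node's activation value

# 1.0*0.5 + 2.0*(-0.25) = 0, and sigmoid(0) = 0.5
out = node_output([1.0, 2.0], [0.5, -0.25])
```

In the neural network unit 121, each neural processing unit 126 performs this multiply-accumulate over its input connections, with the activation function applied to the final accumulator value.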
However, unlike a conventional feed-forward network, an Elman RNN also includes feedback connections, namely the connections from the hidden layer nodes Z to the content layer nodes C of Figure 40. The Elman RNN operates as follows: when the input layer nodes D provide an input value to the hidden layer nodes Z in a new time step, the content nodes C provide to the hidden layer Z values that were the output values of the hidden layer nodes Z in response to the previous input, i.e., the input previous to the current time step. In this sense, the content nodes C of an Elman RNN are a memory based on the input values of previous time steps. Figures 41 and 42 illustrate an embodiment of the operation of the neural network unit 121 performing the calculations associated with the Elman RNN of Figure 40.
For purposes of the present disclosure, an Elman RNN is a recurrent neural network comprising at least an input node layer, a hidden node layer, an output node layer, and a content node layer. For a given time step, the content node layer stores the results that the hidden node layer produced and fed back to the content node layer in the previous time step. The result fed back to the content layer may be the result of an activation function, or it may be the result of the accumulations performed by the hidden node layer without an activation function being performed.
Figure 41 is a block diagram illustrating an example of the data layout within the data RAM 122 and weight RAM 124 of the neural network unit 121 as it performs the calculations associated with the Elman RNN of Figure 40. The example of Figure 41 assumes the Elman RNN of Figure 40 has 512 input nodes D, 512 hidden nodes Z, 512 content nodes C, and 512 output nodes Y. It is also assumed that the Elman RNN is fully connected, i.e., all 512 input nodes D connect to each hidden node Z as inputs, all 512 content nodes C connect to each hidden node Z as inputs, and all 512 hidden nodes Z connect to each output node Y as inputs. Additionally, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. Finally, the example assumes the weights associated with the connections from the content nodes C to the hidden nodes Z all have a value of 1; therefore, there is no need to store these unity weight values.
As shown, the lower 512 rows (rows 0 through 511) of the weight RAM 124 hold the weight values associated with the connections between the input nodes D and the hidden nodes Z. More precisely, as shown, row 0 holds the weights associated with the input connections from input node D0 to the hidden nodes Z; that is, word 0 holds the weight associated with the connection between input node D0 and hidden node Z0, word 1 holds the weight associated with the connection between input node D0 and hidden node Z1, word 2 holds the weight associated with the connection between input node D0 and hidden node Z2, and so on, and word 511 holds the weight associated with the connection between input node D0 and hidden node Z511. Row 1 holds the weights associated with the input connections from input node D1 to the hidden nodes Z: word 0 holds the weight of the connection between input node D1 and hidden node Z0, word 1 holds the weight of the connection between input node D1 and hidden node Z1, word 2 holds the weight of the connection between input node D1 and hidden node Z2, and so on, and word 511 holds the weight of the connection between input node D1 and hidden node Z511. This pattern continues up to row 511, which holds the weights associated with the input connections from input node D511 to the hidden nodes Z: word 0 holds the weight of the connection between input node D511 and hidden node Z0, word 1 holds the weight of the connection between input node D511 and hidden node Z1, word 2 holds the weight of the connection between input node D511 and hidden node Z2, and so on, and word 511 holds the weight of the connection between input node D511 and hidden node Z511. This layout and its use are similar to the embodiments described above with respect to Figures 4 through 6A.
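The row/word addressing just described (together with the rows 512 through 1023 layout discussed next) can be summarized in a small plain-Python sketch; `N` and the function names are illustrative, not part of the patent:

```python
N = 512  # nodes per layer in the Figure 41 example

def dz_weight_location(j, i):
    """(row, word) of the weight for the connection D[j] -> Z[i]: rows 0-511."""
    return (j, i)

def zy_weight_location(j, i):
    """(row, word) of the weight for the connection Z[j] -> Y[i]: rows 512-1023."""
    return (N + j, i)
```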
As shown, the subsequent 512 rows (rows 512 through 1023) of the weight RAM 124 hold, in a similar fashion, the weights associated with the connections between the hidden nodes Z and the output nodes Y.
The data RAM 122 holds the Elman recurrent neural network node values for a series of time steps. More specifically, the data RAM 122 holds the node values for a given time step in a group of three rows. As shown, in an embodiment of the data RAM 122 having 64 rows, the data RAM 122 can hold the node values for 20 different time steps. In the example of Figure 41, rows 0 through 2 hold the node values for time step 0, rows 3 through 5 hold the node values for time step 1, and so on, and rows 57 through 59 hold the node values for time step 19. The first row of each group holds the values of the input nodes D of the time step. The second row of each group holds the values of the hidden nodes Z of the time step. The third row of each group holds the values of the output nodes Y of the time step. As shown, each column of the data RAM 122 holds the node values for its corresponding neuron, or neural processing unit 126. That is, column 0 holds the node values associated with nodes D0, Z0, and Y0, whose calculations are performed by neural processing unit 0; column 1 holds the node values associated with nodes D1, Z1, and Y1, whose calculations are performed by neural processing unit 1; and so on, and column 511 holds the node values associated with nodes D511, Z511, and Y511, whose calculations are performed by neural processing unit 511, as described in more detail below with respect to Figure 42.
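The three-rows-per-time-step grouping can be captured in a one-line plain-Python helper (illustrative only; the dict keys are not patent terminology):

```python
def elman_dram_rows(t):
    """Data RAM 122 rows holding the node values of time step t (Figure 41):
    row 3t holds D, row 3t+1 holds Z, row 3t+2 holds Y."""
    return {"D": 3 * t, "Z": 3 * t + 1, "Y": 3 * t + 2}
```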
As indicated in Figure 41, for a given time step, the hidden node Z values in the second row of each three-row group serve as the content node C values of the next time step. That is, the Z value a neural processing unit 126 computes and writes during a time step becomes the C value that the neural processing unit 126 uses (along with the input node D value of that next time step) to compute the Z value during the next time step. The initial value of the content nodes C (i.e., the C value used to compute the Z value written to row 1 for time step 0) is assumed to be zero. This is described in more detail below in the sections pertaining to the non-architectural program of Figure 42.
Preferably, the input node D values (the values in rows 0, 3, and so on through 57 in the example of Figure 41) are written/populated into the data RAM 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 42. Conversely, the hidden/output node Z/Y values (the values in rows 1 and 2, 4 and 5, and so on through 58 and 59 in the example of Figure 41) are written/populated into the data RAM 122 by the non-architectural program running on the neural network unit 121, and are read/used by the architectural program running on the processor 100 via MFNN instructions 1500. The example of Figure 41 assumes the architectural program performs the following steps: (1) populates the data RAM 122 with the input node D values for 20 different time steps (rows 0, 3, and so on through 57); (2) starts the non-architectural program of Figure 42; (3) detects that the non-architectural program has finished; (4) reads the output node Y values from the data RAM 122 (rows 2, 5, and so on through 59); and (5) repeats steps (1) through (4) as many times as needed to complete a task, such as the calculations required to recognize the speech of a mobile phone user.
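Steps (1) through (4) can be sketched in plain Python over a dict standing in for the data RAM 122. The callables `start_nnu` and `nnu_done` are illustrative stand-ins for starting the non-architectural program and polling its completion, and the MTNN/MFNN instructions are modeled as simple dict writes and reads:

```python
def run_architectural_program(inputs, start_nnu, nnu_done, dram):
    # (1) write each time step's input node D values to rows 0, 3, ..., 57
    #     (MTNN instruction 1400 in the real architectural program)
    for t, d in enumerate(inputs):
        dram[3 * t] = d
    start_nnu()                  # (2) start the non-architectural program
    while not nnu_done():        # (3) wait until it has finished
        pass
    # (4) read the output node Y values from rows 2, 5, ..., 59 (MFNN 1500)
    return [dram[3 * t + 2] for t in range(len(inputs))]
```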
In an alternate approach, the architectural program performs the following steps: (1) populates the data RAM 122 with the input node D values for a single time step (e.g., row 0); (2) starts the non-architectural program (a modified version of the non-architectural program of Figure 42 that does not loop and accesses only a single group of three data RAM 122 rows); (3) detects that the non-architectural program has finished; (4) reads the output node Y values from the data RAM 122 (e.g., row 2); and (5) repeats steps (1) through (4) as many times as needed to complete a task. Which of the two approaches is superior may depend upon the manner in which the input values of the recurrent neural network are sampled. For example, if the task tolerates sampling the input over multiple time steps (e.g., on the order of 20 time steps) and then performing the calculations, the first approach may be preferable since it likely yields more computational resource efficiency and/or better performance; however, if the task only tolerates sampling at a single time step, the second approach may be required.
A third embodiment is similar to the second approach but, unlike the second approach, which uses a single group of three data RAM 122 rows, the non-architectural program of this approach uses multiple three-row groups of memory, that is, a different three-row group for each time step, which is similar to the first approach. In this third embodiment, preferably, the architectural program includes a step before step (2) in which it updates the non-architectural program before starting it, e.g., updates the data RAM 122 row in the instruction at address 1 to point to the next three-row group.
Figure 42 is a table showing a program stored in the program memory 129 of the neural network unit 121 that is executed by the neural network unit 121 and uses data and weights according to the arrangement of Figure 41 to accomplish the Elman recurrent neural network. Some of the instructions in the non-architectural program of Figure 42 (and of Figures 45, 48, 51, 54, and 57), e.g., the multiply-accumulate (MULT-ACCUM), loop (LOOP), and initialize (INITIALIZE) instructions, have been described in detail above; the following paragraphs assume these instructions are consistent with that description, unless otherwise noted.
The example program of Figure 42 includes 13 non-architectural instructions, at addresses 0 through 12, respectively. The instruction at address 0 (INITIALIZE NPU, LOOPCNT=20) clears the accumulator 202 and initializes the loop counter 3804 to a value of 20 to cause the loop body (the instructions of addresses 4 through 11) to be performed 20 times. Preferably, the initialize instruction also puts the neural network unit 121 in a wide configuration such that the neural network unit 121 is configured as 512 neural processing units 126. As may be observed from the description below, the 512 neural processing units 126 operate as, and correspond to, the 512 hidden layer nodes Z during the execution of the instructions of addresses 1 through 3 and addresses 7 through 11, and operate as, and correspond to, the 512 output layer nodes Y during the execution of the instructions of addresses 4 through 6.
The instructions at addresses 1 through 3 are outside the program loop body and are executed only once. They compute the initial value of the hidden layer nodes Z and write it to row 1 of the data RAM 122 for use by the first execution of the instructions of addresses 4 through 6, which compute the output layer nodes Y of the first time step (time step 0). Additionally, the hidden layer node Z values computed by the instructions of addresses 1 through 3 and written to row 1 of the data RAM 122 become the content layer node C values used by the first execution of the instructions of addresses 7 and 8 to compute the hidden layer node Z values of the second time step (time step 1).
During the execution of the instructions at addresses 1 and 2, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in row 0 of the data RAM 122 by that neural processing unit's 126 respective column of weights from rows 0 through 511 of the weight RAM 124 to generate 512 products that are accumulated into the accumulator 202 of the respective neural processing unit 126. During the execution of the instruction at address 3, the 512 accumulator 202 values of the 512 neural processing units are passed through and written to row 1 of the data RAM 122. That is, the output instruction at address 3 writes to row 1 of the data RAM 122 the accumulator 202 value of each of the 512 neural processing units, which is the initial hidden layer Z value; the instruction then clears the accumulator 202.
The operations performed by the instructions at addresses 1 and 2 of the non-architectural program of Figure 42 are similar to the operations performed by the instructions at addresses 1 and 2 of the non-architectural program of Figure 4. More specifically, the instruction at address 1 (MULT_ACCUM DR ROW 0) instructs each of the 512 neural processing units 126 to read its respective word of row 0 of the data RAM 122 into its mux-reg 208, to read its respective word of row 0 of the weight RAM 124 into its mux-reg 705, to multiply the data word and the weight word to generate a product, and to add the product to the accumulator 202. The instruction at address 2 (MULT-ACCUM ROTATE, WR ROW+1, COUNT=511) instructs each of the 512 neural processing units 126 to rotate the word from the adjacent neural processing unit 126 into its mux-reg 208 (using the 512-word rotater collectively formed by the 512 mux-regs 208 of the neural network unit 121 operating together, these being the registers into which the instruction at address 1 directed the data RAM 122 row to be read), to read its respective word of the next row of the weight RAM 124 into its mux-reg 705, to multiply the data word and the weight word to generate a product and add the product to the accumulator 202, and to perform the foregoing operations 511 times.
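A plain-Python sketch of this rotate-and-accumulate pattern follows. It is an illustrative model only, not the hardware: the rotation direction shown is an assumption, and the key property it demonstrates is that after one MULT-ACCUM and 511 rotations every neural processing unit has accumulated a product for every word of the data row exactly once.

```python
def rotating_mult_accum(data_row, weight_rows):
    """Model of MULT-ACCUM (address 1) plus 511 rotating MULT-ACCUMs (address 2).

    Every position reads the same weight row each step while the data row
    rotates one word per step through the mux-regs, so each position sees
    every data word exactly once across the 512 steps."""
    n = len(data_row)
    acc = [0.0] * n
    mux = list(data_row)                         # mux-regs loaded from the data RAM row
    for r in range(n):
        for i in range(n):
            acc[i] += mux[i] * weight_rows[r][i]  # weight RAM row r, word i
        mux = mux[-1:] + mux[:-1]                # one-word collective rotation (direction assumed)
    return acc
```

With all-ones weights, every accumulator therefore ends up holding the sum of the entire data row.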
Additionally, the single non-architectural output instruction at address 3 of Figure 42 (OUTPUT PASSTHRU, DR OUT ROW 1, CLR ACC) merges the operations of the activation function instruction and the write output instruction at addresses 3 and 4 of Figure 4 (although the program of Figure 42 passes through the accumulator 202 value, whereas the program of Figure 4 performs an activation function on the accumulator 202 value). That is, in the program of Figure 42, the activation function, if any, performed on the accumulator 202 value is specified in the output instruction (as it is also specified in the output instructions at addresses 6 and 11), rather than in a separate non-architectural activation function instruction as in the program of Figure 4. An alternate embodiment of the non-architectural program of Figure 4 (and of Figures 20, 26A, and 28) in which the operations of the activation function instruction and the write output instruction are merged into a single non-architectural output instruction as in Figure 42 also falls within the scope of the present invention. The example of Figure 42 assumes the nodes of the hidden layer (Z) perform no activation function on the accumulator value. However, embodiments in which the hidden layer (Z) performs an activation function on the accumulator value also fall within the scope of the present invention; such embodiments can do so using the instructions at addresses 3 and 11, e.g., with a sigmoid, tanh, or rectify function.
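The activation functions an output instruction may specify (sigmoid, tanh, rectify), alongside the pass-through case, can be sketched in plain floating-point Python. The actual activation function units operate on fixed-point accumulator values, so this is only an illustrative model:

```python
import math

ACTIVATIONS = {
    "sigmoid":  lambda x: 1.0 / (1.0 + math.exp(-x)),  # "S-type" function
    "tanh":     math.tanh,
    "rectify":  lambda x: max(0.0, x),
    "passthru": lambda x: x,   # OUTPUT PASSTHRU: accumulator value passed unchanged
}
```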
In contrast to the instructions at addresses 1 through 3, which are executed only once, the instructions at addresses 4 through 11 are inside the program loop and are executed the number of times specified by the loop count, which is 20 in this case. The first 19 executions of the instructions at addresses 7 through 11 compute the hidden layer node Z values and write them to the data RAM 122 for use by the second through twentieth executions of the instructions at addresses 4 through 6 to compute the output layer nodes Y of the remaining time steps (time steps 1 through 19). (The last/twentieth execution of the instructions at addresses 7 through 11 computes the hidden layer node Z values and writes them to row 61 of the data RAM 122, but these values are not used.)
During the first execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW+1, WR ROW 512 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), which corresponds to time step 0, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in row 1 of the data RAM 122 (which were generated and written by the single execution of the instructions of addresses 1 through 3) by that neural processing unit's 126 respective column of weights from rows 512 through 1023 of the weight RAM 124 to generate 512 products that are accumulated into the accumulator 202 of the respective neural processing unit 126. During the first execution of the instruction at address 6 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW+1, CLR ACC), an activation function (e.g., sigmoid, tanh, rectify) is performed on the 512 accumulated values to compute the output layer node Y values, and the results are written to row 2 of the data RAM 122.
During the second execution of the instructions at addresses 4 and 5, which corresponds to time step 1, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in row 4 of the data RAM 122 (which were generated and written by the first execution of the instructions of addresses 7 through 11) by that neural processing unit's 126 respective column of weights from rows 512 through 1023 of the weight RAM 124 to generate 512 products that are accumulated into the accumulator 202 of the respective neural processing unit 126; and during the second execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, which are written to row 5 of the data RAM 122. During the third execution of the instructions at addresses 4 and 5, which corresponds to time step 2, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in row 7 of the data RAM 122 (which were generated and written by the second execution of the instructions of addresses 7 through 11) by that neural processing unit's 126 respective column of weights from rows 512 through 1023 of the weight RAM 124 to generate 512 products that are accumulated into the accumulator 202 of the respective neural processing unit 126; and during the third execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, which are written to row 8 of the data RAM 122; and so on until, during the twentieth execution of the instructions at addresses 4 and 5, which corresponds to time step 19, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in row 58 of the data RAM 122 (which were generated and written by the nineteenth execution of the instructions of addresses 7 through 11) by that neural processing unit's 126 respective column of weights from rows 512 through 1023 of the weight RAM 124 to generate 512 products that are accumulated into the accumulator 202 of the respective neural processing unit 126; and during the twentieth execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, which are written to row 59 of the data RAM 122.
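Across these 20 executions, the data RAM 122 rows touched by the instructions at addresses 4 through 6 advance in lockstep, which can be summarized in a small illustrative helper (`k` indexes the executions from zero; the names are not patent terminology):

```python
def output_layer_rows(k):
    """Data RAM 122 rows used by execution k (k = 0..19) of addresses 4-6:
    hidden node Z values are read from row 3k+1 and the computed output
    node Y values are written to row 3k+2."""
    return {"read_Z": 3 * k + 1, "write_Y": 3 * k + 2}
```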
During the first execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 content node C values of row 1 of the data RAM 122, which were generated by the single execution of the instructions of addresses 1 through 3. More specifically, the instruction at address 7 (ADD_D_ACC DR ROW+0) instructs each of the 512 neural processing units 126 to read its respective word of the current row of the data RAM 122 (row 1 during the first execution) into its mux-reg 208 and to add the word to the accumulator 202. The instruction at address 8 (ADD_D_ACC ROTATE, COUNT=511) instructs each of the 512 neural processing units 126 to rotate the word from the adjacent neural processing unit 126 into its mux-reg 208 (using the 512-word rotater collectively formed by the 512 mux-regs 208 of the neural network unit 121 operating together, these being the registers into which the instruction at address 7 directed the data RAM 122 row to be read), to add the word to the accumulator 202, and to perform the foregoing operations 511 times.
During the second execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 content node C values of row 4 of the data RAM 122, which were generated and written by the first execution of the instructions of addresses 9 through 11; during the third execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 content node C values of row 7 of the data RAM 122, which were generated and written by the second execution of the instructions of addresses 9 through 11; and so on until, during the twentieth execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 content node C values of row 58 of the data RAM 122, which were generated and written by the nineteenth execution of the instructions of addresses 9 through 11.
As stated above, the example of Figure 42 assumes the weights associated with the connections from the content nodes C to the hidden layer nodes Z have a value of one. However, in an alternate embodiment in which these connections of the Elman recurrent neural network have non-unity weight values, the weights are placed in the weight RAM 124 (e.g., in rows 1024 through 1535) prior to the execution of the Figure 42 program, the program instruction at address 7 is MULT-ACCUM DR ROW+0, WR ROW 1024, and the program instruction at address 8 is MULT-ACCUM ROTATE, WR ROW+1, COUNT=511. Preferably, the instruction at address 8 does not access the weight RAM 124, but instead rotates the values that the instruction at address 7 read from the weight RAM 124 into the mux-regs 705. Refraining from accessing the weight RAM 124 during the 511 clock cycles in which the instruction at address 8 executes leaves more bandwidth for the architectural program to access the weight RAM 124.
During the first execution of the instructions at addresses 9 and 10 (MULT-ACCUM DR ROW+2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), which corresponds to time step 1, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in row 3 of the data RAM 122 by that neural processing unit's 126 respective column of weights from rows 0 through 511 of the weight RAM 124 to generate 512 products that, along with the accumulation performed by the instructions at addresses 7 and 8 on the 512 content node C values, are accumulated into the accumulator 202 of the respective neural processing unit 126 to compute the hidden layer node Z values; during the first execution of the instruction at address 11 (OUTPUT PASSTHRU, DR OUT ROW+2, CLR ACC), the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to row 4 of the data RAM 122, and the accumulator 202 is cleared. During the second execution of the instructions at addresses 9 and 10, which corresponds to time step 2, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in row 6 of the data RAM 122 by that neural processing unit's 126 respective column of weights from rows 0 through 511 of the weight RAM 124 to generate 512 products that, along with the accumulation performed by the instructions at addresses 7 and 8 on the 512 content node C values, are accumulated into the accumulator 202 of the respective neural processing unit 126 to compute the hidden layer node Z values; during the second execution of the instruction at address 11, the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to row 7 of the data RAM 122, and the accumulator 202 is cleared; and so on until, during the nineteenth execution of the instructions at addresses 9 and 10, which corresponds to time step 19, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in row 57 of the data RAM 122 by that neural processing unit's 126 respective column of weights from rows 0 through 511 of the weight RAM 124 to generate 512 products that, along with the accumulation performed by the instructions at addresses 7 and 8 on the 512 content node C values, are accumulated into the accumulator 202 of the respective neural processing unit 126 to compute the hidden layer node Z values; and during the nineteenth execution of the instruction at address 11, the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to row 58 of the data RAM 122, and the accumulator 202 is cleared. As noted above, the hidden layer node Z values generated and written during the twentieth execution of the instructions at addresses 9 and 10 are not used.
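The row progression of the hidden-layer portion of the loop (addresses 7 through 11) can likewise be summarized in an illustrative helper (`k` indexes the executions from zero; the names are not patent terminology):

```python
def hidden_layer_rows(k):
    """Data RAM 122 rows used by execution k (k = 0..19) of addresses 7-11:
    content node C values are read from row 3k+1, input node D values from
    row 3k+3, and the computed hidden node Z values are written to row 3k+4
    (row 61 on the final execution, whose values go unused)."""
    return {"read_C": 3 * k + 1, "read_D": 3 * k + 3, "write_Z": 3 * k + 4}
```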
The instruction at address 12 (LOOP 4) decrements the loop counter 3804 and loops back to the instruction at address 4 if the new loop counter 3804 value is greater than zero.
Figure 43 is a block diagram showing an example of a Jordan recurrent neural network. The Jordan recurrent neural network of Figure 43 is similar to the Elman recurrent neural network of Figure 40 in that it has input layer nodes/neurons D, hidden layer nodes/neurons Z, output layer nodes/neurons Y, and content layer nodes/neurons C. However, in the Jordan recurrent neural network of Figure 43, the content layer nodes C receive their input connections fed back from the outputs of their corresponding output layer nodes Y, rather than from the outputs of the hidden layer nodes Z as in the Elman recurrent neural network of Figure 40.
For purposes of the present invention, a Jordan recurrent neural network is a recurrent neural network that includes at least an input node layer, a hidden node layer, an output node layer, and a content node layer. At the beginning of a given time step, the content node layer stores the result that the output node layer produced and fed back to the content node layer in the previous time step. The result fed back to the content layer may be the result of an activation function or the result of an accumulation performed by the output node layer without an activation function applied.
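One Jordan time step under this definition can be sketched in plain Python. The weight-matrix convention (`W_dz[j][i]` for D[j] -> Z[i], `W_zy[j][i]` for Z[j] -> Y[i]) is illustrative, the C -> Z weights are unity, and the flag selecting pre-activation feedback is an illustrative device (Figure 44's example below takes the pre-activation accumulator value):

```python
def jordan_step(D, C, W_dz, W_zy, act=lambda x: x, feedback_preact=True):
    n = len(C)
    # Hidden nodes accumulate the weighted inputs plus the unity-weight
    # content feedback, as in the Elman case.
    Z = [sum(D[j] * W_dz[j][i] for j in range(len(D))) + C[i] for i in range(n)]
    acc = [sum(Z[j] * W_zy[j][i] for j in range(n)) for i in range(n)]
    Y = [act(a) for a in acc]
    # The output layer, not the hidden layer, feeds the content layer; either
    # the pre-activation accumulation or the activation result may be fed back.
    C_next = acc if feedback_preact else Y
    return Z, Y, C_next
```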
Figure 44 is a block diagram showing an example of the data layout within the data RAM 122 and weight RAM 124 of the neural network unit 121 as the neural network unit 121 performs calculations associated with the Jordan recurrent neural network of Figure 43. The example of Figure 44 assumes the Jordan recurrent neural network of Figure 43 has 512 input nodes D, 512 hidden nodes Z, 512 content nodes C, and 512 output nodes Y. It is also assumed that the Jordan recurrent neural network is fully connected, i.e., all 512 input nodes D connect to each hidden node Z as inputs, all 512 content nodes C connect to each hidden node Z as inputs, and all 512 hidden nodes Z connect to each output node Y as inputs. Although the example Jordan recurrent neural network of Figure 44 applies an activation function to the accumulator 202 value to generate the output layer node Y values, the example assumes that the pre-activation-function accumulator 202 value, rather than the actual output node Y value, is passed to the content layer nodes C. Additionally, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. Finally, the example assumes the weights associated with the connections from the content nodes C to the hidden nodes Z all have a value of one; hence, there is no need to store these unity weight values.
As in the example of Figure 41, as shown, the lower 512 rows (rows 0 through 511) of the weight RAM 124 hold the weight values associated with the connections between the input nodes D and the hidden nodes Z, and the subsequent 512 rows (rows 512 through 1023) of the weight RAM 124 hold the weight values associated with the connections between the hidden nodes Z and the output nodes Y.
The data random access memory 122 holds the Jordan time-recurrent neural network node values for a series of time steps, similar to the example of Figure 41; however, in the example of Figure 44 a group of four memory rows holds the node values of a given time step. As shown in the Figure, in an embodiment of the data random access memory 122 having 64 rows, the data random access memory 122 can hold the node values needed for 16 different time steps. In the example of Figure 44, rows 0 through 3 hold the node values for time step 0, rows 4 through 7 hold the node values for time step 1, and so forth, and rows 60 through 63 hold the node values for time step 15. The first row of each four-row group holds the input node D values of the time step. The second row of each four-row group holds the hidden node Z values of the time step. The third row of each four-row group holds the content node C values of the time step. The fourth row of each four-row group holds the output node Y values of the time step. As shown in the Figure, each column of the data random access memory 122 holds the node values of its corresponding neuron, or neural processing unit 126. That is, column 0 holds the node values associated with nodes D0, Z0, C0 and Y0, whose calculations are performed by neural processing unit 0; column 1 holds the node values associated with nodes D1, Z1, C1 and Y1, whose calculations are performed by neural processing unit 1; and so forth, and column 511 holds the node values associated with nodes D511, Z511, C511 and Y511, whose calculations are performed by neural processing unit 511, as described in more detail below.
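For illustration only, the four-row grouping just described may be sketched as a small indexing helper; the function name is hypothetical and not part of the described hardware:

```python
def jordan_dram_rows(t):
    """Data RAM 122 rows holding the node values for time step t in the
    Figure 44 arrangement: four rows per step, in D, Z, C, Y order."""
    base = 4 * t
    return {"D": base, "Z": base + 1, "C": base + 2, "Y": base + 3}

# Time step 0 occupies rows 0-3; time step 15 occupies rows 60-63.
print(jordan_dram_rows(0))   # {'D': 0, 'Z': 1, 'C': 2, 'Y': 3}
print(jordan_dram_rows(15))  # {'D': 60, 'Z': 61, 'C': 62, 'Y': 63}
```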
The content node C values of a given time step in Figure 44 are generated in that time step and serve as an input to the next time step. That is, the C node values that a neural processing unit 126 computes and writes during a time step become the C node values that the neural processing unit 126 uses (along with the input node D values of that next time step) to compute the Z node values during the next time step. The initial value of the content nodes C (that is, the C node values used to compute the Z node values in row 1 for time step 0) is assumed to be zero. This is described in more detail in the sections below corresponding to the non-architectural program of Figure 45.
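The recurrence just described may be sketched in NumPy. The activation function and the weight shapes are illustrative assumptions; the text fixes only that the C-to-Z connection weights are one (so C is simply added into the Z accumulation) and that C starts at zero:

```python
import numpy as np

def jordan_step(D, C, W_dz, W_zy, act=np.tanh):
    """One Jordan-network time step under the stated assumptions."""
    Z = act(W_dz @ D + C)   # hidden node values for this time step
    acc = W_zy @ Z          # accumulated (pre-activation) output values
    Y = act(acc)            # output node values
    return Z, Y, acc        # acc feeds back as the next step's C values

rng = np.random.default_rng(0)
W_dz = rng.standard_normal((512, 512)) * 0.01
W_zy = rng.standard_normal((512, 512)) * 0.01
C = np.zeros(512)                     # initial content values are zero
for t in range(16):                   # time steps 0 through 15
    D = rng.standard_normal(512)
    Z, Y, C = jordan_step(D, C, W_dz, W_zy)
```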
As described above with respect to Figure 41, preferably the input node D values (the values of rows 0, 4, and so forth to row 60 in the example of Figure 44) are written/populated into the data random access memory 122 via MTNN instructions 1400 by the architectural program running on the processor 100, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 45. Conversely, the hidden node Z / content node C / output node Y values (the values of rows 1/2/3, 5/6/7, and so forth to rows 61/62/63, respectively, in the example of Figure 44) are written/populated into the data random access memory 122 by the non-architectural program running on the neural network unit 121, and are read/used via MFNN instructions 1500 by the architectural program running on the processor 100.
The example of Figure 44 assumes that the architectural program performs the following steps: (1) for 15 different time steps, populates the data random access memory 122 with the input node D values (rows 0, 4, and so forth to row 60); (2) starts the non-architectural program of Figure 45; (3) detects whether the non-architectural program has finished; (4) reads the output node Y values from the data random access memory 122 (rows 3, 7, and so forth to row 63); and (5) repeats steps (1) through (4) as many times as needed to complete a task, such as the calculations needed to recognize the utterance of a mobile phone user.
In an alternative approach, the architectural program performs the following steps: (1) for a single time step, populates the data random access memory 122 with the input node D values (e.g., row 0); (2) starts the non-architectural program (a modified version of the Figure 45 non-architectural program that does not loop and accesses only a single group of four rows of the data random access memory 122); (3) detects whether the non-architectural program has finished; (4) reads the output node Y values from the data random access memory 122 (e.g., row 3); and (5) repeats steps (1) through (4) as many times as needed to complete a task. Which of the two approaches is preferable may depend upon the manner in which the input values of the time-recurrent neural network are sampled. For example, if the task tolerates sampling the input over multiple time steps (e.g., on the order of 15 time steps) and then performing the calculations, the first approach may be preferable, since it is likely more computational-resource-efficient and/or higher-performing; whereas if the task tolerates sampling in only a single time step, the second approach may be required.
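The first approach's steps (1) through (4) may be sketched as follows; the callables `write_dram`, `read_dram`, `start` and `done` are hypothetical stand-ins for the MTNN 1400 instruction, the MFNN 1500 instruction, and starting/polling the non-architectural program, not real APIs:

```python
def run_jordan_first_mode(write_dram, start, done, read_dram, inputs):
    """Sketch of one iteration of the first approach's steps (1)-(4)."""
    for t, d_values in enumerate(inputs):   # step (1): input node D values
        write_dram(4 * t, d_values)         # rows 0, 4, ..., 60
    start()                                 # step (2)
    while not done():                       # step (3): poll for completion
        pass
    return [read_dram(4 * t + 3)            # step (4): output node Y values
            for t in range(len(inputs))]    # rows 3, 7, ..., 63

# Toy harness: a dict models the data RAM; the "program" finishes instantly.
dram, started = {}, []
outputs = run_jordan_first_mode(
    write_dram=lambda row, v: dram.__setitem__(row, v),
    start=lambda: started.append(True),
    done=lambda: bool(started),
    read_dram=lambda row: dram.get(row),
    inputs=[[t] * 4 for t in range(15)])
```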
A third embodiment is similar to the second approach but, unlike the second approach, which uses a single group of four data random access memory 122 rows, the non-architectural program of this approach uses multiple four-row groups, that is, a different four-row group for each time step, similar to the first approach. In this third embodiment, preferably the architectural program includes a step before step (2) in which it updates the non-architectural program before starting it, for example by updating the data random access memory 122 row in the instruction at address 1 to point to the next four-row group.
Figure 45 is a table showing a program stored in the program memory 129 of the neural network unit 121, which program is executed by the neural network unit 121 and uses data and weights according to the arrangement of Figure 44 to accomplish a Jordan time-recurrent neural network. The non-architectural program of Figure 45 is similar in many respects to the non-architectural program of Figure 42; their differences are described in the relevant sections herein.
The example program of Figure 45 includes 14 non-architectural instructions, at addresses 0 through 13, respectively. The instruction at address 0 is an initialize instruction that clears the accumulator 202 and initializes the loop counter 3804 to a value of 15 to perform the loop body (the instructions of addresses 4 through 12) 15 times. Preferably, the initialize instruction also puts the neural network unit 121 into a wide configuration such that it is configured as 512 neural processing units 126. As described herein, during the execution of the instructions of addresses 1 through 3 and addresses 8 through 12, the 512 neural processing units 126 correspond to and operate as the 512 hidden layer nodes Z, and during the execution of the instructions of addresses 4, 5 and 7, the 512 neural processing units 126 correspond to and operate as the 512 output layer nodes Y.
The instructions of addresses 1 through 5 and address 7 are the same as the instructions of addresses 1 through 6 of Figure 42 and have the same functions. The instructions of addresses 1 through 3 compute the initial values of the hidden layer nodes Z and write them to row 1 of the data random access memory 122 for use by the first execution of the instructions of addresses 4, 5 and 7 to compute the output layer nodes Y of the first time step (time step 0).
During the first execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions of addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the output layer node Y values) are passed through and written to row 2 of the data random access memory 122; these are the content layer node C values generated in the first time step (time step 0) and used in the second time step (time step 1). During the second execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions of addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the output layer node Y values) are passed through and written to row 6 of the data random access memory 122; these are the content layer node C values generated in the second time step (time step 1) and used in the third time step (time step 2); and so forth until, during the fifteenth execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions of addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the output layer node Y values) are passed through and written to row 58 of the data random access memory 122; these are the content layer node C values generated in the fifteenth time step (time step 14) (which are read by the instruction at address 8, but are not used).
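As a reading aid only, the data RAM row written by each execution of the address-6 output instruction follows a simple pattern (the function name is hypothetical):

```python
def c_output_row(k):
    """Data RAM 122 row written by the k-th execution (1-based) of the
    address-6 output instruction: the content node C values produced in
    time step k-1 and consumed in time step k."""
    return 2 + 4 * (k - 1)

print(c_output_row(1))   # 2  -- C values of time step 0
print(c_output_row(15))  # 58 -- read by address 8 but unused
```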
The instructions of addresses 8 through 12 are substantially the same as the instructions of addresses 7 through 11 of Figure 42 and have the same functions, with one difference. The difference is that the instruction at address 8 of Figure 45 (ADD_D_ACC DR ROW+1) increments the data random access memory 122 row by one, whereas the instruction at address 7 of Figure 42 (ADD_D_ACC DR ROW+0) increments the data random access memory 122 row by zero. This difference is due to the different arrangement of the data within the data random access memory 122; in particular, the four-row group arrangement of Figure 44 includes a separate row for the content layer node C values (e.g., rows 2, 6, 10, etc.), whereas the three-row group arrangement of Figure 41 has no such separate row, but instead the content layer node C values share a row with the hidden layer node Z values (e.g., rows 1, 4, 7, etc.). The 15 executions of the instructions of addresses 8 through 12 compute the hidden layer node Z values and write them to the data random access memory 122 (to rows 5, 9, 13, and so forth to row 57) for use by the second through fifteenth executions of the instructions of addresses 4, 5 and 7 to compute the output layer node Y values of the second through fifteenth time steps (time steps 1 through 14). (The last/fifteenth execution of the instructions of addresses 8 through 12 computes the hidden layer node Z values and writes them to row 61 of the data random access memory 122, but those values are not used.)
The loop instruction at address 13 decrements the loop counter 3804 and loops back to the instruction at address 4 if the new loop counter 3804 value is greater than zero.
In an alternative embodiment, the Jordan time-recurrent neural network design uses the content nodes C to hold the activation function values of the output nodes Y, that is, the accumulated values upon which the activation function has been performed. In this embodiment, since the output node Y values are the same as the content node C values, the non-architectural instruction at address 6 is not included in the non-architectural program. Fewer rows of the data random access memory 122 are thereby consumed. More precisely, none of the rows of Figure 44 that hold content node C values (e.g., rows 2, 6, and so forth) is present in this embodiment. Additionally, each time step of this embodiment requires only three rows of the data random access memory 122, so that 20 time steps can be accommodated rather than 15, and the addresses of the instructions of the non-architectural program of Figure 45 are also adjusted appropriately.
Long short-term memory cells
The use of long short-term memory (LSTM) cells in time-recurrent neural networks is a concept well known in the art. See, for example, Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, Neural Computation, November 15, 1997, Vol. 9, No. 8, Pages 1735-1780; Learning to Forget: Continual Prediction with LSTM, Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins, Neural Computation, October 2000, Vol. 12, No. 10, Pages 2451-2471; both available from MIT Press Journals. LSTM cells may be constructed in various different forms. The LSTM cell 4600 of Figure 46 described below is modeled after the LSTM cell described in the tutorial entitled LSTM Networks for Sentiment Analysis at http://deeplearning.net/tutorial/lstm.html, a copy of which was downloaded on October 19, 2015 (hereinafter the "LSTM tutorial") and provided with the information disclosure statement of the U.S. application of the present case. The LSTM cell 4600 serves to describe, in a general manner, the ability of embodiments of the neural network unit 121 described herein to efficiently perform calculations associated with long short-term memories. It should be noted that these embodiments of the neural network unit 121, including the embodiment described with respect to Figure 49, can efficiently perform calculations associated with other LSTM cells beyond the LSTM cell described in Figure 46.
Preferably, the neural network unit 121 may be used to perform calculations for a time-recurrent neural network that includes a layer of LSTM cells linked to other layers. For example, in the LSTM tutorial, the network includes a mean pooling layer that receives the outputs (H) of the LSTM cells of the LSTM layer, and a logistic regression layer that receives the outputs of the mean pooling layer.
Figure 46 is a block diagram showing an embodiment of an LSTM cell 4600.
As shown in the Figure, the LSTM cell 4600 includes a cell input (X), a cell output (H), an input gate (I), an output gate (O), a forget gate (F), a cell state (C) and a candidate cell state (C'). The input gate (I) gates the passage of the cell input (X) to the cell state (C), and the output gate (O) gates the passage of the cell state (C) to the cell output (H). The cell state (C) is fed back to a subsequent time step where, gated by the forget gate (F), it combines with the candidate cell state (C'), gated by the input gate (I), to become the cell state (C) of that next time step.
The embodiment of Figure 46 computes the various values above using the following equations:
(1) I=SIGMOID (Wi*X+Ui*H+Bi)
(2) F=SIGMOID (Wf*X+Uf*H+Bf)
(3) C '=TANH (Wc*X+Uc*H+Bc)
(4) C=I*C '+F*C
(5) O=SIGMOID (Wo*X+Uo*H+Bo)
(6) H=O*TANH (C)
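Equations (1) through (6) may be checked with a short NumPy sketch. Here each of 128 cells has scalar weights and the operations are elementwise, mirroring the one-cell-per-neural-processing-unit arrangement of Figures 47 and 48; the weight values themselves are random placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X, H, C, p):
    """One LSTM time step per equations (1)-(6); p maps the names
    Wi..Bo to length-128 arrays (one scalar weight per cell)."""
    I  = sigmoid(p["Wi"] * X + p["Ui"] * H + p["Bi"])   # (1) input gate
    F  = sigmoid(p["Wf"] * X + p["Uf"] * H + p["Bf"])   # (2) forget gate
    Cp = np.tanh(p["Wc"] * X + p["Uc"] * H + p["Bc"])   # (3) candidate state
    C  = I * Cp + F * C                                 # (4) new cell state
    O  = sigmoid(p["Wo"] * X + p["Uo"] * H + p["Bo"])   # (5) output gate
    H  = O * np.tanh(C)                                 # (6) cell output
    return H, C, (I, F, O, Cp)

rng = np.random.default_rng(1)
p = {k: rng.standard_normal(128) * 0.1
     for k in ("Wi", "Ui", "Bi", "Wf", "Uf", "Bf",
               "Wc", "Uc", "Bc", "Wo", "Uo", "Bo")}
H = np.zeros(128); C = np.zeros(128)
H, C, gates = lstm_step(rng.standard_normal(128), H, C, p)
```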
Wi and Ui are weight values associated with the input gate (I), and Bi is an offset value associated with the input gate (I). Wf and Uf are weight values associated with the forget gate (F), and Bf is an offset value associated with the forget gate (F). Wo and Uo are weight values associated with the output gate (O), and Bo is an offset value associated with the output gate (O). As shown above, equations (1), (2) and (5) compute the input gate (I), the forget gate (F) and the output gate (O), respectively. Equation (3) computes the candidate cell state (C'), and equation (4) computes the current cell state (C) using the candidate cell state (C') and the cell state (C) of the previous time step as inputs, the current cell state (C) being the cell state of the current time step. Equation (6) computes the cell output (H). Nevertheless, the invention is not limited in this respect; embodiments of LSTM cells that compute the input gate, forget gate, output gate, candidate cell state, cell state and cell output in other manners are also contemplated by the present invention.
For the purposes of the present disclosure, an LSTM cell includes a cell input, a cell output, a cell state, a candidate cell state, an input gate, an output gate and a forget gate. For each time step, the input gate, the output gate, the forget gate and the candidate cell state are functions of the cell input of the current time step, the cell output of the previous time step and the associated weights. The cell state of the current time step is a function of the cell state of the previous time step, the candidate cell state, the input gate and the forget gate. In this sense, the cell state is fed back and used to compute the cell state of the next time step. The cell output of the current time step is a function of the cell state computed for the current time step and the output gate. An LSTM neural network is a neural network having a layer of LSTM cells.
Figure 47 is a block diagram showing an example of the arrangement of data within the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 as it performs calculations associated with a layer of 128 LSTM cells 4600 of the LSTM neural network of Figure 46. In the example of Figure 47, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, for example in a wide configuration; however, only the values produced by 128 of the neural processing units 126 (e.g., neural processing units 0 through 127) are used, since the LSTM layer of this example has only 128 LSTM cells 4600.
As shown in the Figure, the weight random access memory 124 holds weight values, offset values and intermediate values for the corresponding neural processing units 0 through 127 of the neural network unit 121, in columns 0 through 127 of its rows. Each of rows 0 through 14 holds 128 of the following values, corresponding to equations (1) through (6) above, for provision to neural processing units 0 through 127: Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, C', TANH(C), C, Wo, Uo, Bo. Preferably, the weight and offset values — Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo, Bo (in rows 0 through 8 and 12 through 14) — are written/populated into the weight random access memory 124 via MTNN instructions 1400 by the architectural program running on the processor 100, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 48. Preferably, the intermediate values — C', TANH(C), C (in rows 9 through 11) — are written/populated into the weight random access memory 124, and also read/used, by the non-architectural program running on the neural network unit 121, as described below.
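The fifteen-row layout just listed may be captured as a lookup table; this is a reading aid only, not part of the hardware:

```python
# Row-to-value map for rows 0-14 of the weight RAM 124 (Figure 47).
WEIGHT_RAM_ROWS = {
    0: "Wi",  1: "Ui",       2: "Bi",   # input gate, equation (1)
    3: "Wf",  4: "Uf",       5: "Bf",   # forget gate, equation (2)
    6: "Wc",  7: "Uc",       8: "Bc",   # candidate state, equation (3)
    9: "C'", 10: "TANH(C)", 11: "C",    # intermediate values
    12: "Wo", 13: "Uo",     14: "Bo",   # output gate, equation (5)
}

# Rows 0-8 and 12-14 are written by the architectural program (MTNN 1400);
# rows 9-11 are written and read back by the non-architectural program.
ARCH_WRITTEN = [r for r in WEIGHT_RAM_ROWS if r <= 8 or r >= 12]
NNU_WRITTEN = [9, 10, 11]
```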
As shown in the Figure, the data random access memory 122 holds the input (X), output (H), input gate (I), forget gate (F) and output gate (O) values for a series of time steps. More specifically, a group of five memory rows holds the X, H, I, F and O values for a given time step. In an embodiment of the data random access memory 122 having 64 rows, as shown, the data random access memory 122 can hold the cell values needed for 12 different time steps. In the example of Figure 47, rows 0 through 4 hold the cell values for time step 0, rows 5 through 9 hold the cell values for time step 1, and so forth, and rows 55 through 59 hold the cell values for time step 11. The first row of each five-row group holds the X values of the time step. The second row of each five-row group holds the H values of the time step. The third row of each five-row group holds the I values of the time step. The fourth row of each five-row group holds the F values of the time step. The fifth row of each five-row group holds the O values of the time step. As shown in the Figure, each column of the data random access memory 122 holds the values of its corresponding neuron, or neural processing unit 126. That is, column 0 holds the values associated with LSTM cell 0, whose calculations are performed by neural processing unit 0; column 1 holds the values associated with LSTM cell 1, whose calculations are performed by neural processing unit 1; and so forth, and column 127 holds the values associated with LSTM cell 127, whose calculations are performed by neural processing unit 127, as described in detail below with respect to Figure 48.
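For illustration only, the five-row grouping may be expressed as an indexing helper analogous to the four-row Jordan arrangement earlier; the function name is hypothetical:

```python
def lstm_dram_rows(t):
    """Data RAM 122 rows for time step t in the Figure 47 arrangement:
    five rows per step, in X, H, I, F, O order."""
    base = 5 * t
    return {"X": base, "H": base + 1, "I": base + 2,
            "F": base + 3, "O": base + 4}

print(lstm_dram_rows(0))   # {'X': 0, 'H': 1, 'I': 2, 'F': 3, 'O': 4}
print(lstm_dram_rows(11))  # {'X': 55, 'H': 56, 'I': 57, 'F': 58, 'O': 59}
```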
Preferably, the X values (in rows 0, 5, 10 and so forth to row 55) are written/populated into the data random access memory 122 via MTNN instructions 1400 by the architectural program running on the processor 100, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 48. Preferably, the I, F and O values (in rows 2/3/4, 7/8/9, 12/13/14 and so forth to rows 57/58/59, respectively) are written/populated into the data random access memory 122 by the non-architectural program running on the neural network unit 121, as described below. Preferably, the H values (in rows 1, 6, 11 and so forth to row 56) are written/populated into the data random access memory 122, and read/used, by the non-architectural program running on the neural network unit 121, and are also read via MFNN instructions 1500 by the architectural program running on the processor 100.
The example of Figure 47 assumes that the architectural program performs the following steps: (1) for 12 different time steps, populates the data random access memory 122 with the input X values (rows 0, 5, and so forth to row 55); (2) starts the non-architectural program of Figure 48; (3) detects whether the non-architectural program has finished; (4) reads the output H values from the data random access memory 122 (rows 1, 6, and so forth to row 56); and (5) repeats steps (1) through (4) as many times as needed to complete a task, such as the calculations needed to recognize the utterance of a mobile phone user.
In an alternative approach, the architectural program performs the following steps: (1) for a single time step, populates the data random access memory 122 with the input X values (e.g., row 0); (2) starts the non-architectural program (a modified version of the Figure 48 non-architectural program that does not loop and accesses only a single group of five rows of the data random access memory 122); (3) detects whether the non-architectural program has finished; (4) reads the output H values from the data random access memory 122 (e.g., row 1); and (5) repeats steps (1) through (4) as many times as needed to complete a task. Which of the two approaches is preferable may depend upon the manner in which the input X values to the LSTM layer are sampled. For example, if the task tolerates sampling the input over multiple time steps (e.g., on the order of 12 time steps) and then performing the calculations, the first approach may be preferable, since it likely brings more computational-resource efficiency and/or higher performance; whereas if the task tolerates sampling in only a single time step, the second approach may be required.
A third embodiment is similar to the second approach but, unlike the second approach, which uses a single group of five data random access memory 122 rows, the non-architectural program of this approach uses multiple five-row groups, that is, a different five-row group for each time step, similar to the first approach. In this third embodiment, preferably the architectural program includes a step before step (2) in which it updates the non-architectural program before starting it, for example by updating the data random access memory 122 row in the instruction at address 0 to point to the next five-row group.
Figure 48 is a table showing a program stored in the program memory 129 of the neural network unit 121, which program is executed by the neural network unit 121 and uses data and weights according to the arrangement of Figure 47 to accomplish calculations associated with an LSTM cell layer. The example program of Figure 48 includes 24 non-architectural instructions, at addresses 0 through 23, respectively. The instruction at address 0 (INITIALIZE NPU, CLRACC, LOOPCNT=12, DR IN ROW=-1, DR OUT ROW=2) clears the accumulator 202 and initializes the loop counter 3804 to a value of 12 to perform the loop body (the instructions of addresses 1 through 22) 12 times. The initialize instruction also initializes the data random access memory 122 row to be read to a value of -1, which the first execution of the instruction at address 1 increments to zero. The initialize instruction also initializes the data random access memory 122 row to be written (e.g., the buffer 2606 of Figures 26 and 39) to row 2. Preferably, the initialize instruction also puts the neural network unit 121 into a wide configuration such that the neural network unit 121 is configured as 512 neural processing units 126. As may be observed from the description below, during the execution of the instructions of addresses 0 through 23, 128 of the 512 neural processing units 126 correspond to and operate as the 128 LSTM cells 4600.
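The effects of the address-0 initialize instruction may be modeled as follows; the class and field names are illustrative only, not the hardware's:

```python
class Fig48InitState:
    """Toy model of the address-0 INITIALIZE instruction of Figure 48."""
    def __init__(self):
        self.accumulator = 0.0   # accumulator 202, cleared
        self.loop_counter = 12   # loop counter 3804: 12 loop-body passes
        self.dr_in_row = -1      # read row; first DR ROW+1 bumps it to 0
        self.dr_out_row = 2      # write row; the first output is the I row

    def next_read_row(self, delta=1):
        """Models a DR ROW+delta data RAM access."""
        self.dr_in_row += delta
        return self.dr_in_row

s = Fig48InitState()
first_row = s.next_read_row()   # address 1, first execution: reads row 0
```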
During the first execution of the instructions of addresses 1 through 4, each of the 128 neural processing units 126 (i.e., neural processing units 0 through 127) computes the input gate (I) value for the first time step (time step 0) of its corresponding LSTM cell 4600 and writes the I value to the corresponding word of row 2 of the data random access memory 122; during the second execution of the instructions of addresses 1 through 4, each of the 128 neural processing units 126 computes the I value for the second time step (time step 1) of its corresponding LSTM cell 4600 and writes the I value to the corresponding word of row 7 of the data random access memory 122; and so forth until, during the twelfth execution of the instructions of addresses 1 through 4, each of the 128 neural processing units 126 computes the I value for the twelfth time step (time step 11) of its corresponding LSTM cell 4600 and writes the I value to the corresponding word of row 57 of the data random access memory 122, as shown in Figure 47.
More specifically, the multiply-accumulate instruction at address 1 reads the next row after the current data random access memory 122 row (row 0 on the first execution, row 5 on the second execution, and so forth to row 55 on the twelfth execution), which contains the cell input (X) values associated with the current time step, and also reads row 0 of the weight random access memory 124, which contains the Wi values, and multiplies them to produce a first product accumulated into the accumulator 202, which was just cleared by either the initialize instruction at address 0 or the instruction at address 22. Next, the multiply-accumulate instruction at address 2 reads the next data random access memory 122 row (row 1 on the first execution, row 6 on the second execution, and so forth to row 56 on the twelfth execution), which contains the cell output (H) values associated with the current time step, and also reads row 1 of the weight random access memory 124, which contains the Ui values, and multiplies them to produce a second product accumulated into the accumulator 202. The H values associated with the current time step, read from the data random access memory 122 by the instruction at address 2 (and by the instructions at addresses 6, 10 and 18), were produced in the previous time step and written to the data random access memory 122 by the output instruction at address 22; however, on the first execution, the instruction at address 2 reads row 1 of the data random access memory 122, which was written with initial values as the H values. Preferably, the architectural program writes the initial H values into row 1 of the data random access memory 122 (e.g., using MTNN instructions 1400) before starting the non-architectural program of Figure 48; however, the invention is not limited in this respect, and other embodiments in which the non-architectural program includes an initialize instruction that writes the initial H values into row 1 of the data random access memory 122 are also within the scope of the present invention. In one embodiment, the initial H values are zero. Next, the add-weight-word-to-accumulator instruction at address 3 (ADD_W_ACC WR ROW2) reads row 2 of the weight random access memory 124, which contains the Bi values, and adds them into the accumulator 202. Finally, the output instruction at address 4 (OUTPUT SIGMOID, DR OUT ROW+0, CLR ACC) performs a sigmoid activation function on the accumulator 202 values, writes the results to the current output row of the data random access memory 122 (row 2 on the first execution, row 7 on the second execution, and so forth to row 57 on the twelfth execution), and clears the accumulator 202.
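Under the assumption that each neural processing unit multiplies its data word by its weight word elementwise across the 128 cells, the four instructions at addresses 1 through 4 amount to the following sketch for one time step t (the function name and array model are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compute_I(data_ram, weight_ram, t):
    """Mirrors addresses 1-4 of the Figure 48 program for time step t,
    assuming elementwise per-NPU multiply-accumulate over 128 cells."""
    acc = np.zeros(128)                          # accumulator 202 (cleared)
    acc += data_ram[5 * t] * weight_ram[0]       # address 1: X row * Wi (row 0)
    acc += data_ram[5 * t + 1] * weight_ram[1]   # address 2: H row * Ui (row 1)
    acc += weight_ram[2]                         # address 3: + Bi (row 2)
    I = sigmoid(acc)                             # address 4: SIGMOID, then
    data_ram[5 * t + 2] = I                      # write the I row (2, 7, ..., 57)
    return I

rng = np.random.default_rng(2)
data_ram = rng.standard_normal((64, 128))
weight_ram = rng.standard_normal((15, 128)) * 0.1
I0 = compute_I(data_ram, weight_ram, 0)   # writes row 2
```

The F-value computation of addresses 5 through 8 described next follows the same pattern with weight RAM rows 3 through 5 and output rows 3, 8, ..., 58.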
On the first execution of the instructions at addresses 5 to 8, each of the 128 neural processing units 126 computes the forget gate (F) value for the first time step (time step 0) of its corresponding long short-term memory cell 4600 and writes the F value to the corresponding word of row 3 of the data random access memory 122; on the second execution of the instructions at addresses 5 to 8, each of the 128 neural processing units 126 computes the forget gate (F) value for the second time step (time step 1) of its corresponding long short-term memory cell 4600 and writes the F value to the corresponding word of row 8 of the data random access memory 122; and so on, until on the twelfth execution of the instructions at addresses 5 to 8, each of the 128 neural processing units 126 computes the forget gate (F) value for the twelfth time step (time step 11) of its corresponding long short-term memory cell 4600 and writes the F value to the corresponding word of row 58 of the data random access memory 122, as shown in Figure 47. The instructions at addresses 5 to 8 compute the F value in a manner similar to the instructions at addresses 1 to 4, except that the instructions at addresses 5 to 7 read the Wf, Uf, and Bf values from rows 3, 4, and 5, respectively, of the weight random access memory 124 to perform the multiplication and/or addition operations.
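The computation at addresses 5 to 8 is the standard LSTM forget-gate formula, F = sigmoid(Wf·X + Uf·H + Bf). A minimal sketch of what one neural processing unit accumulates and activates, using hypothetical names and a single scalar cell in place of the 128 hardware accumulators:

```python
import math

def sigmoid(z):
    # the sigmoid activation applied by the OUTPUT SIGMOID instruction
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(x, h_prev, wf, uf, bf):
    # addresses 5-7: accumulate Wf*X, then Uf*H, then add Bf
    acc = wf * x
    acc += uf * h_prev
    acc += bf
    # address 8: activate the accumulator, producing F, then clear it
    return sigmoid(acc)
```

The input gate (addresses 1 to 4) and output gate (addresses 17 to 20) follow the same pattern using the Wi/Ui/Bi and Wo/Uo/Bo values, respectively.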
During the twelve executions of the instructions at addresses 9 to 12, each of the 128 neural processing units 126 computes the candidate cell state (C') value for the corresponding time step of its long short-term memory cell 4600 and writes the C' value to the corresponding word of row 9 of the weight random access memory 124. The instructions at addresses 9 to 12 compute the C' value in a manner similar to the instructions at addresses 1 to 4, except that the instructions at addresses 9 to 11 read the Wc, Uc, and Bc values from rows 6, 7, and 8, respectively, of the weight random access memory 124 to perform the multiplication and/or addition operations. In addition, the output instruction at address 12 performs a tanh activation function rather than a sigmoid activation function (as the output instruction at address 4 does).

More specifically, the multiply-accumulate instruction at address 9 reads the current row of the data random access memory 122 (row 0 on the first execution, row 5 on the second, and so on through row 55 on the twelfth execution), which contains the cell input (X) values associated with the current time step, reads row 6 of the weight random access memory 124, which contains the Wc values, and multiplies them to generate a first product accumulated into the accumulator 202, which was just cleared by the instruction at address 8. Next, the multiply-accumulate instruction at address 10 reads the next row of the data random access memory 122 (row 1 on the first execution, row 6 on the second, and so on through row 56 on the twelfth execution), which contains the cell output (H) values associated with the current time step, reads row 7 of the weight random access memory 124, which contains the Uc values, and multiplies them to generate a second product added to the accumulator 202. Next, the add-weight-word-to-accumulator instruction at address 11 reads row 8 of the weight random access memory 124, which contains the Bc values, and adds it to the accumulator 202. Finally, the output instruction at address 12 (OUTPUT TANH, WR OUT ROW 9, CLR ACC) performs a tanh activation function on the accumulator 202 value, writes the result to row 9 of the weight random access memory 124, and clears the accumulator 202.
During the twelve executions of the instructions at addresses 13 to 16, each of the 128 neural processing units 126 computes the new cell state (C) value for the corresponding time step of its long short-term memory cell 4600 and writes the new C value to the corresponding word of row 11 of the weight random access memory 124; each neural processing unit 126 also computes tanh(C) and writes it to the corresponding word of row 10 of the weight random access memory 124. More specifically, the multiply-accumulate instruction at address 13 reads the second row past the current row of the data random access memory 122 (row 2 on the first execution, row 7 on the second, and so on through row 57 on the twelfth execution), which contains the input gate (I) values associated with the current time step, reads row 9 of the weight random access memory 124, which contains the candidate cell state (C') values (just written by the instruction at address 12), and multiplies them to generate a first product added to the accumulator 202, which was just cleared by the instruction at address 12. Next, the multiply-accumulate instruction at address 14 reads the next row of the data random access memory 122 (row 3 on the first execution, row 8 on the second, and so on through row 58 on the twelfth execution), which contains the forget gate (F) values associated with the current time step, reads row 11 of the weight random access memory 124, which contains the current cell state (C) values computed in the previous time step (written by the last execution of the instruction at address 15), and multiplies them to generate a second product added to the accumulator 202. Next, the output instruction at address 15 (OUTPUT PASSTHRU, WR OUT ROW 11) passes the accumulator 202 value through and writes it to row 11 of the weight random access memory 124. It should be understood that the C value read from row 11 of the weight random access memory 124 by the instruction at address 14 is the C value generated and written by the most recent execution of the instructions at addresses 13 to 15. The output instruction at address 15 does not clear the accumulator 202, so its value can be used by the instruction at address 16. Finally, the output instruction at address 16 (OUTPUT TANH, WR OUT ROW 10, CLR ACC) performs a tanh activation function on the accumulator 202 value and writes the result to row 10 of the weight random access memory 124 for use by the instruction at address 21 in computing the cell output (H) value. The instruction at address 16 clears the accumulator 202.
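In LSTM terms, addresses 13 to 16 implement the cell-state update C = I·C' + F·C_prev followed by tanh(C). A sketch under the same single-cell, scalar assumptions as above (names hypothetical):

```python
import math

def cell_state_update(i_gate, f_gate, c_candidate, c_prev):
    # address 13: first product I * C' (C' just written to weight RAM row 9)
    acc = i_gate * c_candidate
    # address 14: second product F * C_prev (C_prev read from weight RAM row 11)
    acc += f_gate * c_prev
    # address 15 passes the accumulator through as the new C (weight RAM row 11);
    # address 16 applies tanh and writes tanh(C) (weight RAM row 10)
    c_new = acc
    return c_new, math.tanh(c_new)
```

The returned tanh(C) is the value the instruction at address 21 later multiplies by the output gate.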
On the first execution of the instructions at addresses 17 to 20, each of the 128 neural processing units 126 computes the output gate (O) value for the first time step (time step 0) of its corresponding long short-term memory cell 4600 and writes the O value to the corresponding word of row 4 of the data random access memory 122; on the second execution of the instructions at addresses 17 to 20, each of the 128 neural processing units 126 computes the output gate (O) value for the second time step (time step 1) of its corresponding long short-term memory cell 4600 and writes the O value to the corresponding word of row 9 of the data random access memory 122; and so on, until on the twelfth execution of the instructions at addresses 17 to 20, each of the 128 neural processing units 126 computes the output gate (O) value for the twelfth time step (time step 11) of its corresponding long short-term memory cell 4600 and writes the O value to the corresponding word of row 59 of the data random access memory 122, as shown in Figure 47. The instructions at addresses 17 to 20 compute the O value in a manner similar to the instructions at addresses 1 to 4, except that the instructions at addresses 17 to 19 read the Wo, Uo, and Bo values from rows 12, 13, and 14, respectively, of the weight random access memory 124 to perform the multiplication and/or addition operations.
On the first execution of the instructions at addresses 21 to 22, each of the 128 neural processing units 126 computes the cell output (H) value for the first time step (time step 0) of its corresponding long short-term memory cell 4600 and writes the H value to the corresponding word of row 6 of the data random access memory 122; on the second execution of the instructions at addresses 21 to 22, each of the 128 neural processing units 126 computes the cell output (H) value for the second time step (time step 1) of its corresponding long short-term memory cell 4600 and writes the H value to the corresponding word of row 11 of the data random access memory 122; and so on, until on the twelfth execution of the instructions at addresses 21 to 22, each of the 128 neural processing units 126 computes the cell output (H) value for the twelfth time step (time step 11) of its corresponding long short-term memory cell 4600 and writes the H value to the corresponding word of row 61 of the data random access memory 122, as shown in Figure 47.
More specifically, the multiply-accumulate instruction at address 21 reads the third row past the current row of the data random access memory 122 (row 4 on the first execution, row 9 on the second, and so on through row 59 on the twelfth execution), which contains the output gate (O) values associated with the current time step, reads row 10 of the weight random access memory 124, which contains the tanh(C) values (written by the instruction at address 16), and multiplies them to generate a product added to the accumulator 202, which was just cleared by the instruction at address 20. Then the output instruction at address 22 passes the accumulator 202 value through, writes it to the appropriate output row of the data random access memory 122 (row 6 on the first execution, row 11 on the second, and so on through row 61 on the twelfth execution), and clears the accumulator 202. It should be understood that the H values written to a row of the data random access memory 122 by the instruction at address 22 (row 6 on the first execution, row 11 on the second, and so on through row 61 on the twelfth execution) are the H values consumed/read by the subsequent executions of the instructions at addresses 2, 6, 10, and 18. However, the H values written to row 61 on the twelfth execution are not consumed/read by executions of the instructions at addresses 2, 6, 10, and 18; rather, in one preferred embodiment, these values are consumed/read by the architectural program.
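Addresses 21 and 22 complete the time step with the remaining LSTM equation, H = O·tanh(C), and the row arithmetic the paragraphs above describe follows a fixed stride of five data RAM rows per time step. A sketch (row formulas inferred from the row numbers given in the text):

```python
import math

def cell_output(o_gate, c_state):
    # address 21: single product O * tanh(C); address 22 writes H and clears
    return o_gate * math.tanh(c_state)

# rows read/written across the twelve executions, stride 5 per time step
o_rows = [4 + 5 * t for t in range(12)]   # output gate (O) read rows
h_rows = [6 + 5 * t for t in range(12)]   # cell output (H) write rows
```

The final H row (row 61) feeds no further loop iteration, which is why the architectural program reads it instead.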
The loop instruction at address 23 (LOOP 1) decrements the loop counter 3804 and, if the new loop counter 3804 value is greater than zero, loops back to the instruction at address 1.
Figure 49 is a block diagram showing an embodiment of the neural network unit 121 with output buffer masking and feedback capability within neural processing unit groups. Figure 49 shows a single neural processing unit group 4901 made up of four neural processing units 126. Although Figure 49 shows a single neural processing unit group 4901, it should be understood that every neural processing unit 126 of the neural network unit 121 is contained in a neural processing unit group 4901, so there are N/J neural processing unit groups 4901, where N is the number of neural processing units 126 (for example, 512 in the wide configuration or 1024 in the narrow configuration) and J is the number of neural processing units 126 in a single group 4901 (for example, four in the embodiment of Figure 49). The four neural processing units 126 of the neural processing unit group 4901 in Figure 49 are referred to as neural processing unit 0, neural processing unit 1, neural processing unit 2, and neural processing unit 3.

Each neural processing unit of the embodiment of Figure 49 is similar to the neural processing unit 126 of Figure 7 described above, and elements with like reference numerals are similar. However, the multiplexed register 208 is modified to include four additional inputs 4905, the multiplexed register 705 is modified to include four additional inputs 4907, the select input 213 is modified to select for provision on the output 209 from among the original inputs 211 and 207 as well as the additional inputs 4905, and the select input 713 is modified to select for provision on the output 203 from among the original inputs 711 and 206 as well as the additional inputs 4907.
As shown in the figure, the row buffer 1104 of Figure 11 appears in Figure 49 as the output buffer 1104. More specifically, words 0, 1, 2, and 3 of the output buffer 1104 receive the respective outputs of the four activation function units 212 associated with neural processing units 0, 1, 2, and 3. The portion of the output buffer 1104 comprising the N words corresponding to a neural processing unit group 4901 is referred to as an output buffer word group; in the embodiment of Figure 49, N is four. These four words of the output buffer 1104 are fed back to the multiplexed registers 208 and 705, received as the four additional inputs 4905 by the multiplexed register 208 and as the four additional inputs 4907 by the multiplexed register 705. The feedback of an output buffer word group to its corresponding neural processing unit group 4901 enables an arithmetic instruction of a non-architectural program to select one or two of the words of the output buffer 1104 associated with the neural processing unit group 4901 (the output buffer word group) as its inputs; for examples, see the non-architectural program of Figure 51 described below, such as the instructions at addresses 4, 8, 11, 12, and 15. That is, the word of the output buffer 1104 specified in a non-architectural instruction determines the value produced on the select input 213/713. This capability effectively enables the output buffer 1104 to serve as a scratch pad memory, allowing a non-architectural program to reduce the number of writes to the data random access memory 122 and/or the weight random access memory 124 and subsequent reads from them, for example by reducing the intermediate values produced and used along the way. Preferably, the output buffer 1104, or row buffer 1104, comprises a one-dimensional array of registers for storing 1024 narrow words or 512 wide words. Preferably, a read of the output buffer 1104 can be performed in a single clock cycle, and a write to the output buffer 1104 can likewise be performed in a single clock cycle. Unlike the data random access memory 122 and the weight random access memory 124, which are accessible both by architectural programs and by non-architectural programs, the output buffer 1104 is not accessible by architectural programs and is accessible only by non-architectural programs.
The output buffer 1104 is modified to receive a mask input 4903. Preferably, the mask input 4903 includes four bits corresponding to the four words of the output buffer 1104 associated with the four neural processing units 126 of the neural processing unit group 4901. Preferably, if the mask input 4903 bit corresponding to a word of the output buffer 1104 is true, the word of the output buffer 1104 retains its current value; otherwise, the word of the output buffer 1104 is updated with the activation function unit 212 output. That is, if the mask input 4903 bit corresponding to a word of the output buffer 1104 is false, the activation function unit 212 output is written to that word of the output buffer 1104. This enables an output instruction of a non-architectural program to selectively write the activation function unit 212 output to some words of the output buffer 1104 while leaving the current values of the other words of the output buffer 1104 unchanged; for examples, see the instructions of the non-architectural program of Figure 51 described below, such as the instructions at addresses 6, 10, 13, and 14. That is, the words of the output buffer 1104 specified in a non-architectural instruction determine the value produced on the mask input 4903.
For simplicity of illustration, Figure 49 does not show the inputs 1811 of the multiplexed registers 208/705 (shown in Figures 18, 19, and 23). However, embodiments that support both dynamically configurable neural processing units 126 and the output buffer 1104 feedback/masking are also within the scope of the present invention. Preferably, in such embodiments, the output buffer word groups are correspondingly dynamically configurable.
It should be understood that although the number of neural processing units 126 in a neural processing unit group 4901 of this embodiment is four, the present invention is not limited in this respect; embodiments with more or fewer neural processing units 126 per group are within the scope of the invention. Furthermore, in an embodiment with shared activation function units 1112, as shown in Figure 52, the number of neural processing units 126 in a neural processing unit group 4901 and the number of neural processing units 126 in an activation function unit 212 group interact with each other. The masking and feedback capability of the output buffer 1104 within a neural processing unit group is particularly helpful for improving the computational efficiency associated with long short-term memory cells 4600, as described in more detail below with respect to Figures 50 and 51.
Figure 50 is a block diagram showing an example of the layout of data within the data random access memory 122, the weight random access memory 124, and the output buffer 1104 of the neural network unit 121 of Figure 49 as it performs the calculations associated with the layer of 128 long short-term memory cells 4600 of Figure 46. In the example of Figure 50, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, for example in the wide configuration. As in the examples of Figures 47 and 48, the long short-term memory layer in the example of Figures 50 and 51 has only 128 long short-term memory cells 4600. However, in the example of Figure 50, the values produced by all 512 neural processing units 126 (neural processing units 0 through 511) are used. When the non-architectural program of Figure 51 is executed, each neural processing unit group 4901 operates collectively as a long short-term memory cell 4600.
As shown in the figure, the data random access memory 122 holds cell input (X) and output (H) values for a series of time steps. More specifically, for a given time step, a pair of two rows holds the X values and the H values, respectively. In a data random access memory 122 having 64 rows, as shown, the cell values held can serve 31 different time steps. In the example of Figure 50, rows 2 and 3 hold the values for time step 0, rows 4 and 5 hold the values for time step 1, and so on, with rows 62 and 63 holding the values for time step 30. The first row of each pair holds the X values of the time step, and the second row holds the H values of the time step. As shown in the figure, each group of four words of the data random access memory 122 corresponding to a neural processing unit group 4901 holds the values for its corresponding long short-term memory cell 4600. That is, words 0 through 3 hold the values associated with long short-term memory cell 0, whose calculations are performed by neural processing units 0-3, i.e., by neural processing unit group 0; words 4 through 7 hold the values associated with long short-term memory cell 1, whose calculations are performed by neural processing units 4-7, i.e., by neural processing unit group 1; and so on, with words 508 through 511 holding the values associated with long short-term memory cell 127, whose calculations are performed by neural processing units 508-511, i.e., by neural processing unit group 127, as shown in Figure 51 described below. As shown in the figure, row 1 is unused, and row 0 holds the initial cell output (H) values, which in one preferred embodiment are populated with zero values by the architectural program; however, the present invention is not limited in this respect, and populating the initial cell output (H) values of row 0 with non-architectural program instructions is also within the scope of the invention.
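The pairing of data random access memory 122 rows to time steps described above is a simple affine mapping; a sketch (function name hypothetical):

```python
def data_ram_rows(time_step):
    # each time step t occupies a pair of rows starting at row 2:
    # the first holds the cell input (X) values, the second the output (H) values
    x_row = 2 + 2 * time_step
    h_row = x_row + 1
    return x_row, h_row

# a 64-row data RAM (rows 2 through 63) therefore covers 31 time steps;
# row 0 holds the initial H values and row 1 is unused
```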
Preferably, the X values (in rows 2, 4, 6, and so on through row 62) are written/populated into the data random access memory 122 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 51. Preferably, the H values (in rows 3, 5, 7, and so on through row 63) are written/populated into the data random access memory 122, and read/used, by the non-architectural program running on the neural network unit 121, as described below. Preferably, the H values are also read by the architectural program running on the processor 100 via MFNN instructions 1500. It should be noted that the non-architectural program of Figure 51 assumes that within each group of four words corresponding to a neural processing unit group 4901 (e.g., words 0-3, 4-7, 8-11, and so on through 508-511), the four X values of a given row are populated with the same value (e.g., by the architectural program). Similarly, the non-architectural program of Figure 51 computes and writes the same value to the four H words of a given row within each group of four words corresponding to a neural processing unit group 4901.
As shown in the figure, the weight random access memory 124 holds the weights, biases, and cell state (C) values needed by the neural processing units of the neural network unit 121. Within each group of four words corresponding to a neural processing unit group 4901 (e.g., words 0-3, 4-7, 8-11, and so on through 508-511): (1) the word whose index divided by 4 leaves a remainder of 3 holds the Wc, Uc, Bc, and C values in its rows 0, 1, 2, and 6, respectively; (2) the word whose index divided by 4 leaves a remainder of 2 holds the Wo, Uo, and Bo values in its rows 3, 4, and 5, respectively; (3) the word whose index divided by 4 leaves a remainder of 1 holds the Wf, Uf, and Bf values in its rows 3, 4, and 5, respectively; and (4) the word whose index divided by 4 leaves a remainder of 0 holds the Wi, Ui, and Bi values in its rows 3, 4, and 5, respectively. Preferably, the weights and biases (Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo, and Bo, in rows 0 through 5) are written/populated into the weight random access memory 124 by the architectural program running on the processor 100 via MTNN instructions 1400, and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 51. Preferably, the intermediate C values are written/populated into the weight random access memory 124, and read/used, by the non-architectural program running on the neural network unit 121, as described below.
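The placement rule above can be restated as a lookup keyed on the word index modulo 4 within a neural processing unit group; a sketch (the row/word reading of the translated layout, and the function name, are our interpretation):

```python
def weight_layout(word_index):
    # returns a {row: value-name} map for one word column of the weight RAM
    remainder = word_index % 4
    if remainder == 3:
        return {0: "Wc", 1: "Uc", 2: "Bc", 6: "C"}
    if remainder == 2:
        return {3: "Wo", 4: "Uo", 5: "Bo"}
    if remainder == 1:
        return {3: "Wf", 4: "Uf", 5: "Bf"}
    return {3: "Wi", 4: "Ui", 5: "Bi"}   # remainder == 0
```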
The example of Figure 50 assumes the architectural program performs the following steps: (1) populates the data random access memory 122 with the input X values for 31 different time steps (rows 2, 4, and so on through row 62); (2) starts the non-architectural program of Figure 51; (3) detects that the non-architectural program has finished; (4) reads the output H values from the data random access memory 122 (rows 3, 5, and so on through row 63); and (5) repeats steps (1) through (4) as many times as needed to complete a task, such as the calculations used to perform recognition of the utterance of the user of a mobile phone.
In an alternative approach, the architectural program performs the following steps: (1) populates the data random access memory 122 with the input X values for a single time step (e.g., row 2); (2) starts the non-architectural program (a modified version of the Figure 51 non-architectural program that does not require the loop and accesses only a single pair of two data random access memory 122 rows); (3) detects that the non-architectural program has finished; (4) reads the output H values from the data random access memory 122 (e.g., row 3); and (5) repeats steps (1) through (4) as many times as needed to complete a task. Which of the two approaches is preferable may depend on the manner in which the input X values of the long short-term memory layer are sampled. For example, if the task permits sampling the input over multiple time steps (e.g., on the order of 31 time steps) and performing the computation, the first approach may be preferable because it may be more computational-resource efficient and/or higher performance; however, if the task permits sampling at only a single time step, the second approach must be used.
A third embodiment is similar to the second approach described above but, unlike the second approach, which uses a single pair of two data random access memory 122 rows, the non-architectural program of this approach uses multiple pairs of rows, that is, a different pair of rows for each time step, similar in this respect to the first approach. Preferably, the architectural program of this third embodiment includes a step before step (2) in which it updates the non-architectural program before starting it, for example by updating the data random access memory 122 row of the instruction at address 1 to point to the next pair of two rows.
As shown in the figure, for the neural processing units 0 through 511 of the neural network unit 121, the output buffer 1104 holds intermediate values of the cell output (H), candidate cell state (C'), input gate (I), forget gate (F), output gate (O), cell state (C), and tanh(C) after the instructions at different addresses of the non-architectural program of Figure 51 execute. Within each output buffer word group (e.g., each group of four words of the output buffer 1104 corresponding to a neural processing unit group 4901, such as words 0-3, 4-7, 8-11, and so on through 508-511), the word whose index divided by 4 leaves a remainder of 3 is denoted OUTBUF[3], the word whose index divided by 4 leaves a remainder of 2 is denoted OUTBUF[2], the word whose index divided by 4 leaves a remainder of 1 is denoted OUTBUF[1], and the word whose index divided by 4 leaves a remainder of 0 is denoted OUTBUF[0].
As shown in the figure, after the instruction at address 2 of the non-architectural program of Figure 51 executes, for each neural processing unit group 4901, all four words of the output buffer 1104 are written with the initial cell output (H) values of the corresponding long short-term memory cell 4600. After the instruction at address 6 executes, for each neural processing unit group 4901, the OUTBUF[3] word of the output buffer 1104 is written with the candidate cell state (C') value of the corresponding long short-term memory cell 4600, and the other three words of the output buffer 1104 retain their previous values. After the instruction at address 10 executes, for each neural processing unit group 4901, the OUTBUF[0] word of the output buffer 1104 is written with the input gate (I) value of the corresponding long short-term memory cell 4600, the OUTBUF[1] word is written with the forget gate (F) value of the corresponding long short-term memory cell 4600, the OUTBUF[2] word is written with the output gate (O) value of the corresponding long short-term memory cell 4600, and the OUTBUF[3] word retains its previous value. After the instruction at address 13 executes, for each neural processing unit group 4901, the OUTBUF[3] word of the output buffer 1104 is written with the new cell state (C) value of the corresponding long short-term memory cell 4600 (it is the C value in slot 3 of the output buffer 1104 that is written to row 6 of the weight random access memory 124, as described in more detail below with respect to Figure 51), and the other three words of the output buffer 1104 retain their previous values. After the instruction at address 14 executes, for each neural processing unit group 4901, the OUTBUF[3] word of the output buffer 1104 is written with the tanh(C) value of the corresponding long short-term memory cell 4600, and the other three words of the output buffer 1104 retain their previous values. After the instruction at address 16 executes, for each neural processing unit group 4901, all four words of the output buffer 1104 are written with the new cell output (H) values of the corresponding long short-term memory cell 4600. The flow from address 6 through address 16 (that is, excluding the execution of address 2, since address 2 is not part of the program loop) repeats thirty more times as the loop at address 17 returns to address 3.
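The sequence of output buffer updates described above can be traced as a series of masked writes; a symbolic sketch (values are labels, not numbers, and the schedule structure is illustrative):

```python
# per instruction address: (write-enable per word, symbolic value written)
schedule = [
    (2,  [True, True, True, True],    "H_init"),   # initial H to all four words
    (6,  [False, False, False, True], "C'"),       # only OUTBUF[3]
    (10, [True, True, True, False],   "I/F/O"),    # I, F, O to words 0-2
    (13, [False, False, False, True], "C"),
    (14, [False, False, False, True], "tanh(C)"),
    (16, [True, True, True, True],    "H"),        # new H to all four words
]

buf = [None] * 4
trace = []
for _addr, enable, value in schedule:
    buf = [value if en else cur for cur, en in zip(buf, enable)]
    trace.append(list(buf))
```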
Figure 51 is a table showing a program stored in the program memory 129 of the neural network unit 121; the program is executed by the neural network unit 121 of Figure 49 and uses data and weights according to the arrangement of Figure 50 to perform the calculations associated with a long short-term memory cell layer. The example program of Figure 51 includes 18 non-architectural instructions at addresses 0 through 17, respectively. The instruction at address 0 is an initialize instruction that clears the accumulator 202 and initializes the loop counter 3804 to the value 31 to perform the loop body (the instructions at addresses 3 through 17) 31 times. The initialize instruction also initializes the data random access memory 122 row to be written (e.g., the register 2606 of Figures 26 and 39) to the value 1, and this value increases to 3 after the first execution of the instruction at address 16. Preferably, the initialize instruction also puts the neural network unit 121 in the wide configuration, so that the neural network unit 121 is configured with 512 neural processing units 126. As described in the following paragraphs, during execution of the instructions at addresses 0 through 17, the 128 neural processing unit groups 4901 made up of the 512 neural processing units 126 operate as 128 corresponding long short-term memory cells 4600.

The instructions at addresses 1 and 2 are outside the program loop and execute only once. They generate the initial cell output (H) value (e.g., a zero value) and write it to all words of the output buffer 1104. The instruction at address 1 reads the initial H values from row 0 of the data random access memory 122 and places them in the accumulator 202, which was cleared by the instruction at address 0. The instruction at address 2 (OUTPUT PASSTHRU, NOP, CLR ACC) passes the accumulator 202 value through to the output buffer 1104, as shown in Figure 50. The "NOP" designation in the output instruction at address 2 (and in other output instructions of Figure 51) indicates that the output value is written only to the output buffer 1104 and not to memory, that is, not to the data random access memory 122 or the weight random access memory 124. The instruction at address 2 also clears the accumulator 202.
The instructions at addresses 3 through 17 are within the loop body and execute a number of times indicated by the loop count (e.g., 31).
Each execution of the instructions at addresses 3 through 6 computes the tanh(C') value of the current time step and writes it to word OUTBUF[3], which will be used by the instruction at address 11. More specifically, the multiply-accumulate instruction at address 3 reads the cell input (X) value associated with the current time step from the current read row of the data random access memory 122 (e.g., row 2, 4, 6, and so forth to row 62), reads the Wc value from row 0 of the weight random access memory 124, and multiplies them to generate a product that is added to the accumulator 202, which was cleared by the instruction at address 2.
The multiply-accumulate instruction at address 4 (MULT-ACCUM OUTBUF[0], WR ROW 1) reads the H value from word OUTBUF[0] (i.e., all four neural processing units 126 of the neural processing unit group 4901 do so), reads the Uc value from row 1 of the weight random access memory 124, and multiplies them to generate a second product that is added to the accumulator 202.
The add-weight-word-to-accumulator instruction at address 5 (ADD_W_ACC WR ROW 2) reads the Bc value from row 2 of the weight random access memory 124 and adds it to the accumulator 202.
The output instruction at address 6 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) performs a tanh activation function on the accumulator 202 value, and the result is written only to word OUTBUF[3] (i.e., only the neural processing unit 126 of the neural processing unit group 4901 whose index mod 4 is 3 writes its result); additionally, the accumulator 202 is cleared. That is, the output instruction at address 6 masks words OUTBUF[0], OUTBUF[1] and OUTBUF[2] (as indicated by the instruction nomenclature MASK[0:2]), leaving them with their current values, as shown in Figure 50. Furthermore, the output instruction at address 6 does not write to memory (as indicated by the instruction nomenclature NOP).
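The arithmetic performed by the instructions at addresses 3 through 6 for one LSTM cell can be sketched in plain Python, using scalar values for a single cell; the function name is illustrative, not from the specification:

```python
import math

def candidate_cell_state(X, H, Wc, Uc, Bc):
    # Address 3: product X*Wc accumulated into the cleared accumulator.
    acc = X * Wc
    # Address 4: add the product H*Uc (H is read back from OUTBUF[0]).
    acc += H * Uc
    # Address 5: add the bias word Bc from weight RAM row 2.
    acc += Bc
    # Address 6: tanh activation; the result is written only to OUTBUF[3].
    return math.tanh(acc)
```

With X = H = 0 the function returns tanh(Bc), matching the degenerate case in which only the bias contributes.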
Each execution of the instructions at addresses 7 through 10 computes the input gate (I), forget gate (F) and output gate (O) values of the current time step and writes them to words OUTBUF[0], OUTBUF[1] and OUTBUF[2], respectively, where they will be used by the instructions at addresses 11, 12 and 15. More specifically, the multiply-accumulate instruction at address 7 reads the cell input (X) value associated with the current time step from the current read row of the data random access memory 122 (e.g., row 2, 4, 6, and so forth to row 62), reads the Wi, Wf and Wo values from row 3 of the weight random access memory 124, and multiplies them to generate a product that is added to the accumulator 202, which was cleared by the instruction at address 6. More specifically, within the neural processing unit group 4901, the neural processing unit 126 whose index mod 4 is 0 computes the product of X and Wi, the neural processing unit 126 whose index mod 4 is 1 computes the product of X and Wf, and the neural processing unit 126 whose index mod 4 is 2 computes the product of X and Wo.
The multiply-accumulate instruction at address 8 reads the H value from word OUTBUF[0] (i.e., all four neural processing units 126 of the neural processing unit group 4901 do so), reads the Ui, Uf and Uo values from row 4 of the weight random access memory 124, and multiplies them to generate a second product that is added to the accumulator 202. More specifically, within the neural processing unit group 4901, the neural processing unit 126 whose index mod 4 is 0 computes the product of H and Ui, the neural processing unit 126 whose index mod 4 is 1 computes the product of H and Uf, and the neural processing unit 126 whose index mod 4 is 2 computes the product of H and Uo.
The add-weight-word-to-accumulator instruction at address 9 (ADD_W_ACC WR ROW 5) reads the Bi, Bf and Bo values from row 5 of the weight random access memory 124 and adds them to the accumulator 202. More specifically, within the neural processing unit group 4901, the neural processing unit 126 whose index mod 4 is 0 performs the addition of the Bi value, the neural processing unit 126 whose index mod 4 is 1 performs the addition of the Bf value, and the neural processing unit 126 whose index mod 4 is 2 performs the addition of the Bo value.
The output instruction at address 10 (OUTPUT SIGMOID, NOP, MASK[3], CLR ACC) performs a sigmoid activation function on the accumulator 202 value and writes the computed I, F and O values to words OUTBUF[0], OUTBUF[1] and OUTBUF[2], respectively; the instruction also clears the accumulator 202 and does not write to memory. That is, the output instruction at address 10 masks word OUTBUF[3] (as indicated by the instruction nomenclature MASK[3]), leaving it with its current value (namely C'), as shown in Figure 50.
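The gate arithmetic of addresses 7 through 10 can be sketched for one cell as follows, again with scalar values; the function names are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gates(X, H, Wi, Wf, Wo, Ui, Uf, Uo, Bi, Bf, Bo):
    # Addresses 7-9: three NPUs of the group accumulate, in parallel,
    # X*W + H*U + B for the input, forget, and output gates.
    I = sigmoid(X * Wi + H * Ui + Bi)
    F = sigmoid(X * Wf + H * Uf + Bf)
    O = sigmoid(X * Wo + H * Uo + Bo)
    # Address 10: sigmoid applied; results land in OUTBUF[0..2].
    return I, F, O
```

With all inputs zero, each gate evaluates to sigmoid(0) = 0.5.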
Each execution of the instructions at addresses 11 through 13 computes the new cell state (C) value generated by the current time step and writes it to row 6 of the weight random access memory 124 for use in the next time step (i.e., for use by the instruction at address 12 on the next iteration of the loop); more specifically, the value is written to the word of row 6 whose index mod 4 is 3 among the four words corresponding to the neural processing unit group 4901. Additionally, each execution of the instruction at address 14 writes the tanh(C) value to OUTBUF[3] for use by the instruction at address 15.
More specifically, the multiply-accumulate instruction at address 11 (MULT-ACCUM OUTBUF[0], OUTBUF[3]) reads the input gate (I) value from word OUTBUF[0], reads the candidate cell state (C') value from word OUTBUF[3], and multiplies them to generate a first product that is added to the accumulator 202, which was cleared by the instruction at address 10. More specifically, each of the four neural processing units 126 of the neural processing unit group 4901 computes the first product of the I value and the C' value.
The multiply-accumulate instruction at address 12 (MULT-ACCUM OUTBUF[1], WR ROW 6) instructs the neural processing units 126 to read the forget gate (F) value from word OUTBUF[1], to read their respective words from row 6 of the weight random access memory 124, and to multiply them to generate a second product that is added in the accumulator 202 to the first product generated by the instruction at address 11. More specifically, for the neural processing unit 126 of the neural processing unit group 4901 whose index mod 4 is 3, the word read from row 6 is the current cell state (C) value computed in the previous time step, so that the sum of the first and second products is the new cell state (C). For the other three neural processing units 126 of the neural processing unit group 4901, however, the word read from row 6 is a don't-care value, since the accumulated values they generate will not be used, i.e., will not be placed into the output buffer 1104 by the instructions at addresses 13 and 14, and will be cleared by the instruction at address 14. That is, only the new cell state (C) value generated by the neural processing unit 126 of the neural processing unit group 4901 whose index mod 4 is 3 will be used, namely by the instructions at addresses 13 and 14. With respect to the second through thirty-first executions of the instruction at address 12, the C value read from row 6 of the weight random access memory 124 is the value written by the instruction at address 13 on the previous iteration of the loop body. For the first execution of the instruction at address 12, however, the C value in row 6 is an initial value written either by the architectural program before it starts the non-architectural program of Figure 51 or by a modified version of the non-architectural program.
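The cell-state update performed by the instructions at addresses 11 and 12 can be sketched for one cell as (function name illustrative):

```python
def new_cell_state(I, C_candidate, F, C_prev):
    # Address 11: first product I * C' into the cleared accumulator.
    acc = I * C_candidate
    # Address 12: second product F * C_prev (C_prev is read from
    # weight RAM row 6) added to the accumulator.
    acc += F * C_prev
    # Address 13 then writes this new C back to row 6 and OUTBUF[3].
    return acc
```

This is the familiar LSTM update C = I·C' + F·C, computed here by the lane of the group whose index mod 4 is 3.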
The output instruction at address 13 (OUTPUT PASSTHRU, WR ROW 6, MASK[0:2]) passes the accumulator 202 value, i.e., the computed C value, through only to word OUTBUF[3] (i.e., only the neural processing unit 126 of the neural processing unit group 4901 whose index mod 4 is 3 writes its computed C value to the output buffer 1104), and row 6 of the weight random access memory 124 is written with the updated output buffer 1104, as shown in Figure 50. That is, the output instruction at address 13 masks words OUTBUF[0], OUTBUF[1] and OUTBUF[2], leaving them with their current values (namely the I, F and O values). As described above, only the C value in the word of row 6 whose index mod 4 is 3 among each group of four words corresponding to a neural processing unit group 4901 is used, namely by the instruction at address 12; hence, the non-architectural program does not care about the values in columns 0-2, 4-6, and so forth to columns 508-510 of row 6 of the weight random access memory 124 (which are I, F and O values), as shown in Figure 50.
The output instruction at address 14 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) performs a tanh activation function on the accumulator 202 value and writes the computed tanh(C) value to word OUTBUF[3]; the instruction also clears the accumulator 202 and does not write to memory. Like the output instruction at address 13, the output instruction at address 14 masks words OUTBUF[0], OUTBUF[1] and OUTBUF[2], leaving them with their original values, as shown in Figure 50.
Each execution of the instructions at addresses 15 through 16 computes the cell output (H) value generated by the current time step and writes it to the row of the data random access memory 122 that is two past the current output row, from which it will be read by the architectural program and used for the next time step (i.e., by the instructions at addresses 3 and 7 on the next iteration of the loop). More specifically, the multiply-accumulate instruction at address 15 reads the output gate (O) value from word OUTBUF[2], reads the tanh(C) value from word OUTBUF[3], and multiplies them to generate a product that is added to the accumulator 202, which was cleared by the instruction at address 14. More specifically, each of the four neural processing units 126 of the neural processing unit group 4901 computes the product of the O value and tanh(C).
The output instruction at address 16 passes the accumulator 202 value through and writes the computed H values to row 3 on its first execution, to row 5 on its second execution, and so forth to row 63 on its thirty-first execution, as shown in Figure 50; these values are subsequently used by the instructions at addresses 4 and 8. Additionally, as shown in Figure 50, the computed H values are placed into the output buffer 1104 for subsequent use by the instructions at addresses 4 and 8. The output instruction at address 16 also clears the accumulator 202. In one embodiment, the LSTM cell 4600 is designed such that the output instruction at address 16 (and/or the output instruction at address 22 of Figure 48) has an activation function, e.g., sigmoid or hyperbolic tangent, rather than passing the accumulator 202 value through.
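The final step of addresses 15 and 16 — producing the cell output from the output gate and the hyperbolic tangent of the cell state — can be sketched for one cell as (function name illustrative):

```python
import math

def cell_output(O, C):
    # Address 15: product of the output gate O and tanh(C)
    # (tanh(C) was produced by the instruction at address 14
    # and read from OUTBUF[3]).
    return O * math.tanh(C)
    # Address 16 would then write this H value to the data RAM row
    # for the time step and into the output buffer for the next step.
```

This completes the standard LSTM recurrence H = O·tanh(C) for the time step.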
The loop instruction at address 17 decrements the loop counter 3804 and loops back to the instruction at address 3 if the new loop counter 3804 value is greater than zero.
It may thus be observed that, by virtue of the output buffer 1104 feedback and masking capability of the neural network unit 121 embodiment of Figure 49, the number of instructions in the loop body of the non-architectural program of Figure 51 is reduced by approximately 34% relative to the non-architectural instructions of Figure 48. Additionally, by virtue of the same feedback and masking capability, the number of time steps accommodated by the memory arrangement in the data random access memory 122 used by the non-architectural program of Figure 51 is approximately three times that of Figure 48. These improvements may be helpful for certain architectural program applications that employ the neural network unit 121 to perform LSTM cell layer computations, particularly applications in which the number of LSTM cells 4600 in the LSTM cell layer is less than or equal to 128.
The embodiments of Figures 47 through 51 assume that the weight and bias values remain the same across time steps. However, the invention is not limited in this respect; embodiments in which the weight and bias values change with the time step are also within its scope, wherein, rather than being populated with a single set of weight and bias values as shown in Figures 47 through 50, the weight random access memory 124 is populated with a different set of weight and bias values for each time step, and the weight random access memory 124 addresses of the non-architectural programs of Figures 48 through 51 are adjusted accordingly.
Generally speaking, in the embodiments of Figures 47 through 51 described above, the weight, bias and intermediate values (e.g., the C and C' values) are stored in the weight random access memory 124, while the input and output values (e.g., the X and H values) are stored in the data random access memory 122. This characteristic may be advantageous for embodiments in which the data random access memory 122 is dual-ported and the weight random access memory 124 is single-ported, since there is more traffic from the non-architectural program and the architectural program to the data random access memory 122. However, because the weight random access memory 124 is larger, in another embodiment of the invention the memories to which the non-architectural and architectural programs write their values are swapped (i.e., the data random access memory 122 and the weight random access memory 124 are interchanged). That is, the W, U, B, C', tanh(C) and C values are stored in the data random access memory 122 and the X, H, I, F and O values are stored in the weight random access memory 124 (a modified embodiment of Figure 47); and the W, U, B and C values are stored in the data random access memory 122 and the X and H values are stored in the weight random access memory 124 (a modified embodiment of Figure 50). Because the weight random access memory 124 is larger, these embodiments can process more time steps in a batch. This characteristic may be advantageous for certain applications whose architectural programs perform computations using the neural network unit 121, that can benefit from the larger number of time steps, and for which the single-ported memory (e.g., the weight random access memory 124) provides sufficient bandwidth.
Figure 52 is a block diagram showing an embodiment of the neural network unit 121 in which the output buffer masking and feedback capability exists within the neural processing unit groups and a shared activation function unit 1112 is employed. The neural network unit 121 of Figure 52 is similar to the neural network unit 121 of Figure 47, and elements bearing the same reference numerals are also similar. However, the four activation function units 212 of Figure 49 are replaced by a single shared activation function unit 1112 that receives the four outputs 217 of the four accumulators 202 and generates four outputs to words OUTBUF[0], OUTBUF[1], OUTBUF[2] and OUTBUF[3]. The neural network unit 121 of Figure 52 operates in a manner similar to the embodiments described above with respect to Figures 49 through 51, and operates its shared activation function unit 1112 in a manner similar to the embodiments described above with respect to Figures 11 through 13.
Figure 53 is a block diagram showing another embodiment of the arrangement of data in the data random access memory 122, the weight random access memory 124 and the output buffer 1104 of the neural network unit 121 of Figure 49 as it performs the calculations associated with a layer of 128 LSTM cells 4600 of Figure 46. The example of Figure 53 is similar to the example of Figure 50. However, in Figure 53, the Wi, Wf and Wo values are located in row 0 (rather than in row 3 as in Figure 50); the Ui, Uf and Uo values are located in row 1 (rather than in row 4); the Bi, Bf and Bo values are located in row 2 (rather than in row 5); and the C values are located in row 3 (rather than in row 6). Additionally, the contents of the output buffer 1104 of Figure 53 are similar to those of Figure 50; however, because of the differences between the non-architectural programs of Figure 54 and Figure 51, the contents of the third column (i.e., the I, F, O and C' values) appear in the output buffer 1104 after the execution of the instruction at address 7 (rather than the instruction at address 10 as in Figure 50); the contents of the fourth column (i.e., the I, F, O and C values) appear after the execution of the instruction at address 10 (rather than address 13); the contents of the fifth column (i.e., the I, F, O and tanh(C) values) appear after the execution of the instruction at address 11 (rather than address 14); and the contents of the sixth column (i.e., the H values) appear after the execution of the instruction at address 13 (rather than address 16), as described below.
Figure 54 is a table showing a program stored in the program memory 129 of the neural network unit 121 that is executed by the neural network unit 121 of Figure 49 and uses data and weights according to the arrangement of Figure 53 to perform the calculations associated with an LSTM cell layer. The example program of Figure 54 is similar to the program of Figure 51. More specifically, the instructions at addresses 0 through 5 are identical in Figure 54 and Figure 51; the instructions at addresses 7 and 8 of Figure 54 are the same as the instructions at addresses 10 and 11 of Figure 51; and the instructions at addresses 10 through 14 of Figure 54 are the same as the instructions at addresses 13 through 17 of Figure 51.
However, the instruction at address 6 of Figure 54 does not clear the accumulator 202 (whereas the instruction at address 6 of Figure 51 does). Additionally, the instructions at addresses 7 through 9 of Figure 51 are not present in the non-architectural program of Figure 54. Finally, the instruction at address 9 of Figure 54 is the same as the instruction at address 12 of Figure 51, except that the instruction at address 9 of Figure 54 reads row 3 of the weight random access memory 124, whereas the instruction at address 12 of Figure 51 reads row 6 of the weight random access memory.
As a result of the differences between the non-architectural program of Figure 54 and that of Figure 51, the arrangement of Figure 53 uses three fewer rows of the weight random access memory 124, and the program loop includes three fewer instructions. The loop body size of the non-architectural program of Figure 54 is essentially half the loop body size of the non-architectural program of Figure 48, and approximately 80% of the loop body size of the non-architectural program of Figure 51.
Figure 55 is a block diagram showing portions of a neural processing unit 126 according to another embodiment of the invention. More specifically, for a single one of the neural processing units 126 of Figure 49, the figure shows the multiplexed register (mux-reg) 208 and its associated inputs 207, 211 and 4905, and the mux-reg 705 and its associated inputs 206, 711 and 4907. In addition to the inputs of Figure 49, the mux-reg 208 and the mux-reg 705 of the neural processing unit 126 each receive an index-within-group (index_within_group) input 5599. The index_within_group input 5599 indicates the index of the particular neural processing unit 126 within its neural processing unit group 4901. Thus, for example, in an embodiment in which each neural processing unit group 4901 has four neural processing units 126, within each neural processing unit group 4901 one of the neural processing units 126 receives a value of zero on its index_within_group input 5599, one receives a value of one, one receives a value of two, and one receives a value of three. In other words, the index_within_group input 5599 value received by a neural processing unit 126 is its index within the neural network unit 121 modulo J, where J is the number of neural processing units 126 in a neural processing unit group 4901. Thus, for example, neural processing unit 73 receives a value of one on its index_within_group input 5599, neural processing unit 353 receives a value of three, and neural processing unit 6 receives a value of two.
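The modulo relationship just described can be sketched as (function name illustrative):

```python
J = 4  # neural processing units per group in this embodiment

def index_within_group(npu_index):
    # The value presented on input 5599: the NPU's index within the
    # neural network unit, modulo the group size J.
    return npu_index % J
```

For example, `index_within_group(73)` yields 1 and `index_within_group(6)` yields 2, consistent with the examples in the text.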
Additionally, when the control input 213 specifies a predetermined value, denoted here as "SELF", the mux-reg 208 selects the output buffer 1104 output 4905 that corresponds to the index_within_group input 5599 value. Thus, when a non-architectural instruction specifies with a SELF value that data be received from the output buffer 1104 (denoted OUTBUF[SELF] in the instructions at addresses 2 and 7 of Figure 57), the mux-reg 208 of each neural processing unit 126 receives its corresponding word from the output buffer 1104. Thus, for example, when the neural network unit 121 executes the non-architectural instructions at addresses 2 and 7 of Figure 57, the mux-reg 208 of neural processing unit 73 selects the second (index 1) of its four inputs 4905 to receive word 73 of the output buffer 1104, the mux-reg 208 of neural processing unit 353 selects the fourth (index 3) of its four inputs 4905 to receive word 353 of the output buffer 1104, and the mux-reg 208 of neural processing unit 6 selects the third (index 2) of its four inputs 4905 to receive word 6 of the output buffer 1104. Although this is not used in the non-architectural program of Figure 57, a non-architectural instruction could likewise specify with a SELF value (OUTBUF[SELF]) that data be received from the output buffer 1104, causing the control input 713 to specify the predetermined value so that the mux-reg 705 of each neural processing unit 126 receives its corresponding word from the output buffer 1104.
Figure 56 is a block diagram showing an example of the arrangement of data in the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 as it performs the calculations associated with the Jordan recurrent neural network of Figure 43, but employing the embodiment of Figure 55. The arrangement of the weights in the weight random access memory 124 is the same as in the example of Figure 44. The arrangement of the values in the data random access memory 122 is similar to that of the example of Figure 44, except that in this example each time step has a corresponding pair of two rows of memory that hold the input layer node D values and the output layer node Y values, rather than a set of four rows as in the example of Figure 44. That is, in this example, the hidden layer Z values and the content layer C values are not written to the data random access memory 122. Rather, the output buffer 1104 serves as a scratchpad for the hidden layer Z values and the content layer C values, as described in detail with respect to the non-architectural program of Figure 57. The OUTBUF[SELF] feedback feature of the output buffer 1104 described above may make the operation of the non-architectural program faster (by replacing two writes to and two reads from the data random access memory 122 with two writes to and two reads from the output buffer 1104) and reduces the data random access memory 122 space used by each time step, which enables the data held by the data random access memory 122 of the present embodiment to be used for approximately twice as many time steps as the embodiment of Figures 44 and 45, namely 32 time steps, as shown in the figure.
Figure 57 is a table showing a program stored in the program memory 129 of the neural network unit 121 that is executed by the neural network unit 121 and uses data and weights according to the arrangement of Figure 56 to accomplish a Jordan recurrent neural network. The non-architectural program of Figure 57 is similar to the non-architectural program of Figure 45, with differences as described below.
The example program of Figure 57 includes 12 non-architectural instructions located at addresses 0 through 11, respectively. The initialize instruction at address 0 clears the accumulator 202 and initializes the loop counter 3804 to a value of 32 in order to perform 32 executions of the loop body (the instructions at addresses 2 through 11). The output instruction at address 1 puts the zero values of the accumulator 202 (which was cleared by the instruction at address 0) into the output buffer 1104. It may be observed that, during the execution of the instructions at addresses 2 through 6, the 512 neural processing units 126 correspond to and operate as the 512 hidden layer nodes Z, and during the execution of the instructions at addresses 7 through 10, they correspond to and operate as the 512 output layer nodes Y. That is, the 32 executions of the instructions at addresses 2 through 6 compute the hidden layer node Z values of the 32 corresponding time steps and put them into the output buffer 1104 for use by the corresponding 32 executions of the instructions at addresses 7 through 9 to compute the output layer nodes Y of the 32 corresponding time steps and write them to the data random access memory 122, and for use by the corresponding 32 executions of the instruction at address 10 to put the content layer nodes C of the 32 corresponding time steps into the output buffer 1104. (The content layer nodes C of the 32nd time step put into the output buffer 1104 are not used.)
On the first execution of the instructions at addresses 2 and 3 (ADD_D_ACC OUTBUF[SELF] and ADD_D_ACC ROTATE, COUNT=511), each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 content node C values of the output buffer 1104, which were generated and written by the execution of the instructions at addresses 0 through 1. On the second and subsequent executions of the instructions at addresses 2 and 3, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 content node C values of the output buffer 1104, which were generated and written by the execution of the instructions at addresses 7 through 8 and 10. More specifically, the instruction at address 2 instructs the mux-reg 208 of each neural processing unit 126 to select its corresponding output buffer 1104 word and add it to the accumulator 202, as described above; the instruction at address 3 instructs the neural processing units 126 to rotate the content node C values in the 512-word rotater formed by the collective operation of the connected mux-regs 208 of the 512 neural processing units, which enables each neural processing unit 126 to accumulate all 512 content node C values into its accumulator 202. The instruction at address 3 does not clear the accumulator 202, so that the instructions at addresses 4 and 5 can add the input layer node D values (multiplied by their corresponding weights) to the content layer node C values accumulated by the instructions at addresses 2 and 3.
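The rotate-and-accumulate pattern of addresses 2 and 3 can be sketched with a small ring of values (512 words in the text; a small N is used here purely for illustration):

```python
# Sketch: each NPU first adds its own output-buffer word
# (OUTBUF[SELF]), then the ring of mux-regs rotates N-1 more times so
# that every NPU accumulates all N content node C values.
N = 8
outbuf = [float(i) for i in range(N)]  # content node C values

acc = [0.0] * N
ring = list(outbuf)                    # mux-reg contents
for step in range(N):                  # 1 self-add + N-1 rotations
    for npu in range(N):
        acc[npu] += ring[npu]
    ring = ring[1:] + ring[:1]         # collective rotate
# Every accumulator now holds the sum of all N values.
```

After the loop, every element of `acc` equals the sum of all the output-buffer words, which is the effect the two instructions achieve for 512 words in hardware.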
On each execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW +2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW +1, COUNT=511), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in the row of the data random access memory 122 associated with the current time step (e.g., row 0 for time step 0, row 2 for time step 1, and so forth to row 62 for time step 31) by the weights of the column of rows 0 through 511 of the weight random access memory 124 that corresponds to the neural processing unit 126, to generate 512 products that, together with the accumulation of the 512 content node C values performed by the instructions at addresses 2 and 3, are accumulated into the accumulator 202 of the corresponding neural processing unit 126 to compute the hidden node Z layer values.
On each execution of the instruction at address 6 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to their corresponding words of the output buffer 1104, and the accumulator 202 is cleared.
During the execution of the instructions at addresses 7 and 8 (MULT-ACCUM OUTBUF[SELF], WR ROW 512 and MULT-ACCUM ROTATE, WR ROW +1, COUNT=511), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in the output buffer 1104 (which were generated and written by the corresponding execution of the instructions at addresses 2 through 6) by the weights of the column of rows 512 through 1023 of the weight random access memory 124 that corresponds to the neural processing unit 126, to generate 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126.
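The multiply-accumulate-with-rotation pattern of addresses 7 and 8 can be sketched with a small N (512 in the text; the weight values here are arbitrary illustrative data):

```python
# Sketch: each NPU multiplies the N hidden node Z values, rotated
# through the mux-reg ring, by successive rows of its own weight
# column (WR ROW +1 advances the weight row each step).
N = 4
Z = [1.0, 2.0, 3.0, 4.0]                                 # hidden node values
W = [[float((r + c) % 3) for c in range(N)] for r in range(N)]  # weight rows

acc = [0.0] * N
ring = list(Z)
for step in range(N):
    for npu in range(N):
        acc[npu] += ring[npu] * W[step][npu]
    ring = ring[1:] + ring[:1]                            # collective rotate
# acc[npu] = sum over step of Z[(npu + step) % N] * W[step][npu]
```

This is equivalent to a matrix-vector product in which the operand index is rotated relative to the weight row, which is how the weight rows would be laid out for the hardware rotater.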
In each execution of the instruction at address 9 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW+2), an activation function (for example, a hyperbolic tangent, sigmoid, or rectify function) is performed on the 512 accumulated values to compute the output node Y values, which are written to the row of data random access memory 122 corresponding to the current time step (for example: row 1 for time step 0, row 3 for time step 1, and so on, through row 63 for time step 31). The instruction at address 9 does not clear the accumulator 202.
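The alternating data-RAM row layout used by this program (inputs in even rows, outputs in the following odd rows) and the address-9 activation step can be stated directly (Python sketch; the helper names are illustrative, not from the patent):

```python
import numpy as np

def input_row(t):
    """Data-RAM 122 row holding the input node D values for time step t."""
    return 2 * t        # row 0 for step 0, row 2 for step 1, ..., row 62 for step 31

def output_row(t):
    """Data-RAM 122 row receiving the output node Y values for time step t."""
    return 2 * t + 1    # row 1 for step 0, row 3 for step 1, ..., row 63 for step 31

# The address-9 instruction applies the activation function (here tanh; sigmoid
# or a rectify function are the other options named above) to the accumulated
# values without clearing the accumulators.
acc = np.array([0.5, -2.0, 3.0])
Y = np.tanh(acc)
```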
In each execution of the instruction at address 10 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 values accumulated by the instructions of addresses 7 and 8 are placed into output buffer 1104 for use by the next execution of the instructions of addresses 2 and 3, and the accumulators 202 are cleared.
The loop instruction at address 11 decrements the loop counter 3804 value and, if the new loop counter 3804 value is still greater than zero, indicates a return to the instruction at address 2.
As described in the sections corresponding to Figure 44, in the example of executing a Jordan recurrent neural network using the non-architectural program of Figure 57, although an activation function is applied to the accumulator 202 values to produce the output node layer Y values, this example assumes that the accumulator 202 values, rather than the actual output node Y values, are passed to the content node layer C before the activation function is applied. However, for a Jordan recurrent neural network in which the activation function is applied to the accumulator 202 values to produce the content node layer C values, the instruction at address 10 would be removed from the non-architectural program of Figure 57. In the embodiments described herein, the Elman or Jordan recurrent neural networks have a single hidden node layer (e.g., Figures 40 and 42); however, it should be understood that embodiments of the processor 100 and neural network unit 121 may, in a manner similar to that described herein, efficiently perform the computations associated with recurrent neural networks having multiple hidden layers.
As described above in the sections corresponding to Figure 2, each neural processing unit 126 operates as a neuron in an artificial neural network, and all of the neural processing units 126 of the neural network unit 121 efficiently compute, in a massively parallel fashion, the neuron output values of a level of the network. The parallel processing of the neural network unit, in particular the rotator collectively formed by the neural processing unit multitask buffers, is not an approach that the traditional way of computing neuron layer outputs would intuitively suggest. More specifically, traditional approaches typically involve the computations associated with a single neuron, or a very small subset of neurons (for example, performing the multiplications and additions using parallel arithmetic units), then proceed to the computations associated with the next neuron of the same level, and so on in serial fashion until the computations have been completed for all of the neurons in the level. By contrast, in the present invention, during each clock cycle all of the neural processing units 126 (neurons) of the neural network unit 121 perform in parallel a small set of the computations needed to produce all of the neuron outputs (for example, a single multiply and accumulate). After approximately M clock cycles, where M is the number of connection inputs to the current level, the neural network unit 121 will have computed the outputs of all of the neurons. For many artificial neural network configurations, because of the large number of neural processing units 126, the neural network unit 121 can compute the neuron output values for all of the neurons of the entire level after M clock cycles. As described herein, this computation is efficient for all types of artificial neural networks, including but not limited to feedforward and recurrent neural networks such as Elman, Jordan, and long short-term memory networks. Finally, although in the embodiments herein the neural network unit 121 is configured with 512 neural processing units 126 (for example, taking the wide-word configuration) to perform the recurrent neural network computations, the present invention is not limited thereto; embodiments that configure the neural network unit 121 with 1024 neural processing units 126 (for example, taking the narrow-word configuration) to perform the recurrent neural network computations, as well as neural network units 121 with quantities of neural processing units 126 other than the aforementioned 512 and 1024, also fall within the scope of the present invention.
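Assuming one multiply-accumulate per processing unit per clock cycle, the throughput contrast drawn in this passage reduces to a simple count (Python; the function names and the 4-wide serial datapath are illustrative assumptions, not figures from the patent):

```python
def nnu_cycles(m):
    """Approximate cycles for the neural network unit: all N processing
    units work in parallel, consuming one connection input per cycle, so a
    level with m connection inputs per neuron takes about m cycles
    regardless of the neuron count (up to N neurons)."""
    return m

def serial_cycles(num_neurons, m, parallel_mults):
    """Rough cycles for the traditional approach: neurons are computed one
    after another, each using a small set of parallel arithmetic units."""
    return num_neurons * (m // parallel_mults)

# 512 neurons with 512 connection inputs each: about 512 cycles on the
# neural network unit versus about 65536 on a 4-wide serial datapath.
```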
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the scope of the invention; all equivalent changes and modifications made in accordance with the claims and the description of the invention remain within the scope covered by the patent of the present invention. For example, software can perform the functions, fabrication, modeling, simulation, description and/or testing of the apparatus and methods described herein. This can be accomplished through general programming languages (such as C, C++), hardware description languages (HDL) including Verilog HDL, VHDL, and so forth, or other available programs. Such software can be disposed on any known computer-usable medium, such as magnetic tape, semiconductor, magnetic disk, optical disc (such as CD-ROM, DVD-ROM), network connection, wireless, or other communication medium. Embodiments of the apparatus and methods described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (for example, embodied in a hardware description language), and transformed to hardware through the fabrication of integrated circuits. In addition, the apparatus and methods described herein may include a combination of hardware and software. Therefore, none of the embodiments described herein are intended to limit the scope of the present invention. In addition, the present invention may be applied to a microprocessor apparatus of a general-purpose computer. Finally, those of ordinary skill in the art may, based on the concepts and embodiments disclosed herein, design and adjust different structures to achieve the same purposes without departing from the scope of the present invention.
Claims (21)
1. A device, characterized in that it comprises:
an array of N processing units, each processing unit comprising:
an accumulator, having an output;
an arithmetic unit, having first, second and third inputs, performing an operation thereon to produce a result stored to the accumulator, the first input receiving the output of the accumulator;
a weight input, received by the second input to the arithmetic unit; and
a multitask buffer, having first and second data inputs, an output and a control input, the output being received by the third input to the arithmetic unit, the control input controlling the selection between the first and second data inputs;
wherein the output of the multitask buffer is received by the second data input of the multitask buffer of an adjacent processing unit, and when the control inputs select the second data input, the multitask buffers of the N processing units collectively operate as a rotator of N words;
a first memory, holding W rows of N weight words, wherein the N weight words of one of the W rows are provided to the corresponding weight inputs of the N processing units of the processing unit array; and
a second memory, holding D rows of N data words, wherein the N data words of one of the D rows are provided to the corresponding first data inputs of the multitask buffers of the N processing units of the processing unit array.
2. The device according to claim 1, characterized in that the operation performed by the arithmetic unit multiplies the second and third inputs to produce a product, and accumulates the product with the first input to produce the result.
3. The device according to claim 1, characterized in that the operation performed by the arithmetic unit produces the difference of the second and third inputs, and accumulates the difference with the first input to produce the result.
4. The device according to claim 1, characterized in that the device is comprised in a processor, the processor has an instruction set, and an architectural instruction of the instruction set instructs the processor to write a plurality of data words from one or more architectural registers of the processor to a location in the second memory specified by the architectural instruction.
5. The device according to claim 4, characterized in that the second memory comprises:
a first port, providing the N data words of one of the D rows to the N processing units; and
a second port, receiving the plurality of data words for writing to the location in the second memory while the first port provides the N data words to the N processing units.
6. The device according to claim 1, characterized in that the device is comprised in a processor, the processor has an instruction set, and an architectural instruction of the instruction set instructs the processor to write a plurality of weight words from one or more architectural registers of the processor to a location in the first memory specified by the architectural instruction.
7. The device according to claim 6, characterized in that it further comprises:
a buffer, storing N weight words received over the execution of multiple instances of the architectural instruction, for writing to one of the W rows of the first memory.
8. The device according to claim 1, characterized in that the device is comprised in a processor, the processor has an instruction set, and an architectural instruction of the instruction set instructs the processor to read a plurality of data words from a location in the second memory specified by the architectural instruction into one or more architectural registers of the processor.
9. The device according to claim 1, characterized in that W is at least N.
10. The device according to claim 1, characterized in that W is at least 512.
11. A processor, characterized in that it comprises:
an instruction set, having architectural instructions that command the operation of the processor;
an array of N processing units, each processing unit comprising:
an accumulator, having an output;
an arithmetic unit, having first, second and third inputs, performing an operation thereon to produce a result stored to the accumulator, the first input receiving the output of the accumulator;
a weight input, received by the second input to the arithmetic unit; and
a multitask buffer, having first and second data inputs, an output and a control input, the output being received by the third input to the arithmetic unit, the control input controlling the selection between the first and second data inputs;
wherein the output of the multitask buffer is received by the second data input of the multitask buffer of an adjacent processing unit, and when the control inputs select the second data input, the multitask buffers of the N processing units collectively operate as a rotator of N words;
a first memory, holding W rows of N weight words, wherein the N weight words of one of the W rows are provided to the corresponding weight inputs of the N processing units of the processing unit array; and
a second memory, holding D rows of N data words, wherein the N data words of one of the D rows are provided to the corresponding first data inputs of the multitask buffers of the N processing units of the processing unit array.
12. The processor according to claim 11, characterized in that the operation performed by the arithmetic unit multiplies the second and third inputs to produce a product, and accumulates the product with the first input to produce the result.
13. The processor according to claim 11, characterized in that the operation performed by the arithmetic unit produces the difference of the second and third inputs, and accumulates the difference with the first input to produce the result.
14. The processor according to claim 11, characterized in that it further comprises an architectural instruction that instructs the processor to write a plurality of data words from one or more architectural registers of the processor to a location in the second memory specified by the architectural instruction.
15. The processor according to claim 14, characterized in that the second memory comprises:
a first port, providing the N data words of one of the D rows to the N processing units; and
a second port, receiving the plurality of data words for writing to the location in the second memory while the first port provides the N data words to the N processing units.
16. The processor according to claim 11, characterized in that it further comprises an architectural instruction that instructs the processor to write a plurality of weight words from one or more architectural registers of the processor to a location in the first memory specified by the architectural instruction.
17. The processor according to claim 16, characterized in that it further comprises:
a buffer, storing N weight words received over the execution of multiple instances of the architectural instruction, for writing to one of the W rows of the first memory.
18. The processor according to claim 11, characterized in that it further comprises an architectural instruction that instructs the processor to read a plurality of data words from a location in the second memory specified by the architectural instruction into one or more architectural registers of the processor.
19. The processor according to claim 11, characterized in that W is at least N.
20. The processor according to claim 11, characterized in that W is at least 512.
21. A computer program product encoded in at least one non-transitory computer-usable medium for use with a computing device, characterized in that it comprises:
computer-usable program code embodied in the medium for specifying a device, the computer-usable program code comprising:
first program code, specifying an array of N processing units, each processing unit comprising:
an accumulator, having an output;
an arithmetic unit, having first, second and third inputs, performing an operation thereon to produce a result stored to the accumulator, the first input receiving the output of the accumulator;
a weight input, received by the second input to the arithmetic unit; and
a multitask buffer, having first and second data inputs, an output and a control input, the output being received by the third input to the arithmetic unit, the control input controlling the selection between the first and second data inputs;
wherein the output of the multitask buffer is received by the second data input of the multitask buffer of an adjacent processing unit, and when the control inputs select the second data input, the multitask buffers of the N processing units collectively operate as a rotator of N words;
second program code, specifying a first memory, which holds W rows of N weight words and provides the N weight words of one of the W rows to the corresponding weight inputs of the N processing units of the processing unit array; and
third program code, specifying a second memory, which holds D rows of N data words and provides the N data words of one of the D rows to the corresponding first data inputs of the multitask buffers of the N processing units of the processing unit array.
Applications Claiming Priority (48)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201562239254P | 2015-10-08 | 2015-10-08 | |
US62/239,254 | 2015-10-08 | ||
US201562262104P | 2015-12-02 | 2015-12-02 | |
US62/262,104 | 2015-12-02 | ||
US201662299191P | 2016-02-24 | 2016-02-24 | |
US62/299,191 | 2016-02-24 | ||
US15/090,712 | 2016-04-05 | ||
US15/090,794 US10353862B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit that performs stochastic rounding |
US15/090,796 | 2016-04-05 | ||
US15/090,665 US10474627B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with neural memory and array of neural processing units that collectively shift row of data received from neural memory |
US15/090,794 | 2016-04-05 | ||
US15/090,823 | 2016-04-05 | ||
US15/090,666 | 2016-04-05 | ||
US15/090,829 | 2016-04-05 | ||
US15/090,796 US10228911B2 (en) | 2015-10-08 | 2016-04-05 | Apparatus employing user-specified binary point fixed point arithmetic |
US15/090,708 US10346350B2 (en) | 2015-10-08 | 2016-04-05 | Direct execution by an execution unit of a micro-operation loaded into an architectural register file by an architectural instruction of a processor |
US15/090,696 | 2016-04-05 | ||
US15/090,669 | 2016-04-05 | ||
US15/090,672 | 2016-04-05 | ||
US15/090,722 | 2016-04-05 | ||
US15/090,807 US10380481B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit that performs concurrent LSTM cell calculations |
US15/090,705 US10353861B2 (en) | 2015-10-08 | 2016-04-05 | Mechanism for communication between architectural program running on processor and non-architectural program running on execution unit of the processor regarding shared resource |
US15/090,801 | 2016-04-05 | ||
US15/090,708 | 2016-04-05 | ||
US15/090,701 US10474628B2 (en) | 2015-10-08 | 2016-04-05 | Processor with variable rate execution unit |
US15/090,807 | 2016-04-05 | ||
US15/090,823 US10409767B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with neural memory and array of neural processing units and sequencer that collectively shift row of data received from neural memory |
US15/090,691 | 2016-04-05 | ||
US15/090,678 US10509765B2 (en) | 2015-10-08 | 2016-04-05 | Neural processing unit that selectively writes back to neural memory either activation function output or accumulator value |
US15/090,798 US10585848B2 (en) | 2015-10-08 | 2016-04-05 | Processor with hybrid coprocessor/execution unit neural network unit |
US15/090,665 | 2016-04-05 | ||
US15/090,678 | 2016-04-05 | ||
US15/090,691 US10387366B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with shared activation function units |
US15/090,814 | 2016-04-05 | ||
US15/090,829 US10346351B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with output buffer feedback and masking capability with processing unit groups that operate as recurrent neural network LSTM cells |
US15/090,722 US10671564B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit that performs convolutions using collective shift register among array of neural processing units |
US15/090,672 US10353860B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with neural processing units dynamically configurable to process multiple data sizes |
US15/090,666 US10275393B2 (en) | 2015-10-08 | 2016-04-05 | Tri-configuration neural network unit |
US15/090,701 | 2016-04-05 | ||
US15/090,727 US10776690B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with plurality of selectable output functions |
US15/090,727 | 2016-04-05 | ||
US15/090,696 US10380064B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit employing user-supplied reciprocal for normalizing an accumulated value |
US15/090,712 US10366050B2 (en) | 2015-10-08 | 2016-04-05 | Multi-operation neural network unit |
US15/090,801 US10282348B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with output buffer feedback and masking capability |
US15/090,798 | 2016-04-05 | ||
US15/090,814 US10552370B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with output buffer feedback for performing recurrent neural network computations |
US15/090,705 | 2016-04-05 | ||
US15/090,669 US10275394B2 (en) | 2015-10-08 | 2016-04-05 | Processor with architectural neural network execution unit |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106599991A true CN106599991A (en) | 2017-04-26 |
CN106599991B CN106599991B (en) | 2019-04-09 |
Family
ID=58556056
Family Applications (6)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610866452.7A Active CN106599992B (en) | 2015-10-08 | 2016-09-29 | The neural network unit operated using processing unit group as time recurrent neural network shot and long term memory cell |
CN201610866129.XA Active CN106599991B (en) | 2015-10-08 | 2016-09-29 | The neural pe array that neural network unit and collective with neural memory will be shifted from the data of neural memory column |
CN201610866130.2A Active CN106650923B (en) | 2015-10-08 | 2016-09-29 | Neural network unit with neural memory and neural processing unit and sequencer |
CN201610866127.0A Active CN106599990B (en) | 2015-10-08 | 2016-09-29 | The neural pe array that neural network unit and collective with neural memory will be shifted from the data of neural memory column |
CN201610866030.XA Active CN106599989B (en) | 2015-10-08 | 2016-09-29 | Neural network unit and neural pe array |
CN201610866026.3A Active CN106598545B (en) | 2015-10-08 | 2016-09-29 | Processor and method for communicating shared resources and non-transitory computer usable medium |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610866452.7A Active CN106599992B (en) | 2015-10-08 | 2016-09-29 | The neural network unit operated using processing unit group as time recurrent neural network shot and long term memory cell |
Family Applications After (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610866130.2A Active CN106650923B (en) | 2015-10-08 | 2016-09-29 | Neural network unit with neural memory and neural processing unit and sequencer |
CN201610866127.0A Active CN106599990B (en) | 2015-10-08 | 2016-09-29 | The neural pe array that neural network unit and collective with neural memory will be shifted from the data of neural memory column |
CN201610866030.XA Active CN106599989B (en) | 2015-10-08 | 2016-09-29 | Neural network unit and neural pe array |
CN201610866026.3A Active CN106598545B (en) | 2015-10-08 | 2016-09-29 | Processor and method for communicating shared resources and non-transitory computer usable medium |
Country Status (2)
Country | Link |
---|---|
CN (6) | CN106599992B (en) |
TW (7) | TWI616825B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111078286A (en) * | 2018-10-19 | 2020-04-28 | 上海寒武纪信息科技有限公司 | Data communication method, computing system and storage medium |
Families Citing this family (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11615285B2 (en) | 2017-01-06 | 2023-03-28 | Ecole Polytechnique Federale De Lausanne (Epfl) | Generating and identifying functional subnetworks within structural networks |
US10481870B2 (en) | 2017-05-12 | 2019-11-19 | Google Llc | Circuit to perform dual input value absolute value and sum operation |
US10019668B1 (en) * | 2017-05-19 | 2018-07-10 | Google Llc | Scheduling neural network processing |
CN107315710B (en) | 2017-06-27 | 2020-09-11 | 上海兆芯集成电路有限公司 | Method and device for calculating full-precision numerical value and partial-precision numerical value |
CN107291420B (en) | 2017-06-27 | 2020-06-05 | 上海兆芯集成电路有限公司 | Device for integrating arithmetic and logic processing |
TWI680409B (en) * | 2017-07-08 | 2019-12-21 | 英屬開曼群島商意騰科技股份有限公司 | Method for matrix by vector multiplication for use in artificial neural network |
TWI687873B (en) * | 2017-08-09 | 2020-03-11 | 美商谷歌有限責任公司 | Computing unit for accelerating neural networks |
US10079067B1 (en) * | 2017-09-07 | 2018-09-18 | Winbond Electronics Corp. | Data read method and a non-volatile memory apparatus using the same |
US11507806B2 (en) * | 2017-09-08 | 2022-11-22 | Rohit Seth | Parallel neural processor for Artificial Intelligence |
CN109472344A (en) * | 2017-09-08 | 2019-03-15 | 光宝科技股份有限公司 | The design method of neural network system |
CN109697509B (en) * | 2017-10-24 | 2020-10-20 | 上海寒武纪信息科技有限公司 | Processing method and device, and operation method and device |
CN109960673B (en) * | 2017-12-14 | 2020-02-18 | 中科寒武纪科技股份有限公司 | Integrated circuit chip device and related product |
CN108288091B (en) * | 2018-01-19 | 2020-09-11 | 上海兆芯集成电路有限公司 | Microprocessor for booth multiplication |
US20190251429A1 (en) * | 2018-02-12 | 2019-08-15 | Kneron, Inc. | Convolution operation device and method of scaling convolution input for convolution neural network |
CN111767998B (en) * | 2018-02-27 | 2024-05-14 | 上海寒武纪信息科技有限公司 | Integrated circuit chip device and related products |
CN110197271B (en) * | 2018-02-27 | 2020-10-27 | 上海寒武纪信息科技有限公司 | Integrated circuit chip device and related product |
CN110197270B (en) * | 2018-02-27 | 2020-10-30 | 上海寒武纪信息科技有限公司 | Integrated circuit chip device and related product |
TWI664585B (en) * | 2018-03-30 | 2019-07-01 | 國立臺灣大學 | Method of Neural Network Training Using Floating-Point Signed Digit Representation |
US10522226B2 (en) * | 2018-05-01 | 2019-12-31 | Silicon Storage Technology, Inc. | Method and apparatus for high voltage generation for analog neural memory in deep learning artificial neural network |
TWI650769B (en) * | 2018-05-22 | 2019-02-11 | 華邦電子股份有限公司 | Memory device and programming method for memory cell array |
US11893471B2 (en) | 2018-06-11 | 2024-02-06 | Inait Sa | Encoding and decoding information and artificial neural networks |
US11972343B2 (en) | 2018-06-11 | 2024-04-30 | Inait Sa | Encoding and decoding information |
US11663478B2 (en) | 2018-06-11 | 2023-05-30 | Inait Sa | Characterizing activity in a recurrent artificial neural network |
JP2020004247A (en) * | 2018-06-29 | 2020-01-09 | ソニー株式会社 | Information processing apparatus, information processing method, and program |
US10418109B1 (en) | 2018-07-26 | 2019-09-17 | Winbond Electronics Corp. | Memory device and programming method of memory cell array |
CN108984426B (en) * | 2018-08-03 | 2021-01-26 | 北京字节跳动网络技术有限公司 | Method and apparatus for processing data |
US11449756B2 (en) * | 2018-09-24 | 2022-09-20 | Samsung Electronics Co., Ltd. | Method to balance sparsity for efficient inference of deep neural networks |
CN109376853B (en) * | 2018-10-26 | 2021-09-24 | 电子科技大学 | Echo state neural network output axon circuit |
KR20200061164A (en) * | 2018-11-23 | 2020-06-02 | 삼성전자주식회사 | Neural network device for neural network operation, operating method of neural network device and application processor comprising neural network device |
US10867399B2 (en) | 2018-12-02 | 2020-12-15 | Himax Technologies Limited | Image processing circuit for convolutional neural network |
TWI694413B (en) * | 2018-12-12 | 2020-05-21 | 奇景光電股份有限公司 | Image processing circuit |
US11652603B2 (en) | 2019-03-18 | 2023-05-16 | Inait Sa | Homomorphic encryption |
US11569978B2 (en) | 2019-03-18 | 2023-01-31 | Inait Sa | Encrypting and decrypting information |
US11797827B2 (en) | 2019-12-11 | 2023-10-24 | Inait Sa | Input into a neural network |
US20210182655A1 (en) * | 2019-12-11 | 2021-06-17 | Inait Sa | Robust recurrent artificial neural networks |
US11816553B2 (en) | 2019-12-11 | 2023-11-14 | Inait Sa | Output from a recurrent neural network |
US11651210B2 (en) | 2019-12-11 | 2023-05-16 | Inait Sa | Interpreting and improving the processing results of recurrent neural networks |
US11580401B2 (en) | 2019-12-11 | 2023-02-14 | Inait Sa | Distance metrics and clustering in recurrent neural networks |
TWI722797B (en) | 2020-02-17 | 2021-03-21 | 財團法人工業技術研究院 | Computation operator in memory and operation method thereof |
RU2732201C1 (en) * | 2020-02-17 | 2020-09-14 | Российская Федерация, от имени которой выступает ФОНД ПЕРСПЕКТИВНЫХ ИССЛЕДОВАНИЙ | Method for constructing processors for output in convolutional neural networks based on data-flow computing |
CN111898752A (en) * | 2020-08-03 | 2020-11-06 | 乐鑫信息科技(上海)股份有限公司 | Apparatus and method for performing LSTM neural network operations |
TWI742802B (en) | 2020-08-18 | 2021-10-11 | 創鑫智慧股份有限公司 | Matrix calculation device and operation method thereof |
TWI746126B (en) | 2020-08-25 | 2021-11-11 | 創鑫智慧股份有限公司 | Matrix multiplication device and operation method thereof |
TWI798798B (en) * | 2020-09-08 | 2023-04-11 | 旺宏電子股份有限公司 | In-memory computing method and in-memory computing apparatus |
TWI775170B (en) * | 2020-09-30 | 2022-08-21 | 新漢股份有限公司 | Method for cpu to execute artificial intelligent related processes |
US11657864B1 (en) * | 2021-12-17 | 2023-05-23 | Winbond Electronics Corp. | In-memory computing apparatus and computing method having a memory array includes a shifted weight storage, shift information storage and shift restoration circuit to restore a weigh shifted amount of shifted sum-of-products to generate multiple restored sum-of-products |
TWI830669B (en) * | 2023-02-22 | 2024-01-21 | 旺宏電子股份有限公司 | Encoding method and encoding circuit |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1056356A (en) * | 1990-03-16 | 1991-11-20 | 德克萨斯仪器股份有限公司 | Distributed processing memory |
US6138136A (en) * | 1996-06-26 | 2000-10-24 | U.S. Philips Corporation | Signal processor |
Family Cites Families (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5138695A (en) * | 1989-10-10 | 1992-08-11 | Hnc, Inc. | Systolic array image processing system |
US5563982A (en) * | 1991-01-31 | 1996-10-08 | Ail Systems, Inc. | Apparatus and method for detection of molecular vapors in an atmospheric region |
TW279231B (en) * | 1995-04-18 | 1996-06-21 | Nat Science Council | This invention is related to a new neural network for prediction |
US5956703A (en) * | 1995-07-28 | 1999-09-21 | Delco Electronics Corporation | Configurable neural network integrated circuit |
TW337568B (en) * | 1996-10-11 | 1998-08-01 | Apex Semiconductor Inc | Pseudo cache DRAM controller with packet command protocol |
US6216119B1 (en) * | 1997-11-19 | 2001-04-10 | Netuitive, Inc. | Multi-kernel neural network concurrent learning, monitoring, and forecasting system |
US6557096B1 (en) * | 1999-10-25 | 2003-04-29 | Intel Corporation | Processors with data typer and aligner selectively coupling data bits of data buses to adder and multiplier functional blocks to execute instructions with flexible data types |
US8660939B2 (en) * | 2000-05-17 | 2014-02-25 | Timothy D. Allen | Method for mortgage customer retention |
US6581131B2 (en) * | 2001-01-09 | 2003-06-17 | Hewlett-Packard Development Company, L.P. | Method and apparatus for efficient cache mapping of compressed VLIW instructions |
US6782375B2 (en) * | 2001-01-16 | 2004-08-24 | Providian Bancorp Services | Neural network based decision processor and method |
US7146486B1 (en) * | 2003-01-29 | 2006-12-05 | S3 Graphics Co., Ltd. | SIMD processor with scalar arithmetic logic units |
US7689641B2 (en) * | 2003-06-30 | 2010-03-30 | Intel Corporation | SIMD integer multiply high with round and shift |
US7421565B1 (en) * | 2003-08-18 | 2008-09-02 | Cray Inc. | Method and apparatus for indirectly addressed vector load-add -store across multi-processors |
CN1306395C (en) * | 2004-02-13 | 2007-03-21 | 中国科学院计算技术研究所 | Processor extended instruction of MIPS instruction set, encoding method and component thereof |
CN1658153B (en) * | 2004-02-18 | 2010-04-28 | 联发科技股份有限公司 | Compound dynamic preset number representation and algorithm, and its processor structure |
JP2006004042A (en) * | 2004-06-16 | 2006-01-05 | Renesas Technology Corp | Data processor |
CN100383781C (en) * | 2004-11-26 | 2008-04-23 | 北京天碁科技有限公司 | Cholesky decomposition algorithm device |
US7743233B2 (en) * | 2005-04-05 | 2010-06-22 | Intel Corporation | Sequencer address management |
US7512573B2 (en) * | 2006-10-16 | 2009-03-31 | Alcatel-Lucent Usa Inc. | Optical processor for an artificial neural network |
US8145887B2 (en) * | 2007-06-15 | 2012-03-27 | International Business Machines Corporation | Enhanced load lookahead prefetch in single threaded mode for a simultaneous multithreaded microprocessor |
TW200923803A (en) * | 2007-11-26 | 2009-06-01 | Univ Nat Taipei Technology | Hardware neural network learning and recall architecture |
CN101625735A (en) * | 2009-08-13 | 2010-01-13 | 西安理工大学 | FPGA implementation method of recurrent neural network based on LS-SVM classification and regression learning |
US8380138B2 (en) * | 2009-10-21 | 2013-02-19 | Qualcomm Incorporated | Duty cycle correction circuitry |
US20120066163A1 (en) * | 2010-09-13 | 2012-03-15 | Nottingham Trent University | Time to event data analysis method and system |
US8880851B2 (en) * | 2011-04-07 | 2014-11-04 | Via Technologies, Inc. | Microprocessor that performs X86 ISA and ARM ISA machine language program instructions by hardware translation into microinstructions executed by common execution pipeline |
EP2508980B1 (en) * | 2011-04-07 | 2018-02-28 | VIA Technologies, Inc. | Conditional ALU instruction pre-shift-generated carry flag propagation between microinstructions in read-port limited register file microprocessor |
CN102402415B (en) * | 2011-10-21 | 2013-07-17 | 清华大学 | Device and method for buffering data in a dynamically reconfigurable array |
US9251116B2 (en) * | 2011-11-30 | 2016-02-02 | International Business Machines Corporation | Direct interthread communication dataport pack/unpack and load/save |
US9235414B2 (en) * | 2011-12-19 | 2016-01-12 | Intel Corporation | SIMD integer multiply-accumulate instruction for multi-precision arithmetic |
US9207646B2 (en) * | 2012-01-20 | 2015-12-08 | Mediatek Inc. | Method and apparatus of estimating/calibrating TDC gain |
TWI602181B (en) * | 2012-02-29 | 2017-10-11 | 三星電子股份有限公司 | Memory system and method for operating test device to transmit fail address to memory device |
CN102665049B (en) * | 2012-03-29 | 2014-09-17 | 中国科学院半导体研究所 | Programmable visual chip-based visual image processing system |
CN103019656B (en) * | 2012-12-04 | 2016-04-27 | 中国科学院半导体研究所 | Dynamically reconfigurable multi-stage parallel single-instruction multiple-data array processing system |
US9483263B2 (en) * | 2013-03-26 | 2016-11-01 | Via Technologies, Inc. | Uncore microcode ROM |
US9792121B2 (en) * | 2013-05-21 | 2017-10-17 | Via Technologies, Inc. | Microprocessor that fuses if-then instructions |
CN104216866B (en) * | 2013-05-31 | 2018-01-23 | 深圳市海思半导体有限公司 | Data processing device |
EP2843550B1 (en) * | 2013-08-28 | 2018-09-12 | VIA Technologies, Inc. | Dynamic reconfiguration of multi-core processor |
US9286268B2 (en) * | 2013-12-12 | 2016-03-15 | Brno University of Technology | Method and an apparatus for fast convolution of signals with a one-sided exponential function |
CN104952448A (en) * | 2015-05-04 | 2015-09-30 | 张爱英 | Method and system for feature enhancement using bidirectional long short-term memory recurrent neural networks |
- 2016-09-29 CN CN201610866452.7A patent/CN106599992B/en active Active
- 2016-09-29 CN CN201610866129.XA patent/CN106599991B/en active Active
- 2016-09-29 CN CN201610866130.2A patent/CN106650923B/en active Active
- 2016-09-29 CN CN201610866127.0A patent/CN106599990B/en active Active
- 2016-09-29 CN CN201610866030.XA patent/CN106599989B/en active Active
- 2016-09-29 CN CN201610866026.3A patent/CN106598545B/en active Active
- 2016-10-04 TW TW105132064A patent/TWI616825B/en active
- 2016-10-04 TW TW105132059A patent/TWI650707B/en active
- 2016-10-04 TW TW105132063A patent/TWI591539B/en active
- 2016-10-04 TW TW105132061A patent/TWI626587B/en active
- 2016-10-04 TW TW105132062A patent/TWI601062B/en active
- 2016-10-04 TW TW105132065A patent/TWI579694B/en active
- 2016-10-04 TW TW105132058A patent/TWI608429B/en active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1056356A (en) * | 1990-03-16 | 1991-11-20 | Texas Instruments Inc. | Distributed processing memory |
US6138136A (en) * | 1996-06-26 | 2000-10-24 | U.S. Philips Corporation | Signal processor |
Non-Patent Citations (2)
Title |
---|
HADI ESMAEILZADEH ET AL.: "Neural Acceleration for General-Purpose Approximate Programs", 2012 IEEE/ACM 45th Annual International Symposium on Microarchitecture * |
MAURICE PEEMEN ET AL.: "Memory-Centric Accelerator Design for Convolutional Neural Networks", 2013 IEEE 31st International Conference on Computer Design (ICCD) * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111078286A (en) * | 2018-10-19 | 2020-04-28 | Shanghai Cambricon Information Technology Co., Ltd. | Data communication method, computing system and storage medium |
CN111078286B (en) * | 2018-10-19 | 2023-09-01 | Shanghai Cambricon Information Technology Co., Ltd. | Data communication method, computing system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
TWI608429B (en) | 2017-12-11 |
CN106599989A (en) | 2017-04-26 |
TWI626587B (en) | 2018-06-11 |
CN106650923A (en) | 2017-05-10 |
CN106599989B (en) | 2019-04-09 |
CN106650923B (en) | 2019-04-09 |
TW201714120A (en) | 2017-04-16 |
TWI650707B (en) | 2019-02-11 |
CN106599990B (en) | 2019-04-09 |
CN106598545A (en) | 2017-04-26 |
CN106598545B (en) | 2020-04-14 |
TWI591539B (en) | 2017-07-11 |
TW201714091A (en) | 2017-04-16 |
TW201714080A (en) | 2017-04-16 |
CN106599992A (en) | 2017-04-26 |
TWI616825B (en) | 2018-03-01 |
TW201714081A (en) | 2017-04-16 |
TW201714078A (en) | 2017-04-16 |
CN106599990A (en) | 2017-04-26 |
TW201714079A (en) | 2017-04-16 |
CN106599992B (en) | 2019-04-09 |
TWI601062B (en) | 2017-10-01 |
TW201714119A (en) | 2017-04-16 |
CN106599991B (en) | 2019-04-09 |
TWI579694B (en) | 2017-04-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106485323B (en) | Neural network unit with output buffer feedback for performing recurrent neural network computations | |
CN106599990B (en) | Neural network unit with neural memory and array of neural processing units that collectively shift row of data received from neural memory | |
CN107844830A (en) | Neural network unit with mixed data-size and weight-size computing capability | |
CN108268944A (en) | Neural network unit with re-shapable memory | |
CN108268932A (en) | Neural network unit | |
CN108268945A (en) | Neural network unit with segmentable array width rotator | |
CN108268946A (en) | Neural network unit with segmentable array width rotator | |
CN108133263A (en) | Neural network unit | |
CN108133262A (en) | Neural network unit with memory layout for performing efficient 3-dimensional convolutions | |
CN108133264A (en) | Neural network unit that performs efficient 3-dimensional convolutions | |
CN108564169A (en) | Hardware processing unit, neural network unit and computer usable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder | Address after: Room 301, 2537 Jinke Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201203; Patentee after: Shanghai Zhaoxin Semiconductor Co.,Ltd. Address before: Room 301, 2537 Jinke Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201203; Patentee before: VIA ALLIANCE SEMICONDUCTOR Co.,Ltd. |