CN106599992B - Neural network unit that operates groups of processing units as long short-term memory cells of a recurrent neural network - Google Patents
Neural network unit that operates groups of processing units as long short-term memory cells of a recurrent neural network
- Publication number
- CN106599992B (application CN201610866452.7A / CN201610866452A)
- Authority
- CN
- China
- Prior art keywords
- processing unit
- output
- word
- input
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
- G06F7/575—Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3893—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/468—Specific access rights for resources, e.g. using capability register
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Neurology (AREA)
- Advance Control (AREA)
- Complex Calculations (AREA)
- Memory System (AREA)
- Executing Machine-Instructions (AREA)
- Multi Processors (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Devices For Executing Special Programs (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
An output buffer holds N words distributed into N/J mutually exclusive output buffer word groups. N processing units are distributed into N/J corresponding mutually exclusive processing unit groups. Each processing unit includes first and second multiplexed registers, an accumulator, and an arithmetic unit. Each multiplexed register includes J+1 inputs: a first input receives an operand from a memory, and the other inputs receive the words of the corresponding output buffer word group. The accumulator provides its output to the corresponding output buffer word group. The arithmetic unit performs an operation on the outputs of the multiplexed registers and the accumulator to generate a result that is added to the accumulator. The output buffer includes a mask input that controls which of the words retain their current values and which are updated with their corresponding accumulator outputs. The processing unit groups operate as long short-term memory cells of a recurrent neural network.
Description
Technical field
The present invention relates to a processor, and more particularly to a processor that improves the performance and efficiency of artificial neural network computations.
This application claims priority to the United States provisional applications listed below, each of which is incorporated herein by reference in its entirety.
This application is related to the concurrently filed US applications listed below, each of which is incorporated herein by reference in its entirety.
Background
In recent years, artificial neural networks (ANNs) have attracted renewed attention, in research commonly referred to as deep learning, computer learning, and similar terms. The increase in general-purpose processor computing power has, after several decades, revived interest in artificial neural networks. Recent applications of artificial neural networks include speech and image recognition, among others. The demand for greater performance and efficiency of artificial neural network computations appears to be growing.
Summary of the invention
In view of this, the present invention provides a device. The device includes an output buffer and an array of N processing units. The output buffer holds N words, and the N words are distributed into N/J mutually exclusive output buffer word groups, each having J of the N words, where J is greater than 2 and N is at least twice J. The N processing units of the array are distributed into N/J mutually exclusive processing unit groups, each having J of the N processing units and corresponding to one of the N/J output buffer word groups. Each processing unit includes first and second multiplexed registers, an accumulator, and an arithmetic unit. Each multiplexed register includes at least J+1 inputs, an output, and a control input. A first of the J+1 inputs receives an operand from a memory, and the other J inputs receive the J words of the corresponding output buffer word group; the control input controls which of the J+1 inputs is selected for provision on the output. The accumulator has an output that is provided to a corresponding one of the N output buffer words. The arithmetic unit has first, second and third inputs that receive, respectively, the outputs of the first and second multiplexed registers and the output of the accumulator; the arithmetic unit performs an operation on the first, second and third inputs to generate a result that is added into the accumulator. The output buffer further includes a mask input that controls which of the N words retain their current values and which are updated with their corresponding accumulator outputs. Each of the N/J processing unit groups of J processing units operates as a long short-term memory (LSTM) cell of a recurrent neural network: a first of the J processing units computes an input gate of the LSTM cell, a second of the J processing units computes a forget gate of the LSTM cell, and a third of the J processing units computes an output gate of the LSTM cell.
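For orientation, the input gate, forget gate and output gate named above are those of the standard long short-term memory cell. The equations below give only the conventional textbook formulation as a point of reference; they are not recited in this summary, and the symbols (weight matrices W and U, biases b, cell input x_t, previous cell output h_{t-1}, cell state c_t) are assumptions used purely for illustration:
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)}\\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)}\\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)}\\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}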
The present invention also provides a method of operating a device. The device has an output buffer and an array of N processing units. The output buffer holds N words distributed into N/J mutually exclusive output buffer word groups, each having J of the N words, where J is greater than 2 and N is at least twice J. The N processing units are distributed into N/J mutually exclusive processing unit groups, each having J of the N processing units and corresponding to one of the N/J output buffer word groups. The output buffer includes a mask input that controls which of the N words retain their current values and which are updated with their corresponding accumulator outputs. Each processing unit has first and second multiplexed registers, an accumulator, and an arithmetic unit. Each multiplexed register has an output. The accumulator has an output that is provided to a corresponding one of the N output buffer words. The arithmetic unit has first, second and third inputs that receive, respectively, the outputs of the first and second multiplexed registers and the output of the accumulator; the arithmetic unit performs an operation on the first, second and third inputs to generate a result that is added into the accumulator. Each of the first and second multiplexed registers includes at least J+1 inputs, the aforementioned output, and a control input; a first of the J+1 inputs receives an operand from a memory, the other J inputs receive the J words of the corresponding output buffer word group, and the control input controls which of the J+1 inputs is selected for provision on the output. The method includes operating each of the N/J processing unit groups of J processing units as a long short-term memory cell of a recurrent neural network, performing the following steps: computing an input gate of the long short-term memory cell with a first of the J processing units; computing a forget gate of the long short-term memory cell with a second of the J processing units; and computing an output gate of the long short-term memory cell with a third of the J processing units.
The present invention also provides a computer program product encoded on at least one non-transitory computer usable medium for use with a computing device. The computer program product includes computer usable program code embodied in the medium for specifying a device. The computer usable program code includes first program code and second program code. The first program code specifies an output buffer that holds N words, the N words being distributed into N/J mutually exclusive output buffer word groups, each having J of the N words, where J is greater than 2 and N is at least twice J. The second program code specifies an array of N processing units distributed into N/J mutually exclusive processing unit groups, each having J of the N processing units and corresponding to one of the N/J output buffer word groups; each processing unit includes first and second multiplexed registers, an accumulator, and an arithmetic unit. Each multiplexed register includes at least J+1 inputs, an output, and a control input. A first of the J+1 inputs receives an operand from a memory, and the other J inputs receive the J words of the corresponding output buffer word group; the control input controls which of the J+1 inputs is selected for provision on the output. The accumulator has an output that is provided to a corresponding one of the N output buffer words. The arithmetic unit has first, second and third inputs that receive, respectively, the outputs of the first and second multiplexed registers and the output of the accumulator; the arithmetic unit performs an operation on the first, second and third inputs to generate a result that is added into the accumulator. The output buffer further includes a mask input that controls which of the N words retain their current values and which are updated with their corresponding accumulator outputs. Each of the N/J processing unit groups of J processing units operates as a long short-term memory cell of a recurrent neural network: a first of the J processing units computes an input gate of the long short-term memory cell, a second computes a forget gate of the long short-term memory cell, and a third computes an output gate of the long short-term memory cell.
Specific embodiments of the present invention are described in further detail below with reference to the embodiments and the accompanying drawings.
Brief description of the drawings
Fig. 1 is a block diagram of a processor that includes a neural network unit (NNU).
Fig. 2 is a block diagram of a neural processing unit (NPU) of Fig. 1.
Fig. 3 is a block diagram illustrating the use of the N multiplexed registers of the N neural processing units of the neural network unit of Fig. 1 to operate as an N-word rotator, or circular shifter, on a row of data words received from the data random access memory of Fig. 1.
Fig. 4 is a table showing a program stored in the program memory of the neural network unit of Fig. 1 and executed by that neural network unit.
Fig. 5 is a timing diagram of the neural network unit executing the program of Fig. 4.
Fig. 6A is a block diagram of the neural network unit of Fig. 1 executing the program of Fig. 4.
Fig. 6B is a flowchart of the processor of Fig. 1 executing an architectural program that uses the neural network unit to perform the classic multiply-accumulate-activation function computations of neurons of a hidden layer of an artificial neural network, such as are performed by the program of Fig. 4.
Fig. 7 is a block diagram of another embodiment of a neural processing unit of Fig. 1.
Fig. 8 is a block diagram of yet another embodiment of a neural processing unit of Fig. 1.
Fig. 9 is a table showing a program stored in the program memory of the neural network unit of Fig. 1 and executed by that neural network unit.
Fig. 10 is a timing diagram of the neural network unit executing the program of Fig. 9.
Fig. 11 is a block diagram of an embodiment of the neural network unit of Fig. 1. In the embodiment of Fig. 11, a neuron is split into two parts, an activation function unit part and an arithmetic logic unit part (which also includes the shift register part), and each activation function unit part is shared by multiple arithmetic logic unit parts.
Fig. 12 is a timing diagram of the neural network unit of Fig. 11 executing the program of Fig. 4.
Fig. 13 is a timing diagram of the neural network unit of Fig. 11 executing the program of Fig. 4.
Fig. 14 is a block diagram illustrating a move to neural network (MTNN) architectural instruction and its operation with respect to portions of the neural network unit of Fig. 1.
Fig. 15 is a block diagram illustrating a move from neural network (MFNN) architectural instruction and its operation with respect to portions of the neural network unit of Fig. 1.
Fig. 16 is a block diagram of an embodiment of the data random access memory of Fig. 1.
Fig. 17 is a block diagram of an embodiment of the weight random access memory and buffer of Fig. 1.
Fig. 18 is a block diagram of a dynamically configurable neural processing unit of Fig. 1.
Fig. 19 is a block diagram illustrating, according to the embodiment of Fig. 18, the use of the 2N multiplexed registers of the N neural processing units of the neural network unit of Fig. 1 to operate as a rotator on a row of data words received from the data random access memory of Fig. 1.
Fig. 20 is a table showing a program stored in the program memory of the neural network unit of Fig. 1 and executed by that neural network unit, which has neural processing units according to the embodiment of Fig. 18.
Fig. 21 is a timing diagram of the neural network unit executing the program of Fig. 20, the neural network unit having neural processing units of Fig. 18 operating in a narrow configuration.
Fig. 22 is a block diagram of the neural network unit of Fig. 1, having the neural processing units of Fig. 18, executing the program of Fig. 20.
Fig. 23 is a block diagram of another embodiment of a dynamically configurable neural processing unit of Fig. 1.
Fig. 24 is a block diagram of an example of a data structure used by the neural network unit of Fig. 1 to perform a convolution operation.
Fig. 25 is a flowchart of the processor of Fig. 1 executing an architectural program that uses the neural network unit to perform a convolution of a convolution kernel with the data array of Fig. 24.
Fig. 26A is a program listing of a neural network unit program that performs a convolution of a data matrix with the convolution kernel of Fig. 24 and writes the result back to the weight random access memory.
Fig. 26B is a block diagram of an embodiment of certain fields of the control register of the neural network unit of Fig. 1.
Fig. 27 is a block diagram of an example of the weight random access memory of Fig. 1 populated with input data upon which the neural network unit of Fig. 1 performs a pooling operation.
Fig. 28 is a program listing of a neural network unit program that performs a pooling operation on the input data matrix of Fig. 27 and writes the result back to the weight random access memory.
Fig. 29A is a block diagram of an embodiment of the control register of Fig. 1.
Fig. 29B is a block diagram of another embodiment of the control register of Fig. 1.
Fig. 29C is a block diagram of an embodiment in which the reciprocal of Fig. 29A is stored in two parts.
Fig. 30 is a block diagram of an embodiment of the activation function unit (AFU) of Fig. 2.
Fig. 31 is an example of the operation of the activation function unit of Fig. 30.
Fig. 32 is a second example of the operation of the activation function unit of Fig. 30.
Fig. 33 is a third example of the operation of the activation function unit of Fig. 30.
Fig. 34 is a block diagram illustrating a portion of the processor of Fig. 1 and of the neural network unit of Fig. 1 in greater detail.
Fig. 35 is a block diagram of a processor with a variable rate neural network unit.
Fig. 36A is a timing diagram of an example of the processor with the neural network unit operating in normal mode, i.e., at the primary clock rate.
Fig. 36B is a timing diagram of an example of the processor with the neural network unit operating in relaxed mode, i.e., at a clock rate lower than the primary clock rate.
Fig. 37 is a flowchart illustrating the operation of the processor of Fig. 35.
Fig. 38 is a block diagram illustrating the sequencer of the neural network unit in greater detail.
Fig. 39 is a block diagram illustrating certain fields of the control and status registers of the neural network unit.
Fig. 40 is a block diagram of an example of an Elman recurrent neural network (RNN).
Fig. 41 is a block diagram of an example of the arrangement of data within the data random access memory and weight random access memory of the neural network unit as it performs calculations associated with the Elman recurrent neural network of Fig. 40.
Fig. 42 is a table showing a program stored in the program memory of the neural network unit, which is executed by the neural network unit and uses data and weights according to the arrangement of Fig. 41 to accomplish the Elman recurrent neural network.
Fig. 43 is a block diagram of an example of a Jordan recurrent neural network.
Fig. 44 is a block diagram of an example of the arrangement of data within the data random access memory and weight random access memory of the neural network unit as it performs calculations associated with the Jordan recurrent neural network of Fig. 43.
Fig. 45 is a table showing a program stored in the program memory of the neural network unit, which is executed by the neural network unit and uses data and weights according to the arrangement of Fig. 44 to accomplish the Jordan recurrent neural network.
Fig. 46 is a block diagram of an embodiment of a long short-term memory (LSTM) cell.
Fig. 47 is a block diagram of an example of the arrangement of data within the data random access memory and weight random access memory of the neural network unit as it performs calculations associated with the layer of long short-term memory cells of Fig. 46.
Fig. 48 is a table showing a program stored in the program memory of the neural network unit, which is executed by the neural network unit and uses data and weights according to the arrangement of Fig. 47 to accomplish the calculations associated with the layer of long short-term memory cells.
Fig. 49 is a block diagram of an embodiment of a neural network unit whose neural processing unit groups have output buffer masking and feedback capability.
Fig. 50 is a block diagram of an example of the arrangement of data within the data random access memory, weight random access memory and output buffer of the neural network unit of Fig. 49 as it performs calculations associated with the layer of long short-term memory cells of Fig. 46.
Fig. 51 is a table showing a program stored in the program memory of the neural network unit, which is executed by the neural network unit of Fig. 49 and uses data and weights according to the arrangement of Fig. 50 to accomplish the calculations associated with the layer of long short-term memory cells.
Fig. 52 is a block diagram of an embodiment of a neural network unit whose neural processing unit groups have output buffer masking and feedback capability and share activation function units.
Fig. 53 is a block diagram of another example of the arrangement of data within the data random access memory, weight random access memory and output buffer of the neural network unit of Fig. 49 as it performs calculations associated with the layer of long short-term memory cells of Fig. 46.
Fig. 54 is a table showing a program stored in the program memory of the neural network unit, which is executed by the neural network unit of Fig. 49 and uses data and weights according to the arrangement of Fig. 53 to accomplish the calculations associated with the layer of long short-term memory cells.
Fig. 55 is a block diagram of portions of a neural processing unit according to another embodiment of the present invention.
Fig. 56 is a block diagram of an example of the arrangement of data within the data random access memory and weight random access memory of the neural network unit as it performs calculations associated with the Jordan recurrent neural network of Fig. 43 using the embodiment of Fig. 55.
Fig. 57 is a table showing a program stored in the program memory of the neural network unit, which is executed by the neural network unit and uses data and weights according to the arrangement of Fig. 56 to accomplish the Jordan recurrent neural network.
Detailed description of the embodiments
Processor with architectural neural network unit
Fig. 1 is a block diagram of a processor 100 that includes a neural network unit (NNU) 121. As shown, the processor 100 includes an instruction fetch unit 101, an instruction cache 102, an instruction translator 104, a rename unit 106, reservation stations 108, media registers 118, general purpose registers 116, execution units 112 other than the aforementioned neural network unit 121, and a memory subsystem 114.
The processor 100 is an electronic device that serves as a central processing unit on an integrated circuit. The processor 100 receives digital data as input, processes the data according to instructions fetched from memory, and generates as output the results of the operations prescribed by the instructions. The processor 100 may be employed in a desktop computer, mobile device or tablet computer, and may be used for applications such as computation, word processing, multimedia display, and web browsing. The processor 100 may also be disposed in an embedded system to control a wide variety of devices, including appliances, mobile phones, smart phones, vehicles, and industrial controllers. A central processing unit is the electronic circuitry (i.e., hardware) that executes the instructions of a computer program (also known as a computer application or application program) by performing operations on data, including arithmetic, logical, and input/output operations. An integrated circuit is a set of electronic circuits fabricated on a small piece of semiconductor material, typically silicon. An integrated circuit is also commonly referred to as a chip, a microchip, or a die.
The instruction fetch unit 101 controls the fetching of architectural instructions 103 from system memory (not shown) into the instruction cache 102. The instruction fetch unit 101 provides a fetch address to the instruction cache 102 that specifies the memory address of the cache line of architectural instruction bytes to be fetched into the instruction cache 102. The fetch address is selected based on the current value of the instruction pointer (not shown), or program counter, of the processor 100. Generally, the program counter is incremented sequentially by instruction size unless a control instruction, such as a branch, call or return, is encountered in the instruction stream, or an exceptional condition occurs, such as an interrupt, trap, exception or fault, in which case the program counter is updated with a non-sequential address, such as a branch target address, return address or exception vector. Generally speaking, the program counter is updated in response to the execution of instructions by the execution units 112/121. The program counter may also be updated when an exceptional condition is detected, such as when the instruction translator 104 encounters an instruction 103 that is not defined in the instruction set architecture of the processor 100.
The instruction cache 102 stores architectural instructions 103 fetched from a system memory coupled to the processor 100. These architectural instructions 103 include a move to neural network (MTNN) instruction and a move from neural network (MFNN) instruction, which are described in more detail below. In one embodiment, the architectural instructions 103 are instructions of the x86 instruction set architecture, supplemented with the MTNN and MFNN instructions. In this disclosure, an x86 instruction set architecture processor is understood to be a processor that, when executing the same machine language instructions, generates the same results at the instruction set architecture level as an Intel x86 processor. However, other instruction set architectures, such as the Advanced RISC Machines (ARM) architecture, the Scalable Processor Architecture (SPARC) of Sun Microsystems, or the Performance Optimization With Enhanced RISC - Performance Computing (PowerPC) architecture, may also be employed in other embodiments of the invention. The instruction cache 102 provides the architectural instructions 103 to the instruction translator 104, which translates the architectural instructions 103 into microinstructions 105.
Microcommand 105 is provided to renaming unit 106 and is finally executed by execution unit 112/121.These microcommands 105
It can realize that framework instructs.For a preferred embodiment, instruction translator 104 include first part, to will frequently execute with
It and/or is that relatively uncomplicated framework instruction 103 translates to microcommand 105.This instruction translator 104 further includes second
Point, with microcode unit (not shown).There is microcode unit microcode memory to load micro-code instruction, to execute architecture instruction set
Middle complicated and/or instruction less.Microcode unit further includes that micro-sequencer (microsequencer) provides nand architecture microprogram
Counter (micro-PC) is to microcode memory.For a preferred embodiment, these microcommands (are not schemed via micro- transfer interpreter
Show) translate to microcommand 105.Whether selector is currently possessed of control power according to microcode unit, and selection is from first part or the
The microcommand 105 of two parts is provided to renaming unit 106.
The rename unit 106 renames the architectural registers specified by the architectural instructions 103 to physical registers of the processor 100. In a preferred embodiment, the processor 100 includes a reorder buffer (not shown). The rename unit 106 allocates entries of the reorder buffer to the microinstructions 105 in program order, which enables the processor 100 to retire the microinstructions 105, and their corresponding architectural instructions 103, in program order. In one embodiment, the media registers 118 are 256 bits wide and the general purpose registers 116 are 64 bits wide. In one embodiment, the media registers 118 are x86 media registers, such as Advanced Vector Extensions (AVX) registers.
In one embodiment, each entry of the reorder buffer includes storage for the result of the microinstruction 105. Additionally, the processor 100 includes an architectural register file that includes a physical register for each architectural register, such as the media registers 118, the general purpose registers 116, and other architectural registers. (In a preferred embodiment, because the media registers 118 and the general purpose registers 116 are of different sizes, for example, separate register files may be used for the two types of registers.) For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with the reorder buffer index of the newest older microinstruction 105 that writes the architectural register. When an execution unit 112/121 completes execution of a microinstruction 105, it writes the result into the reorder buffer entry of that microinstruction 105. When the microinstruction 105 is retired, a retire unit (not shown) writes the result from the microinstruction's reorder buffer field into the register of the physical register file associated with the architectural destination register specified by the retiring microinstruction 105.
In another embodiment, the processor 100 includes a physical register file that includes more physical registers than the number of architectural registers, but the processor 100 does not include an architectural register file, and the reorder buffer entries do not include result storage. (In a preferred embodiment, because the media registers 118 and the general purpose registers 116 are of different sizes, separate register files may be used for the two types of registers.) The processor 100 also includes a pointer table with an associated pointer for each architectural register. For each destination operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the destination operand field of the microinstruction 105 with a pointer to a free register in the physical register file. If no register in the physical register file is free, the rename unit 106 stalls the pipeline. For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with a pointer to the register in the physical register file assigned to the newest older microinstruction 105 that writes the architectural register. When an execution unit 112/121 completes execution of a microinstruction 105, it writes the result to the register of the physical register file pointed to by the destination operand field of the microinstruction 105. When the microinstruction 105 is retired, the retire unit copies the destination operand field value of the microinstruction 105 into the pointer of the pointer table associated with the architectural destination register specified by the retiring microinstruction 105.
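As a minimal, software-only sketch of the pointer-table renaming bookkeeping just described: the identifier names, register counts, the separate speculative mapping used to locate the newest older writer, and the omission of physical-register freeing are all assumptions made for illustration, not features of the disclosed hardware.
#include <stdbool.h>

enum { NUM_ARCH_REGS = 16, NUM_PHYS_REGS = 64 };

static int  arch_pointer[NUM_ARCH_REGS];   /* pointer table: architectural -> physical, updated at retire */
static int  spec_pointer[NUM_ARCH_REGS];   /* speculative mapping consulted while renaming (assumed detail) */
static bool phys_free[NUM_PHYS_REGS];      /* free list of physical registers */

/* Rename one microinstruction with one register source and one register destination.
   The source operand field receives the mapping of the newest older writer; the
   destination operand field receives a pointer to a free physical register.       */
static int rename_uop(int src_arch, int dst_arch, int *src_phys)
{
    *src_phys = spec_pointer[src_arch];
    for (int p = 0; p < NUM_PHYS_REGS; p++) {
        if (phys_free[p]) {
            phys_free[p] = false;
            spec_pointer[dst_arch] = p;    /* later microinstructions now read this register */
            return p;                      /* value placed in the destination operand field  */
        }
    }
    return -1;                             /* no free physical register: the pipeline would stall */
}

/* Retire: copy the destination operand field value of the retiring microinstruction
   into the pointer-table entry of its architectural destination register.          */
static void retire_uop(int dst_arch, int dst_phys)
{
    arch_pointer[dst_arch] = dst_phys;
}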
The reservation stations 108 hold microinstructions 105 until they are ready to be issued to an execution unit 112/121 for execution. A microinstruction 105 is ready to be issued when all of its source operands are available and an execution unit 112/121 is available to execute it. The execution units 112/121 receive register source operands from the reorder buffer or from the architectural register file described with respect to the first embodiment above, or from the physical register file described with respect to the second embodiment above. Additionally, the execution units 112/121 may receive register source operands directly via result forwarding buses (not shown). Additionally, the execution units 112/121 may receive from the reservation stations 108 immediate operands specified by the microinstructions 105. The MTNN and MFNN architectural instructions 103 include an immediate operand that specifies a function to be performed by the neural network unit 121, which is provided in the one or more microinstructions 105 into which the MTNN and MFNN architectural instructions 103 are translated, as described in more detail below.
The execution units 112 include one or more load/store units (not shown) that load data from the memory subsystem 114 and store data to the memory subsystem 114. In a preferred embodiment, the memory subsystem 114 includes a memory management unit (not shown), which may include, for example, translation lookaside buffers, a tablewalk unit, a level-1 data cache (alongside the instruction cache 102), a level-2 unified cache, and a bus interface unit serving as the interface between the processor 100 and system memory. In one embodiment, the processor 100 of Fig. 1 is representative of one of multiple processing cores of a multi-core processor that share a last-level cache. The execution units 112 may also include integer units, media units, floating-point units, and a branch unit.
The neural network unit 121 includes a weight random access memory (RAM) 124, a data random access memory 122, N neural processing units (NPUs) 126, a program memory 129, a sequencer 128, and control and status registers 127. The neural processing units 126 function conceptually as neurons of a neural network. The weight RAM 124, the data RAM 122, and the program memory 129 are each writable and readable by the MTNN and MFNN architectural instructions 103, respectively. The weight RAM 124 is arranged as W rows of N weight words each, and the data RAM 122 is arranged as D rows of N data words each. Each data word and each weight word is a plurality of bits, preferably 8 bits, 9 bits, 12 bits or 16 bits. Each data word serves as the output value of a neuron of the previous layer of the network (sometimes referred to as an activation), and each weight word serves as the weight associated with a connection coming into a neuron of the current layer of the network. Although in many uses of the neural network unit 121 the words, or operands, held in the weight RAM 124 are in fact weights associated with connections into a neuron, it should be noted that in some uses of the neural network unit 121 the words held in the weight RAM 124 are not weights, yet are still referred to as "weight words" because they are stored in the weight RAM 124. For example, in some uses of the neural network unit 121, such as the convolution example of Figs. 24 to 26A or the pooling example of Figs. 27 to 28, the weight RAM 124 may hold objects other than weights, such as the elements of a data matrix (e.g., image pixel data). Similarly, although in many uses of the neural network unit 121 the words, or operands, held in the data RAM 122 are essentially the output values, or activations, of neurons, it should be noted that in some uses of the neural network unit 121 the words held in the data RAM 122 are not, yet are still referred to as "data words" because they are stored in the data RAM 122. For example, in some uses of the neural network unit 121, such as the convolution example of Figs. 24 to 26A, the data RAM 122 may hold non-neuron outputs, such as the elements of a convolution kernel.
In one embodiment, the neural processing units 126 and the sequencer 128 comprise combinational logic, sequential logic, state machines, or a combination thereof. An architectural instruction (e.g., an MFNN instruction 1500) loads the contents of the status register 127 into one of the general purpose registers 116 to determine the status of the neural network unit 121, for example, whether the neural network unit 121 has completed a command or the running of a program from the program memory 129, or whether the neural network unit 121 is free to receive a new command or to start a new neural network unit program.
The number of neural processing units 126 may be increased as needed, and the width and depth of the weight RAM 124 and the data RAM 122 may be scaled accordingly. In a preferred embodiment, the weight RAM 124 is larger than the data RAM 122 because a typical neural network layer includes many connections, and therefore many weights, associated with each neuron, requiring the larger storage. Many embodiments of the sizes of the data and weight words, the sizes of the weight RAM 124 and the data RAM 122, and the number of neural processing units 126 are disclosed herein. In one embodiment, the neural network unit 121 has a 64KB data RAM 122 (8192 bits x 64 rows), a 2MB weight RAM 124 (8192 bits x 2048 rows), and 512 neural processing units 126. This neural network unit 121 is manufactured in a 16-nm process of Taiwan Semiconductor Manufacturing Company (TSMC) and occupies approximately 3.3 square millimeters.
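A minimal sketch of this embodiment's memory geometry, assuming 16-bit words and illustrative identifier names (the real memories are on-chip RAM arrays, not C arrays):
#include <stdint.h>

enum { N_NPUS = 512, DATA_ROWS = 64, WEIGHT_ROWS = 2048 };

typedef uint16_t word_t;                          /* one 16-bit data or weight word */

static word_t data_ram[DATA_ROWS][N_NPUS];        /* data RAM 122: D rows of N data words     */
static word_t weight_ram[WEIGHT_ROWS][N_NPUS];    /* weight RAM 124: W rows of N weight words */

/* 8192 bits x 64 rows = 64KB and 8192 bits x 2048 rows = 2MB, as stated above. */
_Static_assert(sizeof(data_ram)   == 64u * 1024u,        "data RAM is 64KB");
_Static_assert(sizeof(weight_ram) == 2u * 1024u * 1024u, "weight RAM is 2MB");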
The sequencer 128 fetches instructions from the program memory 129 and executes them, which includes generating address and control signals provided to the data RAM 122, the weight RAM 124, and the neural processing units 126. The sequencer 128 generates a memory address 123 and a read command provided to the data RAM 122 to select one of the D rows of N data words for provision to the N neural processing units 126. The sequencer 128 also generates a memory address 125 and a read command provided to the weight RAM 124 to select one of the W rows of N weight words for provision to the N neural processing units 126. The sequence of the addresses 123 and 125 that the sequencer 128 generates for provision to the neural processing units 126 determines the "connections" between neurons. The sequencer 128 also generates a memory address 123 and a write command provided to the data RAM 122 to select one of the D rows of N data words to be written by the N neural processing units 126, and generates a memory address 125 and a write command provided to the weight RAM 124 to select one of the W rows of N weight words to be written by the N neural processing units 126. The sequencer 128 also generates a memory address 131 to the program memory 129 to select a neural network unit instruction that is provided to the sequencer 128, as described in subsequent sections. The memory address 131 corresponds to a program counter (not shown) that the sequencer 128 generally increments through sequential locations of the program memory 129, unless the sequencer 128 encounters a control instruction, such as a loop instruction (see, e.g., Fig. 26A), in which case the sequencer 128 updates the program counter to the target address of the control instruction. The sequencer 128 also generates control signals to the neural processing units 126 instructing them to perform various operations or functions, such as initialization, arithmetic/logical operations, rotate/shift operations, activation functions, and write back operations, examples of which are described in more detail in subsequent sections (see, e.g., the micro-operations 3418 of Fig. 34).
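A hedged, software-level sketch of the control flow implemented by the sequencer 128 follows; the real sequencer is a hardware state machine, and the instruction fields, operation names, and loop-count handling below are assumptions made only for illustration.
/* Illustrative model of the sequencer 128 stepping through the program memory 129. */
typedef enum { OP_INITIALIZE, OP_MULT_ACCUM, OP_ACT_FUNC, OP_WRITE, OP_LOOP } nnu_op_t;

typedef struct {
    nnu_op_t op;
    int data_row;       /* address 123 presented to the data RAM 122 (read or write)   */
    int weight_row;     /* address 125 presented to the weight RAM 124 (read or write) */
    int target;         /* loop target address within the program memory 129           */
    int count;          /* remaining loop iterations (assumed encoding)                */
} nnu_insn_t;

static void run_nnu_program(nnu_insn_t *program_memory, int program_len)
{
    int pc = 0;                                   /* program counter / memory address 131  */
    while (pc < program_len) {
        nnu_insn_t *insn = &program_memory[pc];   /* fetch the instruction to execute      */
        if (insn->op == OP_LOOP && insn->count-- > 0) {
            pc = insn->target;                    /* control instruction: branch to target */
            continue;
        }
        /* Non-control instruction: drive the read/write addresses and broadcast the
           corresponding control signals (micro-operation) to all N NPUs 126; the
           datapath itself is not modeled in this sketch.                              */
        (void)insn->data_row;
        (void)insn->weight_row;
        pc++;                                     /* otherwise advance sequentially        */
    }
}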
The N neural processing units 126 generate N result words 133 that may be written back to a row of the weight RAM 124 or of the data RAM 122. In a preferred embodiment, the weight RAM 124 and the data RAM 122 are directly coupled to the N neural processing units 126. More specifically, the weight RAM 124 and the data RAM 122 are dedicated to the neural processing units 126 and are not shared by the other execution units 112 of the processor 100, so that the neural processing units 126 are able to consume a row from one or both of the weight RAM 124 and the data RAM 122 every clock cycle in a sustained manner, preferably in a pipelined fashion. In one embodiment, each of the data RAM 122 and the weight RAM 124 can provide 8192 bits to the neural processing units 126 in each clock cycle. The 8192 bits may be treated as 512 16-bit words or as 1024 8-bit words, as described in more detail below.
The size of the data set processed by the neural network unit 121 is not limited by the sizes of the weight random access memory 124 and the data random access memory 122, but only by the size of system memory, since data and weights can be moved between system memory and the weight random access memory 124 and the data random access memory 122 through the use of MTNN and MFNN instructions (for example, through the media caches 118). In one embodiment, the data random access memory 122 is dual-ported, enabling data texts to be written to the data random access memory 122 concurrently with data texts being read from, or written to, the data random access memory 122. Furthermore, the large memory hierarchy that includes the cache memory subsystem 114 provides very large data bandwidth for the transfers between system memory and the neural network unit 121. In addition, for a preferred embodiment, the memory subsystem 114 includes hardware data prefetchers that track memory access patterns, such as loads of neural data and weights from system memory, and perform data prefetches into the cache hierarchy to facilitate high-bandwidth, low-latency transfers into the weight random access memory 124 and the data random access memory 122.
Although in the embodiments described herein one of the operands provided to each neural processing unit 126 comes from a weight memory and is denoted a weight, a term commonly used in neural networks, it should be understood that these operands may be other types of data associated with computations whose speed can be improved by these devices.
Fig. 2 is a block diagram showing the neural processing unit 126 of Fig. 1. As shown, the neural processing unit 126 operates to perform many functions, or operations. In particular, the neural processing unit 126 is operable as a neuron, or node, in an artificial neural network to perform a typical multiply-accumulate function, or operation. That is, generally speaking, the neural processing unit 126 (neuron) is configured to: (1) receive an input value from each neuron having a connection to it, typically but not necessarily from the immediately preceding layer of the artificial neural network; (2) multiply each input value by the corresponding weight value associated with its connection to generate a product; (3) add all the products to generate a sum; and (4) perform a run function on the sum to generate the output of the neuron. However, rather than performing, in the traditional manner, all the multiplications associated with all the connection inputs and then summing all their products, each neuron of the present invention performs, in a given time-frequency period, the weight multiplication associated with one of the connection inputs and adds (accumulates) its product to the accumulated value of the products associated with the connection inputs processed in preceding time-frequency periods. Assuming there are M connections into the neuron, after all M products have been accumulated (which takes approximately M time-frequency periods), the neuron performs the run function on the accumulated value to generate the output, or result. This has the advantage of requiring fewer multipliers and a smaller, simpler and faster adder circuit in the neuron (e.g., a two-input adder) than an adder that would be needed to sum all, or even a subset of, the products associated with all the connection inputs. This, in turn, facilitates a very large number (N) of neurons (neural processing units 126) in the neural network unit 121, so that after approximately M time-frequency periods the neural network unit 121 has generated the outputs of all of this large number (N) of neurons. Finally, for large numbers of different connection inputs, the neural network unit 121 constructed from such neurons performs efficiently as an artificial neural network layer. That is, as M increases or decreases across different layers, the number of time-frequency periods needed to generate the neuron outputs correspondingly increases or decreases, and the resources (e.g., the multipliers and accumulator) remain fully utilized; by contrast, in a more traditional design some portion of the multipliers and adders goes unused for smaller values of M. Thus, the embodiments described herein have the benefit of both flexibility and efficiency with respect to the number of connection inputs of the neural network, and provide extremely high efficiency.
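As a minimal functional sketch of the time-multiplexed accumulation just described — one product per time-frequency period fed into a single two-input adder, with the run function applied once at the end — the following Python fragment (ordinary software arithmetic, not the hardware datapath) illustrates the idea:

```python
# Sketch of the accumulate-over-M-cycles scheme: one product per cycle into one accumulator.
def neuron_output(inputs, weights, activation):
    acc = 0
    for x, w in zip(inputs, weights):   # one product per time-frequency period, M in total
        acc += x * w                    # two-input adder: accumulator + product
    return activation(acc)              # run function applied once, after M accumulations

print(neuron_output([1, 2, 3], [0.5, -1.0, 2.0], activation=lambda s: max(0, s)))   # 4.5
```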
The neural processing unit 126 includes a buffer 205, a dual-input multitask buffer 208, an arithmetic logic unit (ALU) 204, an accumulator 202 and a run function unit (AFU) 212. The buffer 205 receives a weight text 206 from the weight random access memory 124 and provides its output 203 on a subsequent time-frequency period. The multitask buffer 208 selects one of its two inputs 207 and 211 to store in its buffer and provide on its output 209 on a subsequent time-frequency period. Input 207 receives a data text from the data random access memory 122. The other input 211 receives the output 209 of the adjacent neural processing unit 126. The neural processing unit 126 shown in Fig. 2 is denoted neural processing unit J among the N neural processing units of Fig. 1. That is, neural processing unit J is a representative instance of the N neural processing units 126. For a preferred embodiment, the input 211 of the multitask buffer 208 of instance J of the neural processing unit 126 receives the output 209 of the multitask buffer 208 of instance J-1 of the neural processing unit 126, and the output 209 of the multitask buffer 208 of neural processing unit J is provided to the input 211 of the multitask buffer 208 of instance J+1 of the neural processing unit 126. In this way, the multitask buffers 208 of the N neural processing units 126 collectively operate as an N-text rotator, or circular shifter, as described in more detail below with respect to Fig. 3. A control input 213 controls which of the two inputs the multitask buffer 208 selects to store in its buffer and subsequently provide on the output 209.
The arithmetic logic unit 204 has three inputs. One input receives the weight text 203 from the buffer 205. Another input receives the output 209 of the multitask buffer 208. The third input receives the output 217 of the accumulator 202. The arithmetic logic unit 204 performs arithmetic and/or logical operations on its inputs to generate a result provided on its output. For a preferred embodiment, the arithmetic and/or logical operations performed by the arithmetic logic unit 204 are specified by instructions stored in the program memory 129. For example, the multiply-accumulate instruction of Fig. 4 specifies a multiply-accumulate operation, that is, the result 215 is the sum of the accumulator 202 value 217 and the product of the weight text 203 and the data text of the multitask buffer 208 output 209. Other operations that may be specified include, but are not limited to: the result 215 is the passed-through value of the multitask buffer output 209; the result 215 is the passed-through value of the weight text 203; the result 215 is zero; the result 215 is the sum of the accumulator 202 value 217 and the weight text 203; the result 215 is the sum of the accumulator 202 value 217 and the multitask buffer output 209; the result 215 is the maximum of the accumulator 202 value 217 and the weight text 203; the result 215 is the maximum of the accumulator 202 value 217 and the multitask buffer output 209.
The arithmetic logic unit 204 provides its output 215 to the accumulator 202 for storage. The arithmetic logic unit 204 includes a multiplier 242 that multiplies the weight text 203 and the data text of the multitask buffer 208 output 209 to generate a product 246. In one embodiment, the multiplier 242 multiplies two 16-bit operands to generate a 32-bit result. The arithmetic logic unit 204 also includes an adder 244 that adds the product 246 to the output 217 of the accumulator 202 to generate a sum, which is the result 215 stored into the accumulator 202. In one embodiment, the adder 244 adds the 41-bit value 217 of the accumulator 202 to the 32-bit result of the multiplier 242 to generate a 41-bit result. In this way, using the rotator characteristic of the multitask buffer 208 over the course of multiple time-frequency periods, the neural processing unit 126 accomplishes the product accumulation of a neuron needed by a neural network. The arithmetic logic unit 204 may also include other circuit elements to perform other arithmetic/logical operations such as those described above. In one embodiment, a second adder subtracts the weight text 203 from the data text of the multitask buffer 208 output 209 to generate a difference, and the adder 244 then adds the difference to the output 217 of the accumulator 202 to generate a result 215, which is the accumulated result in the accumulator 202. In this way, over the course of multiple time-frequency periods, the neural processing unit 126 can accomplish a sum of differences. For a preferred embodiment, although the weight text 203 and the data text 209 are the same size (in bits), they may have different binary point locations, as described below. For a preferred embodiment, the multiplier 242 and the adder 244 are integer multipliers and adders; compared to an arithmetic logic unit that uses floating-point arithmetic, this arithmetic logic unit 204 has the advantages of lower complexity, smaller size, higher speed and lower power consumption. Nevertheless, in other embodiments of the present invention, the arithmetic logic unit 204 may perform floating-point operations.
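The following sketch models the integer datapath widths mentioned above (16-bit operands, a 32-bit product and a 41-bit accumulator) using explicit two's-complement wrapping; the wrapping behavior is an illustrative assumption for the sketch, since the hardware may condition the result differently (e.g., saturate).

```python
# Sketch of the integer multiply-accumulate step: 16x16 -> 32-bit product, 41-bit accumulator.
def to_signed(value, bits):
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mac_step(acc41, data16, weight16):
    product32 = to_signed(to_signed(data16, 16) * to_signed(weight16, 16), 32)  # multiplier 242
    return to_signed(acc41 + product32, 41)                                     # adder 244

acc = 0
for d, w in [(3, 7), (-2, 5), (100, -100)]:
    acc = mac_step(acc, d, w)
print(acc)   # 21 - 10 - 10000 = -9989
```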
Although Fig. 2 shows only a multiplier 242 and an adder 244 in the arithmetic logic unit 204, for a preferred embodiment the arithmetic logic unit 204 also includes other elements to perform the other operations described above. For example, the arithmetic logic unit 204 may include a comparator (not shown) that compares the accumulator 202 with a data/weight text, and a multiplexer (not shown) that selects the larger (maximum) of the two values indicated by the comparator for storage into the accumulator 202. In another example, the arithmetic logic unit 204 includes selection logic (not shown) that bypasses the multiplier 242 with the data/weight text so that the adder 244 adds the data/weight text to the accumulator 202 value 217 to generate a sum for storage into the accumulator 202. These additional operations are described in more detail in subsequent sections, such as with respect to Figures 18 through 29A, and they are also useful for performing, for example, convolution and pooling operations.
The run function unit 212 receives the output 217 of the accumulator 202. The run function unit 212 performs a run function on the accumulator 202 output to generate the result 133 of Fig. 1. Generally speaking, the run function in a neuron of an intermediate layer of an artificial neural network serves to normalize the accumulated sum of products, in particular in a non-linear fashion. To "normalize" the accumulated sum, the run function of the current neuron generates a resulting value within a range of values that the neurons connected to the current neuron expect to receive as input. (The normalized result is sometimes referred to as an "activation"; here, the activation is the output of the current node, which a receiving node multiplies by the weight associated with the connection between the output node and the receiving node to generate a product that is accumulated with the products associated with the other input connections of the receiving node.) For example, in the case where the receiving/connected neurons expect to receive values between 0 and 1 as input, the output neuron may need to non-linearly squash and/or adjust (e.g., shift upward to convert negative values to positive values) an accumulated sum that falls outside the 0-to-1 range so that it falls within the expected range. Thus, the operation the run function unit 212 performs on the accumulator 202 value 217 brings the result 133 into a known range. The results 133 of all N neural processing units 126 can be written back concurrently to the data random access memory 122 or to the weight random access memory 124. For a preferred embodiment, the run function unit 212 is configured to perform multiple run functions, and an input, for example from the control buffer 127, selects which of the run functions to perform on the accumulator 202 output 217. The run functions may include, but are not limited to, a step function, a rectify function, a sigmoid function, a hyperbolic tangent function and a softplus function (also referred to as a smooth rectify function). The analytic form of the softplus function is f(x) = ln(1 + e^x), that is, the natural logarithm of the sum of 1 and e^x, where "e" is Euler's number and x is the input 217 to the function. For a preferred embodiment, the run functions also include a pass-through function that passes the accumulator 202 value 217, or a portion thereof, through unchanged, as described below. In one embodiment, the circuitry of the run function unit 212 performs the run function within a single time-frequency period. In one embodiment, the run function unit 212 includes tables that receive the accumulated value and output a value that closely approximates, for certain run functions such as the sigmoid function, the hyperbolic tangent function and the softplus function, the value the true run function would provide.
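For reference, software equivalents of several of the run functions named above are sketched below; the hardware approximates some of these with tables rather than computing the exact formulas, so this is purely illustrative.

```python
import math

# Illustrative software versions of some run functions; hardware uses approximations.
def sigmoid(x):      return 1.0 / (1.0 + math.exp(-x))   # sigmoid function
def tanh_fn(x):      return math.tanh(x)                 # hyperbolic tangent function
def softplus(x):     return math.log1p(math.exp(x))      # ln(1 + e^x), smooth rectify
def rectify(x):      return max(0.0, x)                  # rectify function
def passthrough(x):  return x                            # pass the accumulator value through
```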
For a preferred embodiment, the width (in bits) of the accumulator 202 is greater than the width of the run function unit 212 output 133. For example, in one embodiment, the accumulator is 41 bits wide to avoid loss of precision when accumulating up to 512 32-bit products (as described in more detail below, for example with respect to Fig. 30), and the result 133 is 16 bits wide. In one embodiment, on subsequent time-frequency periods the run function unit 212 passes through other, unprocessed portions of the accumulator 202 output 217, and these portions are written back to the data random access memory 122 or the weight random access memory 124, as described in more detail below with respect to Fig. 8. This enables the unprocessed accumulator 202 values to be loaded back into the media caches 118 via MFNN instructions, so that instructions executed by the other execution units 112 of the processor 100 can perform complex run functions that the run function unit 212 cannot perform, such as the well-known softmax function, also referred to as the normalized exponential function. In one embodiment, the instruction set architecture of the processor 100 includes an instruction, commonly denoted e^x or exp(x), that performs the exponential function and that the other execution units 112 of the processor 100 can use to speed up the performance of the softmax run function.
In one embodiment, the neural processing unit 126 is pipelined. For example, the neural processing unit 126 may include buffers of the arithmetic logic unit 204, such as a buffer between the multiplier and the adder and/or between other circuits of the arithmetic logic unit 204, and the neural processing unit 126 may also include a buffer that holds the output of the run function unit 212. Other embodiments of the neural processing unit 126 are described below.
Fig. 3 is a block diagram illustrating the operation of the N multitask buffers 208 of the N neural processing units 126 of the neural network unit 121 of Fig. 1 as an N-text rotator, or circular shifter, for a column of data texts 207 obtained from the data random access memory 122 of Fig. 1. In the embodiment of Fig. 3, N is 512, so the neural network unit 121 has 512 multitask buffers 208, denoted 0 through 511, corresponding to the 512 neural processing units 126. Each multitask buffer 208 receives its corresponding data text 207 from one of the D columns of the data random access memory 122. That is, multitask buffer 0 receives data text 0 of the data random access memory 122 column, multitask buffer 1 receives data text 1 of the data random access memory 122 column, multitask buffer 2 receives data text 2 of the data random access memory 122 column, and so forth, to multitask buffer 511, which receives data text 511 of the data random access memory 122 column. In addition, multitask buffer 1 receives the output 209 of multitask buffer 0 on its other input 211, multitask buffer 2 receives the output 209 of multitask buffer 1 on its other input 211, multitask buffer 3 receives the output 209 of multitask buffer 2 on its other input 211, and so forth, to multitask buffer 511, which receives the output 209 of multitask buffer 510 on its other input 211, and multitask buffer 0 receives the output 209 of multitask buffer 511 on its other input 211. Each multitask buffer 208 receives the control input 213, which controls whether it selects the data text 207 or the rotated input 211. In one mode of operation, during a first time-frequency period, the control input 213 controls each multitask buffer 208 to select the data text 207 for storage in the buffer and subsequent provision to the arithmetic logic unit 204, and during subsequent time-frequency periods (e.g., the M-1 time-frequency periods described above), the control input 213 controls each multitask buffer 208 to select the rotated input 211 for storage in the buffer and subsequent provision to the arithmetic logic unit 204.
Although in the embodiments described with respect to Fig. 3 (and Figures 7 and 19 below) the neural processing units 126 rotate the values of the multitask buffers 208/705 to the right, that is, from neural processing unit J toward neural processing unit J+1, the present invention is not limited in this respect; in other embodiments (such as those corresponding to Figures 24 through 26), the neural processing units 126 rotate the values of the multitask buffers 208/705 to the left, that is, from neural processing unit J toward neural processing unit J-1. Furthermore, in other embodiments of the present invention, the neural processing units 126 may rotate the values of the multitask buffers 208/705 selectively to the left or to the right, for example as specified by the neural network unit instructions.
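The rotator behavior of the N multitask buffers can be illustrated with a small sketch (shown here with N = 4 for readability; the direction matches the rotate-right behavior of Fig. 3, where neural processing unit J takes the value held by unit J-1):

```python
# Sketch of the N-text rotator formed by the multitask buffers (control input 213
# selects between the data text 207 and the neighbor's output 211).
def rotate_layer(data_row, n_cycles):
    regs = list(data_row)            # first cycle: every multitask buffer selects data text 207
    yield regs
    for _ in range(n_cycles - 1):    # later cycles: every multitask buffer selects input 211
        regs = [regs[-1]] + regs[:-1]   # unit J receives unit J-1's value; unit 0 receives unit N-1's
        yield regs

for snapshot in rotate_layer([10, 11, 12, 13], 4):
    print(snapshot)
# [10, 11, 12, 13]
# [13, 10, 11, 12]
# [12, 13, 10, 11]
# [11, 12, 13, 10]
```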
Fig. 4 is a table showing a program stored in the program memory 129 of the neural network unit 121 of Fig. 1 and executed by the neural network unit 121. As described above, this example program performs the computations associated with one layer of an artificial neural network. The table of Fig. 4 shows a row for each instruction and three columns: each row corresponds to an address in the program memory 129, shown in the first column; the second column specifies the corresponding instruction; and the third column indicates the number of time-frequency periods associated with the instruction. For a preferred embodiment, the time-frequency period count indicates the effective number of time-frequency periods per instruction in the pipelined execution, rather than the latency of the instruction. As shown, because of the pipelined execution of the neural network unit 121, each instruction has an associated single time-frequency period, with the exception of the instruction at address 2, which effectively repeats itself 511 times and therefore requires 511 time-frequency periods, as described below.
The neural processing units 126 process each instruction of the program in parallel. That is, all N neural processing units 126 execute the instruction in the first row during the same time-frequency period(s), all N neural processing units 126 execute the instruction in the second row during the same time-frequency period(s), and so forth. However, the present invention is not limited in this respect; in other embodiments described below, some instructions are executed in a partially parallel, partially sequential fashion; for example, in embodiments in which multiple neural processing units 126 share a run function unit, such as the embodiment of Fig. 11, the run function and output instructions at addresses 3 and 4 are executed in this fashion. The example of Fig. 4 assumes a layer of 512 neurons (neural processing units 126), each with 512 connection inputs from the 512 neurons of the previous layer, for a total of 256K connections. Each neuron receives a 16-bit data value from each connection input and multiplies the 16-bit data value by an appropriate 16-bit weight value.
The first row, at address 0 (although other addresses may be specified), specifies an initialize neural processing unit instruction. The initialize instruction clears the accumulator 202 value to zero. In one embodiment, the initialize instruction can also load into the accumulator 202 the corresponding text of a column of the data random access memory 122 or the weight random access memory 124 specified by the instruction. The initialize instruction also loads configuration values into the control buffer 127, as described in more detail below with respect to Figures 29A and 29B. For example, the widths of the data text 207 and the weight text 206 may be loaded for use by the arithmetic logic unit 204 to determine the sizes of the operations performed by the circuits; this width also affects the result 215 stored in the accumulator 202. In one embodiment, the neural processing unit 126 includes a circuit that saturates the arithmetic logic unit 204 output 215 before it is stored into the accumulator 202, and the initialize instruction loads a configuration value into this circuit that affects the saturation. In one embodiment, the accumulator 202 can also be cleared to zero by so specifying in an arithmetic logic unit function instruction (such as the multiply-accumulate instruction at address 1) or in an output instruction (such as the write run function unit output instruction at address 4).
The second row, at address 1, specifies a multiply-accumulate instruction that instructs the 512 neural processing units 126 to load a corresponding data text from a column of the data random access memory 122 and a corresponding weight text from a column of the weight random access memory 124, and to perform a first multiply-accumulate operation on the data text input 207 and the weight text input 206, accumulated with the initialized accumulator 202 value of zero. More specifically, the instruction instructs the sequencer 128 to generate a value on the control input 213 that selects the data text input 207. In the example of Fig. 4, the specified data random access memory 122 column is column 17 and the specified weight random access memory 124 column is column 0, so the sequencer is instructed to output the value 17 as the data random access memory address 123 and the value 0 as the weight random access memory address 125. Consequently, the 512 data texts of column 17 of the data random access memory 122 are provided as the corresponding data inputs 207 of the 512 neural processing units 126, and the 512 weight texts of column 0 of the weight random access memory 124 are provided as the corresponding weight inputs 206 of the 512 neural processing units 126.
The third row, at address 2, specifies a multiply-accumulate rotate instruction with a count whose value is 511, which instructs the 512 neural processing units 126 to perform 511 multiply-accumulate operations. The instruction instructs the 512 neural processing units 126 that, for each of the 511 multiply-accumulate operations, the data text 209 input to the arithmetic logic unit 204 is to be the rotated value 211 from the adjacent neural processing unit 126. That is, the instruction instructs the sequencer 128 to generate a value on the control input 213 that selects the rotated value 211. In addition, the instruction instructs the 512 neural processing units 126 to load the corresponding weight value for each of the 511 multiply-accumulate operations from the "next" column of the weight random access memory 124. That is, the instruction instructs the sequencer 128 to increment the weight random access memory address 125 by one relative to its value in the previous time-frequency period, which in this example is column 1 on the first time-frequency period of the instruction, column 2 on the next time-frequency period, column 3 on the next, and so forth, to column 511 on the 511th time-frequency period. In each of the 511 multiply-accumulate operations, the product of the rotated input 211 and the weight text input 206 is accumulated with the previous value of the accumulator 202. The 512 neural processing units 126 perform the 511 multiply-accumulate operations in 511 time-frequency periods, in each of which each neural processing unit 126 performs a multiply-accumulate operation on a different data text of column 17 of the data random access memory 122 (namely, the data text on which the adjacent neural processing unit 126 operated in the previous time-frequency period) and on the different weight text associated with that data text, which is conceptually a different connection input of the neuron. The example assumes that each neural processing unit 126 (neuron) has 512 connection inputs, so 512 data texts and 512 weight texts are involved. Once the last iteration of the multiply-accumulate rotate instruction of row 2 has been performed, the accumulator 202 holds the sum of the products of all 512 connection inputs. In one embodiment, rather than having a separate instruction for each different type of arithmetic logic unit operation (such as the multiply-accumulate and the maximum of accumulator and weight described above), the instruction set of the neural processing unit 126 includes an "execute" instruction that instructs the arithmetic logic unit 204 to perform the arithmetic logic unit operation specified by the initialize neural processing unit instruction, such as the one specified in the arithmetic logic unit function 2926 of Fig. 29A.
The fourth row, at address 3, specifies a run function instruction. The run function instruction instructs the run function unit 212 to perform the specified run function on the accumulator 202 value to generate the result 133. The run functions of the embodiments are described in more detail below.
The fifth row, at address 4, specifies a write run function unit output instruction that instructs the 512 neural processing units 126 to write back their run function unit 212 outputs as results 133 to a column of the data random access memory 122, which in this example is column 16. That is, the instruction instructs the sequencer 128 to output the value 16 as the data random access memory address 123, along with a write command (in contrast to the read command of the multiply-accumulate instruction at address 1). For a preferred embodiment, owing to the nature of pipelined execution, the write run function unit output instruction can be executed overlapped with other instructions, so that the write run function unit output instruction effectively executes within a single time-frequency period.
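Putting the five instructions together, the following reduced-size functional model illustrates what the program of Fig. 4 computes; it is a behavioral sketch only, with N shrunk from 512 for readability and with the data/weight column numbers noted in comments rather than modeled.

```python
# Minimal functional model of the Fig. 4 program (the real program uses N = 512,
# data column 17, weight columns 0-511, and writes results to column 16).
def run_layer(data_row, weight_rows, activation):
    n = len(data_row)
    acc = [0] * n                       # address 0: initialize accumulators to zero
    regs = list(data_row)               # address 1: load data texts, first multiply-accumulate
    for j in range(n):
        acc[j] += regs[j] * weight_rows[0][j]
    for step in range(1, n):            # address 2: multiply-accumulate rotate, count = n - 1
        regs = [regs[-1]] + regs[:-1]   # rotate by one text toward unit J+1
        for j in range(n):
            acc[j] += regs[j] * weight_rows[step][j]
    return [activation(a) for a in acc] # addresses 3-4: run function, write results back

out = run_layer([1, 2, 3, 4],
                [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
                activation=lambda a: a)
print(out)   # [1, 1, 1, 1]
```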
For a preferred embodiment, each neural processing unit 126 operates as a pipeline having various functional elements, e.g., the multitask buffer 208 (and the multitask buffer 705 of Fig. 7), the arithmetic logic unit 204, the accumulator 202, the run function unit 212, the multiplexer 802 (see Fig. 8), the column buffer 1104 and the run function units 1112 (see Fig. 11), and so forth, some of which may themselves be pipelined. In addition to the data texts 207 and weight texts 206, the pipeline receives instructions from the program memory 129. These instructions flow down the pipeline and control the various functional units. In another embodiment, the program does not include a run function instruction; rather, the initialize neural processing unit instruction specifies the run function to be performed on the accumulator 202 value 217, and a value indicating the specified run function is saved in a configuration register for later use by the run function unit 212 portion of the pipeline once the final accumulator 202 value 217 has been generated, that is, after the last iteration of the multiply-accumulate rotate instruction at address 2 has completed. For a preferred embodiment, to save power, the run function unit 212 portion of the pipeline is inactive until the write run function unit output instruction reaches it, at which time the run function unit 212 is activated and performs the run function specified by the initialize instruction on the accumulator 202 output 217.
Fig. 5 is a timing diagram illustrating the execution of the program of Fig. 4 by the neural network unit 121. Each row of the timing diagram corresponds to a successive time-frequency period indicated in the first column. The other columns correspond to different ones of the 512 neural processing units 126 and indicate their operation. Only the operations of neural processing units 0, 1 and 511 are shown, to simplify the illustration.
During time-frequency period 0, each of the 512 neural processing units 126 performs the initialize instruction of Fig. 4, which is shown in Fig. 5 as the assignment of a zero value to the accumulator 202.
During time-frequency period 1, each of the 512 neural processing units 126 performs the multiply-accumulate instruction at address 1 of Fig. 4. As shown, neural processing unit 0 accumulates the accumulator 202 value (which is zero) with the product of text 0 of column 17 of the data random access memory 122 and text 0 of column 0 of the weight random access memory 124; neural processing unit 1 accumulates the accumulator 202 value (which is zero) with the product of text 1 of column 17 of the data random access memory 122 and text 1 of column 0 of the weight random access memory 124; and so forth, to neural processing unit 511, which accumulates the accumulator 202 value (which is zero) with the product of text 511 of column 17 of the data random access memory 122 and text 511 of column 0 of the weight random access memory 124.
During time-frequency period 2, each of the 512 neural processing units 126 performs the first iteration of the multiply-accumulate rotate instruction at address 2 of Fig. 4. As shown, neural processing unit 0 accumulates the accumulator 202 value with the product of the rotated data text 211 received from the multitask buffer 208 output 209 of neural processing unit 511 (i.e., data text 511 received from the data random access memory 122) and text 0 of column 1 of the weight random access memory 124; neural processing unit 1 accumulates the accumulator 202 value with the product of the rotated data text 211 received from the multitask buffer 208 output 209 of neural processing unit 0 (i.e., data text 0 received from the data random access memory 122) and text 1 of column 1 of the weight random access memory 124; and so forth, to neural processing unit 511, which accumulates the accumulator 202 value with the product of the rotated data text 211 received from the multitask buffer 208 output 209 of neural processing unit 510 (i.e., data text 510 received from the data random access memory 122) and text 511 of column 1 of the weight random access memory 124.

During time-frequency period 3, each of the 512 neural processing units 126 performs the second iteration of the multiply-accumulate rotate instruction at address 2 of Fig. 4. As shown, neural processing unit 0 accumulates the accumulator 202 value with the product of the rotated data text 211 received from the multitask buffer 208 output 209 of neural processing unit 511 (i.e., data text 510 received from the data random access memory 122) and text 0 of column 2 of the weight random access memory 124; neural processing unit 1 accumulates the accumulator 202 value with the product of the rotated data text 211 received from the multitask buffer 208 output 209 of neural processing unit 0 (i.e., data text 511 received from the data random access memory 122) and text 1 of column 2 of the weight random access memory 124; and so forth, to neural processing unit 511, which accumulates the accumulator 202 value with the product of the rotated data text 211 received from the multitask buffer 208 output 209 of neural processing unit 510 (i.e., data text 509 received from the data random access memory 122) and text 511 of column 2 of the weight random access memory 124. As the ellipsis of Fig. 5 indicates, the following 509 time-frequency periods proceed in the same fashion, until time-frequency period 512.
During time-frequency period 512, each of the 512 neural processing units 126 performs the 511th iteration of the multiply-accumulate rotate instruction at address 2 of Fig. 4. As shown, neural processing unit 0 accumulates the accumulator 202 value with the product of the rotated data text 211 received from the multitask buffer 208 output 209 of neural processing unit 511 (i.e., data text 1 received from the data random access memory 122) and text 0 of column 511 of the weight random access memory 124; neural processing unit 1 accumulates the accumulator 202 value with the product of the rotated data text 211 received from the multitask buffer 208 output 209 of neural processing unit 0 (i.e., data text 2 received from the data random access memory 122) and text 1 of column 511 of the weight random access memory 124; and so forth, to neural processing unit 511, which accumulates the accumulator 202 value with the product of the rotated data text 211 received from the multitask buffer 208 output 209 of neural processing unit 510 (i.e., data text 0 received from the data random access memory 122) and text 511 of column 511 of the weight random access memory 124. In one embodiment, multiple time-frequency periods are required to read the data texts and weight texts from the data random access memory 122 and the weight random access memory 124 to perform the multiply-accumulate instruction at address 1 of Fig. 4; however, the data random access memory 122, the weight random access memory 124 and the neural processing units 126 are pipelined such that, once the first multiply-accumulate operation has started (as shown during time-frequency period 1 of Fig. 5), the subsequent multiply-accumulate operations (as shown during time-frequency periods 2 through 512 of Fig. 5) start in successive time-frequency periods. For a preferred embodiment, the neural processing units 126 may briefly stall in response to an access of the data random access memory 122 and/or the weight random access memory 124 by an architectural instruction, such as an MTNN or MFNN instruction (described below with respect to Figures 14 and 15), or by a microinstruction into which such an architectural instruction is translated.
During time-frequency period 513, the run function unit 212 of each of the 512 neural processing units 126 performs the run function instruction at address 3 of Fig. 4. Finally, during time-frequency period 514, each of the 512 neural processing units 126 writes back its result 133 to its corresponding text of column 16 of the data random access memory 122 to perform the write run function unit output instruction at address 4 of Fig. 4; that is, the result 133 of neural processing unit 0 is written to text 0 of the data random access memory 122, the result 133 of neural processing unit 1 is written to text 1 of the data random access memory 122, and so forth, to the result 133 of neural processing unit 511, which is written to text 511 of the data random access memory 122. A block diagram corresponding to the operation of Fig. 5 described above is shown in Fig. 6A.
Fig. 6A is a block diagram illustrating the neural network unit 121 of Fig. 1 executing the program of Fig. 4. The neural network unit 121 includes the 512 neural processing units 126, the data random access memory 122 that receives its address input 123, and the weight random access memory 124 that receives its address input 125. During time-frequency period 0, the 512 neural processing units 126 perform the initialize instruction; this operation is not shown in the figure. As shown, during time-frequency period 1, the 512 16-bit data texts of column 17 are read out of the data random access memory 122 and provided to the 512 neural processing units 126. During time-frequency periods 1 through 512, the 512 16-bit weight texts of columns 0 through 511, respectively, are read out of the weight random access memory 124 and provided to the 512 neural processing units 126. During time-frequency period 1, the 512 neural processing units 126 perform their respective multiply-accumulate operations on the loaded data texts and weight texts; this operation is not shown in the figure. During time-frequency periods 2 through 512, the multitask buffers 208 of the 512 neural processing units 126 operate as a rotator of 512 16-bit texts to rotate the previously loaded data texts of column 17 of the data random access memory 122 to the adjacent neural processing units 126, and the neural processing units 126 perform the multiply-accumulate operation on the respective rotated data text and the respective weight text loaded from the weight random access memory 124. During time-frequency period 513, the 512 run function units 212 perform the run function instruction; this operation is not shown in the figure. During time-frequency period 514, the 512 neural processing units 126 write back their respective 512 16-bit results 133 to column 16 of the data random access memory 122.
As may be observed, the number of time-frequency periods required to generate the result texts (neuron outputs) and write them back to the data random access memory 122 or the weight random access memory 124 is approximately the square root of the number of data inputs (connections) received by the current layer of the neural network. For example, if the current layer has 512 neurons each with 512 connections from the previous layer, the total number of connections is 256K, and the number of time-frequency periods required to generate the results of the current layer is slightly more than 512. Thus, the neural network unit 121 provides extremely high efficiency for neural network computations.
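A quick back-of-the-envelope check of the square-root relationship for the example above (the overhead count of 3 is an illustrative assumption covering the initialize, run function and output instructions):

```python
import math

neurons = 512
connections_per_neuron = 512
total_connections = neurons * connections_per_neuron    # 256K
mac_periods = connections_per_neuron                    # one rotated multiply-accumulate per period
overhead = 3                                            # assumed: initialize, run function, write-back
print(total_connections, math.isqrt(total_connections), mac_periods + overhead)
# 262144 512 515  -> slightly more than the square root of the connection count
```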
Fig. 6B is a flowchart illustrating the operation of the processor 100 of Fig. 1 executing an architectural program that uses the neural network unit 121 to perform the typical multiply-accumulate-run-function computations associated with the neurons of hidden layers of an artificial neural network, such as is performed by the program of Fig. 4. The example of Fig. 6B assumes four hidden layers (indicated by the variable NUM_LAYERS initialized at step 602), each having 512 neurons, with each neuron fully connected to the 512 neurons of the previous layer (via the program of Fig. 4). However, it should be understood that these numbers of layers and neurons were chosen to illustrate the invention; the neural network unit 121 can apply similar computations to embodiments with a different number of hidden layers, a different number of neurons per layer, or layers that are not fully connected. In one embodiment, the weight values of neurons that do not exist in a layer, or of connections to neurons that do not exist, are set to zero. For a preferred embodiment, the architectural program writes a first set of weights into the weight random access memory 124 and starts the neural network unit 121, and while the neural network unit 121 is performing the computations associated with the first layer, the architectural program writes a second set of weights into the weight random access memory 124, so that as soon as the neural network unit 121 completes the computations of the first hidden layer it can begin the computations of the second layer. In this way, the architectural program alternates between the two regions of the weight random access memory 124 in order to keep the neural network unit 121 fully utilized. The process begins at step 602.
At step 602, the processor 100 executing the architectural program writes the input values of the current hidden layer of neurons into the data random access memory 122, that is, into column 17 of the data random access memory 122, as described with respect to Fig. 6A. Alternatively, these values may already reside in column 17 of the data random access memory 122 as the results 133 of the operation of the neural network unit 121 for a previous layer (e.g., a convolution, pooling or input layer). Additionally, the architectural program initializes a variable N to the value 1. The variable N denotes the current layer of the hidden layers being processed by the neural network unit 121. Additionally, the architectural program initializes the variable NUM_LAYERS to the value 4, since there are four hidden layers in this example. The flow then proceeds to step 604.
At step 604, the processor 100 writes the weight texts for layer 1 into the weight random access memory 124, e.g., into columns 0 through 511 as shown in Fig. 6A. The flow then proceeds to step 606.

At step 606, the processor 100 writes the multiply-accumulate-run-function program (as shown in Fig. 4) into the program memory 129 of the neural network unit 121 using MTNN instructions 1400 whose function 1432 specifies writing the program memory 129. The processor 100 then starts the neural network unit program using an MTNN instruction 1400 whose function 1432 specifies starting execution of the program. The flow then proceeds to step 608.
At decision step 608, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, the flow proceeds to step 612; otherwise it proceeds to step 614.
At step 612, the processor 100 writes the weight texts for layer N+1 into the weight random access memory 124, e.g., into columns 512 through 1023. Thus, the architectural program writes the weight texts for the next layer into the weight random access memory 124 while the neural network unit 121 is performing the hidden-layer computations for the current layer, so that the neural network unit 121 can immediately begin the hidden-layer computations for the next layer once the computations for the current layer are complete, that is, once they have been written to the data random access memory 122. The flow then proceeds to step 614.
At step 614, the processor 100 determines whether the currently running neural network unit program (started at step 606 in the case of layer 1, and at step 618 in the case of layers 2 through 4) has completed. For a preferred embodiment, the processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of the neural network unit 121. In an alternative embodiment, the neural network unit 121 generates an interrupt to indicate that it has completed the multiply-accumulate-run-function layer program. The flow then proceeds to decision step 616.
At decision step 616, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, the flow proceeds to step 618; otherwise it proceeds to step 622.
At step 618, the processor 100 updates the multiply-accumulate-run-function program so that it can perform the hidden-layer computations for layer N+1. More specifically, the processor 100 updates the data random access memory 122 column value of the multiply-accumulate instruction at address 1 of Fig. 4 to the column of the data random access memory 122 to which the previous layer wrote its results (e.g., to column 16) and also updates the output column (e.g., to column 15). The processor 100 then starts the updated neural network unit program. Alternatively, in another embodiment, the program of Fig. 4 specifies for the output instruction at address 4 the same column as the column specified by the multiply-accumulate instruction at address 1 (i.e., the column read from the data random access memory 122). In that embodiment, the current column of input data texts is overwritten, which is acceptable as long as that column of data texts is not needed for some other purpose, because the column of data texts has already been read into the multitask buffers 208 and is being rotated among the neural processing units 126 via the N-text rotator. In that case, no update of the neural network unit program is needed at step 618, and it is merely restarted. The flow then proceeds to step 622.
At step 622, the processor 100 reads the results of the neural network unit program for layer N from the data random access memory 122. However, if the results are simply to be used by the next layer, the architectural program need not read them from the data random access memory 122; instead, they can remain in the data random access memory 122 for the computations of the next hidden layer. The flow then proceeds to step 624.
At decision step 624, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, the flow proceeds to step 626; otherwise the process ends.

At step 626, the architectural program increments N by one. The flow then returns to decision step 608.
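The flow of Fig. 6B can be summarized with the following sketch of the architectural program's loop; the callable parameters are hypothetical stand-ins for the MTNN/MFNN instruction sequences of the corresponding steps, not actual library functions.

```python
NUM_LAYERS = 4

def run_hidden_layers(write_inputs, write_weights, start_layer, wait_done, read_results):
    """Sketch of the Fig. 6B loop; each argument is a callable standing in for the
    MTNN/MFNN sequence of the corresponding step (hypothetical, for illustration)."""
    write_inputs()                        # step 602: input values -> data RAM column 17
    write_weights(layer=1, region=0)      # step 604: layer-1 weights -> weight RAM columns 0-511
    start_layer(1)                        # step 606: write program memory and start the NNU
    n = 1
    while True:
        if n < NUM_LAYERS:                # steps 608/612: stage next layer's weights into the
            write_weights(layer=n + 1, region=n % 2)   # other weight RAM region while layer n runs
        wait_done()                       # step 614: poll the NNU status register
        if n < NUM_LAYERS:                # steps 616/618: retarget data RAM columns and restart
            start_layer(n + 1)
        results = read_results(layer=n)   # step 622: read layer-n results if the program needs them
        if n == NUM_LAYERS:               # steps 624/626
            return results
        n += 1
```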
As may be seen in the example of Fig. 6B, approximately once every 512 time-frequency periods the neural processing units 126 perform one read of and one write to the data random access memory 122 (by virtue of the operation of the neural network unit program of Fig. 4). Additionally, the neural processing units 126 read the weight random access memory 124 approximately every time-frequency period to read a column of weight texts. Thus, the entire bandwidth of the weight random access memory 124 is consumed by the hybrid fashion in which the neural network unit 121 performs the hidden-layer operation. Furthermore, assuming an embodiment that includes a write and read buffer such as the buffer 1704 of Fig. 17, concurrently with the reads by the neural processing units 126 the processor 100 writes the weight random access memory 124, such that the buffer 1704 performs a write to the weight random access memory 124 approximately every 16 time-frequency periods to write the weight texts. Thus, in an embodiment in which the weight random access memory 124 is single-ported (as described with respect to Fig. 17), approximately every 16 time-frequency periods the neural processing units 126 must temporarily suspend their reads of the weight random access memory 124 to enable the buffer 1704 to write the weight random access memory 124. However, in an embodiment in which the weight random access memory 124 is dual-ported, the neural processing units 126 need not be stalled.
Fig. 7 is a block diagram showing another embodiment of the neural processing unit 126 of Fig. 1. The neural processing unit 126 of Fig. 7 is similar to the neural processing unit 126 of Fig. 2. However, the neural processing unit 126 of Fig. 7 additionally includes a second dual-input multitask buffer 705. The multitask buffer 705 selects one of its inputs 206 or 711 to store in its buffer and provide on its output 203 on a subsequent time-frequency period. Input 206 receives the weight text from the weight random access memory 124. The other input 711 receives the output 203 of the second multitask buffer 705 of the adjacent neural processing unit 126. For a preferred embodiment, the input 711 of neural processing unit J receives the output 203 of the multitask buffer 705 of the neural processing unit 126 at position J-1, and the output 203 of neural processing unit J is provided to the input 711 of the multitask buffer 705 of the neural processing unit 126 at position J+1. In this way, the multitask buffers 705 of the N neural processing units 126 collectively operate as an N-text rotator, operating in a manner similar to that described above with respect to Fig. 3, but for the weight texts rather than the data texts. A control input 213 controls which of the two inputs the multitask buffer 705 selects to store in its buffer and subsequently provide on the output 203.
By using the multitask buffers 208 and/or the multitask buffers 705 (as well as the multitask buffers of the other embodiments shown in Figures 18 and 23) to effectively form a large rotator that rotates a column of data/weights coming from the data random access memory 122 and/or the weight random access memory 124, the neural network unit 121 avoids the need for a very large multiplexer between the data random access memory 122 and/or the weight random access memory 124 and the neural processing units 126 to provide the needed data/weight texts to the appropriate neural processing unit.
Writing back accumulator values in addition to run function results
For some applications, it is useful for the processor 100 to receive back (e.g., into the media caches 118 via MFNN instructions of Fig. 15) the raw accumulator 202 values 217 so that instructions executed by the other execution units 112 can perform computations on them. For example, in one embodiment, the run function unit 212 is not configured to perform a softmax run function, in order to reduce the complexity of the run function unit 212. Instead, the neural network unit 121 may output the raw accumulator 202 values 217, or a subset thereof, to the data random access memory 122 or the weight random access memory 124, from which the architectural program subsequently reads them and performs computations on the raw values. However, the use of the raw accumulator 202 values 217 is not limited to the performance of softmax; other uses are contemplated by the present invention.
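As an illustration of the softmax example above, an architectural program that has read back the raw accumulator values could compute softmax in the usual numerically stable way; the sketch below is ordinary software, not a description of the run function unit.

```python
import math

# Softmax applied by software to raw accumulator values that were written back
# (and read out, e.g., via MFNN instructions); numerically stable form, illustrative only.
def softmax(raw_accumulator_values):
    m = max(raw_accumulator_values)
    exps = [math.exp(v - m) for v in raw_accumulator_values]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # ~[0.659, 0.242, 0.099]
```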
Fig. 8 is a block diagram showing yet another embodiment of the neural processing unit 126 of Fig. 1. The neural processing unit 126 of Fig. 8 is similar to the neural processing unit 126 of Fig. 2. However, the neural processing unit 126 of Fig. 8 includes a multiplexer 802 in the run function unit 212, and the run function unit 212 has a control input 803. The width (in bits) of the accumulator 202 is greater than the width of a data text. The multiplexer 802 has multiple inputs that receive data-text-wide portions of the accumulator 202 output 217. In one embodiment, the width of the accumulator 202 is 41 bits and the neural processing unit 126 is configured to output a 16-bit result text 133; thus, for example, the multiplexer 802 (or the multiplexer 3032 and/or the multiplexer 3037 of Fig. 30) has three inputs that receive bits [15:0], bits [31:16] and bits [47:32], respectively, of the accumulator 202 output 217. For a preferred embodiment, the output bits not provided by the accumulator 202 (e.g., bits [47:41]) are forced to zero.

The sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the texts (e.g., 16 bits) of the accumulator 202 in response to a write accumulator instruction, such as the write accumulator instructions at addresses 3 through 5 of Fig. 9 described below. For a preferred embodiment, the multiplexer 802 also has one or more inputs that receive the outputs of run function circuits (e.g., the elements 3022, 3024, 3026, 3018, 3014 and 3016 of Fig. 30), which generate outputs that are one data text wide. The sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the run function circuit outputs, rather than one of the texts of the accumulator 202, in response to an instruction such as the write run function unit output instruction at address 4 of Fig. 4.
Fig. 9 is a table showing a program stored in the program memory 129 of the neural network unit 121 of Fig. 1 and executed by the neural network unit 121. The example program of Fig. 9 is similar to the program of Fig. 4. In particular, the instructions at addresses 0 through 2 are identical. However, the instructions at addresses 3 and 4 of Fig. 4 are replaced in Fig. 9 by write accumulator instructions that instruct the 512 neural processing units 126 to write back their accumulator 202 outputs 217 as results 133 to three columns of the data random access memory 122, which in this example are columns 16 through 18. That is, the write accumulator instruction instructs the sequencer 128 to output a data random access memory address 123 value of 16 and a write command on the first time-frequency period, a data random access memory address 123 value of 17 and a write command on the second time-frequency period, and a data random access memory address 123 value of 18 and a write command on the third time-frequency period. For a preferred embodiment, the execution of the write accumulator instruction may be overlapped with the execution of other instructions, so that the write accumulator instruction effectively executes within the three time-frequency periods, in each of which one column of the data random access memory 122 is written. In one embodiment, the user specifies values of the run function 2934 and output command 2956 fields of the control buffer 127 (of Fig. 29A) to write the desired portions of the accumulator 202 to the data random access memory 122 or the weight random access memory 124. Alternatively, rather than writing back the entire contents of the accumulator 202, the write accumulator instruction may optionally write back a subset of the accumulator 202. In one embodiment, a canonical form of the accumulator 202 may be written back, as described in more detail below in the sections corresponding to Figures 29 through 31.
Figure 10 is a timing diagram showing the neural network unit 121 executing the program of Fig. 9. The timing diagram of Figure 10 is similar to the timing diagram of Fig. 5, and time-frequency periods 0 to 512 are identical. However, in time-frequency periods 513-515, the run function unit 212 of each of the 512 neural processing units 126 executes one of the write accumulator instructions at addresses 3 to 5 of Fig. 9. Specifically, in time-frequency period 513, each of the 512 neural processing units 126 writes back bits [15:0] of its accumulator 202 output 217 as its result 133 to the corresponding text of column 16 of the data random access memory 122; in time-frequency period 514, each of the 512 neural processing units 126 writes back bits [31:16] of its accumulator 202 output 217 as its result 133 to the corresponding text of column 17 of the data random access memory 122; and in time-frequency period 515, each of the 512 neural processing units 126 writes back bits [47:32] of its accumulator 202 output 217 as its result 133 to the corresponding text of column 18 of the data random access memory 122. For a preferred embodiment, bits [47:41] are forced to zero.
Shared run function unit
Figure 11 is a block schematic diagram showing an embodiment of the neural network unit 121 of Fig. 1. In the embodiment of Figure 11, a neuron is divided into two parts, a run function unit part and an arithmetic logic unit part (which also includes the shift register part), and each run function unit part is shared by multiple arithmetic logic unit parts. In Figure 11, the arithmetic logic unit part refers to a neural processing unit 126, and the shared run function unit part refers to a run function unit 1112. In contrast, in embodiments such as Fig. 2, each neuron includes its own run function unit 212. Accordingly, in one example of the Figure 11 embodiment, a neural processing unit 126 (the arithmetic logic unit part) may include the accumulator 202, arithmetic logic unit 204, multitask buffer 208 and buffer 205 of Fig. 2, but not the run function unit 212. In the embodiment of Figure 11, the neural network unit 121 includes 512 neural processing units 126, but the present invention is not limited thereto. In the example of Figure 11, the 512 neural processing units 126 are divided into 64 groups, denoted group 0 to 63 in Figure 11, and each group has eight neural processing units 126.
The neural network unit 121 further includes a column buffer 1104 and a plurality of shared run function units 1112 coupled between the neural processing units 126 and the column buffer 1104. The width (in bits) of the column buffer 1104 is the same as a column of the data random access memory 122 or weight random access memory 124, e.g., 512 texts. There is one run function unit 1112 per group of neural processing units 126, that is, each run function unit 1112 corresponds to a group of neural processing units 126; thus, in the embodiment of Figure 11 there are 64 run function units 1112 corresponding to the 64 groups of neural processing units 126. The eight neural processing units 126 of a group share the run function unit 1112 corresponding to that group. The present invention may also be applied to embodiments with a different number of run function units and a different number of neural processing units in each group. For example, the present invention may also be applied to embodiments in which two, four or sixteen neural processing units 126 in each group share the same run function unit 1112.
Sharing the run function units 1112 helps reduce the size of the neural network unit 121. The size reduction comes at some cost in performance. That is, depending on the sharing ratio, additional time-frequency periods may be needed to generate the results 133 for the entire array of neural processing units 126; for example, with an 8:1 sharing ratio as shown in Figure 12 below, seven additional time-frequency periods are needed. However, generally speaking, the additional number of time-frequency periods (e.g., 7) is quite small compared to the number of time-frequency periods needed to generate the accumulated sums (for example, 512 time-frequency periods for a layer with 512 connections per neuron). Therefore, the impact of sharing the run function units on performance is very small (e.g., approximately one percent additional computation time), and it can be a worthwhile cost in exchange for the reduced size of the neural network unit 121.
In one embodiment, each neural processing unit 126 includes a run function unit 212 that performs relatively simple run functions; these simple run function units 212 have a small size and can be included within each neural processing unit 126. Conversely, the shared complex run function units 1112 perform relatively complex run functions, and their size is significantly larger than that of the simple run function units 212. In this embodiment, the additional time-frequency periods are only needed when a complex run function is specified that must be performed by the shared complex run function unit 1112; when the specified run function can be performed by the simple run function unit 212, the additional time-frequency periods are not needed.
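A minimal sketch (in Python, with assumed function sets and names) of the dispatch decision this paragraph describes: a simple run function is handled locally by the per-unit run function unit 212 with no extra time-frequency periods, while a complex run function goes to the shared run function unit 1112 and costs extra time-frequency periods.

    import math

    SIMPLE = {"passthrough", "relu"}      # assumed simple run functions (unit 212)
    COMPLEX = {"sigmoid", "tanh"}         # assumed complex run functions (unit 1112)

    def run_function(name, acc):
        if name == "passthrough":
            return acc
        if name == "relu":
            return max(0, acc)
        if name == "sigmoid":
            return 1.0 / (1.0 + math.exp(-acc))
        if name == "tanh":
            return math.tanh(acc)
        raise ValueError(name)

    def extra_clocks(name, group_size=8):
        # a complex function is serviced by the shared unit 1112, so the last of the
        # group_size neural processing units waits group_size - 1 extra time-frequency periods
        return 0 if name in SIMPLE else group_size - 1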
Figures 12 and 13 are timing diagrams showing the neural network unit 121 of Figure 11 executing the program of Fig. 4. The timing diagram of Figure 12 is similar to the timing diagram of Fig. 5, and time-frequency periods 0 to 512 are the same. However, the operations in time-frequency period 513 differ, because the neural processing units 126 of Figure 11 share the run function units 1112; that is, the neural processing units 126 of the same group share the run function unit 1112 associated with that group, and Figure 11 shows this shared structure.
Each column of the timing diagram of Figure 13 corresponds to a successive time-frequency period shown in the first row. The other rows each correspond to a different one of the 64 run function units 1112 and indicate its operation. Only the operations of run function units 0, 1 and 63 are shown to simplify the illustration. The time-frequency periods of Figure 13 correspond to those of Figure 12, but show in a different manner the sharing of the run function units 1112 by the neural processing units 126. As shown in Figure 13, in time-frequency periods 0 to 512, the 64 run function units 1112 are inactive while the neural processing units 126 execute the initialize neural processing unit, multiply-accumulate, and multiply-accumulate rotate instructions.
As shown in Figures 12 and 13, in time-frequency period 513, run function unit 0 (the run function unit 1112 associated with group 0) begins to perform the specified run function on the accumulator 202 value 217 of neural processing unit 0, which is the first neural processing unit 126 in group 0, and the output of the run function unit 1112 will be stored to text 0 of the column buffer 1104. Also in time-frequency period 513, each run function unit 1112 begins to perform the specified run function on the accumulator 202 value 217 of the first neural processing unit 126 in its corresponding group of neural processing units 126. Thus, as shown in Figure 13, in time-frequency period 513, run function unit 0 begins to perform the specified run function on the accumulator 202 of neural processing unit 0 to generate the result that will be stored to text 0 of the column buffer 1104; run function unit 1 begins to perform the specified run function on the accumulator 202 of neural processing unit 8 to generate the result that will be stored to text 8 of the column buffer 1104; and so on, run function unit 63 begins to perform the specified run function on the accumulator 202 of neural processing unit 504 to generate the result that will be stored to text 504 of the column buffer 1104.
In time-frequency period 514, run function unit 0 (the run function unit 1112 associated with group 0) begins to perform the specified run function on the accumulator 202 value 217 of neural processing unit 1, which is the second neural processing unit 126 in group 0, and the output of the run function unit 1112 will be stored to text 1 of the column buffer 1104. Also in time-frequency period 514, each run function unit 1112 begins to perform the specified run function on the accumulator 202 value 217 of the second neural processing unit 126 in its corresponding group of neural processing units 126. Thus, as shown in Figure 13, in time-frequency period 514, run function unit 0 begins to perform the specified run function on the accumulator 202 of neural processing unit 1 to generate the result that will be stored to text 1 of the column buffer 1104; run function unit 1 begins to perform the specified run function on the accumulator 202 of neural processing unit 9 to generate the result that will be stored to text 9 of the column buffer 1104; and so on, run function unit 63 begins to perform the specified run function on the accumulator 202 of neural processing unit 505 to generate the result that will be stored to text 505 of the column buffer 1104. This processing continues until time-frequency period 520, in which run function unit 0 (the run function unit 1112 associated with group 0) begins to perform the specified run function on the accumulator 202 value 217 of neural processing unit 7, which is the eighth (and last) neural processing unit 126 in group 0, and the output of the run function unit 1112 will be stored to text 7 of the column buffer 1104. Also in time-frequency period 520, each run function unit 1112 begins to perform the specified run function on the accumulator 202 value 217 of the eighth neural processing unit 126 in its corresponding group of neural processing units 126. Thus, as shown in Figure 13, in time-frequency period 520, run function unit 0 begins to perform the specified run function on the accumulator 202 of neural processing unit 7 to generate the result that will be stored to text 7 of the column buffer 1104; run function unit 1 begins to perform the specified run function on the accumulator 202 of neural processing unit 15 to generate the result that will be stored to text 15 of the column buffer 1104; and so on, run function unit 63 begins to perform the specified run function on the accumulator 202 of neural processing unit 511 to generate the result that will be stored to text 511 of the column buffer 1104.
In time-frequency period 521, once all 512 results of the 512 neural processing units 126 have been generated and written to the column buffer 1104, the column buffer 1104 begins to write its contents to the data random access memory 122 or weight random access memory 124. In this way, the run function unit 1112 of each group of neural processing units 126 performs a portion of the run function instruction at address 3 of Fig. 4.
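A minimal Python sketch of the time-multiplexing described above, under the assumptions of 512 neural processing units, 64 groups of eight, and one shared run function unit per group; the names are illustrative, not taken from the patent.

    NUM_NPUS = 512
    GROUP_SIZE = 8
    NUM_GROUPS = NUM_NPUS // GROUP_SIZE        # 64 shared run function units

    def shared_afu_schedule(first_clock=513):
        """Yield (clock, run_function_unit, npu, column_buffer_text) tuples."""
        for step in range(GROUP_SIZE):          # time-frequency periods 513..520
            clock = first_clock + step
            for afu in range(NUM_GROUPS):       # all 64 units work in parallel
                npu = afu * GROUP_SIZE + step   # e.g. unit 1 at step 0 serves NPU 8
                yield clock, afu, npu, npu      # the result lands in column buffer text npu
        # at first_clock + GROUP_SIZE (period 521) the column buffer is written to the RAM

For example, at time-frequency period 514 (step 1) this yields run function unit 0 serving neural processing unit 1 and run function unit 63 serving neural processing unit 505, matching the description above.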
Embodiments that share a run function unit 1112 among a group of arithmetic logic units 204, as shown in Figure 11, are particularly helpful when used in conjunction with integer arithmetic logic units 204, as described in more detail below with respect to Figures 29A through 33.
MTNN and MFNN framework instructions
Figure 14 is a block schematic diagram showing a move to neural network (MTNN) framework instruction 1400 and its operation with respect to portions of the neural network unit 121 of Fig. 1. The MTNN instruction 1400 includes an execute code field 1402, a src1 field 1404, a src2 field 1406, a gpr field 1408 and an immediate field 1412. The MTNN instruction is a framework instruction, i.e., it is included in the instruction set architecture of the processor 100. For a preferred embodiment, the instruction set architecture uses a predetermined value of the execute code field 1402 to distinguish the MTNN instruction 1400 from the other instructions in the instruction set architecture. The execute code 1402 of the MTNN instruction 1400 may or may not include a prefix, such as is common in the x86 architecture.
The immediate field 1412 provides a value that specifies a function 1432 to the control logic 1434 of the neural network unit 121. For a preferred embodiment, this function 1432 is provided as an immediate operand of the microcommands 105 of Fig. 1. The functions 1432 that can be performed by the neural network unit 121 include, but are not limited to, writing the data random access memory 122, writing the weight random access memory 124, writing the program memory 129, writing the control buffer 127, starting execution of a program in the program memory 129, pausing execution of a program in the program memory 129, requesting notification (e.g., an interrupt) of completion of a program in the program memory 129, and resetting the neural network unit 121. For a preferred embodiment, the neural network unit instruction set includes an instruction whose result indicates that the neural network unit program is complete. Alternatively, the neural network unit instruction set includes an instruction that explicitly generates an interrupt. For a preferred embodiment, resetting the neural network unit 121 effectively forces the neural network unit 121 back to a reset state (e.g., internal state machines are cleared and set to an idle state), except that the contents of the data random access memory 122, weight random access memory 124 and program memory 129 remain intact. In addition, internal buffers such as the accumulator 202 are not affected by the reset function and must be cleared explicitly, for example by the initialize neural processing unit instruction at address 0 of Fig. 4. In one embodiment, the function 1432 may include a direct execution function in which the first source buffer contains a micro-operation (see, for example, micro-operation 3418 of Figure 34). The direct execution function instructs the neural network unit 121 to directly execute the specified micro-operation. In this way, a framework program can directly control the neural network unit 121 to perform operations, rather than writing instructions to the program memory 129 and subsequently instructing the neural network unit 121 to execute the instructions in the program memory 129 through the execution of MTNN instructions 1400 (or the MFNN instructions 1500 of Figure 15). Figure 14 shows an example of the function of writing the data random access memory 122.
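A small, hedged sketch of how the fields described above might be modeled in software (Python); the field widths and function encodings below are assumptions for illustration, not values taken from the patent.

    from dataclasses import dataclass

    # assumed encodings of function 1432 carried in the immediate field 1412
    FUNC_WRITE_DATA_RAM   = 0
    FUNC_WRITE_WEIGHT_RAM = 1
    FUNC_WRITE_PROGRAM    = 2
    FUNC_WRITE_CONTROL    = 3
    FUNC_START_PROGRAM    = 4
    FUNC_RESET            = 5

    @dataclass
    class MTNNInstruction:
        opcode: int        # execute code field 1402
        src1: int          # media cache number, field 1404
        src2: int          # media cache number, field 1406
        gpr: int           # general caching device number, field 1408
        immediate: int     # function 1432, field 1412

    def execute_mtnn(insn, gprs, media_regs, nnu):
        addr = gprs[insn.gpr]                                   # becomes address 1422
        data = media_regs[insn.src1] + media_regs[insn.src2]    # two 256-bit halves -> 512 bits
        if insn.immediate == FUNC_WRITE_DATA_RAM:
            nnu.data_ram.write(addr, data)       # nnu is a hypothetical model object
        elif insn.immediate == FUNC_WRITE_WEIGHT_RAM:
            nnu.weight_ram.write(addr, data)
        # other functions 1432 elided in this sketch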
The gpr field specifies a general caching device in the general caching device archives 116. In one embodiment, each general caching device is 64 bits. The general caching device archives 116 provides the value of the selected general caching device to the neural network unit 121, as shown, and the neural network unit 121 uses this value as an address 1422. The address 1422 selects a column of the memory specified in the function 1432. In the case of the data random access memory 122 or weight random access memory 124, the address 1422 additionally selects a data block within the selected column whose size is twice the size in bits of a media cache (e.g., 512 bits). For a preferred embodiment, this location is on a 512-bit boundary. In one embodiment, a multiplexer selects either the address 1422 (or the address 1522 in the case of the MFNN instruction 1500 described below) or the address 123/125/131 from the sequencer 128 to provide to the data random access memory 122/weight random access memory 124/program memory 129. In one embodiment, the data random access memory 122 has dual ports, enabling the neural processing units 126 to read or write the data random access memory 122 at the same time the media caches 118 are reading or writing it. In one embodiment, the weight random access memory 124 also has dual ports for a similar purpose.
The src1 field 1404 and the src2 field 1406 each specify a media cache in the media cache archives 118. In one embodiment, each media cache 118 is 256 bits. The media cache archives 118 provides the concatenated data (e.g., 512 bits) from the selected media caches to the data random access memory 122 (or weight random access memory 124 or program memory 129), to be written to the selected column 1428 specified by the address 1422 and to the location within the selected column 1428 specified by the address 1422, as shown. Through execution of a series of MTNN instructions 1400 (and MFNN instructions 1500 described below), a framework program executing on the processor 100 can fill columns of the data random access memory 122 and columns of the weight random access memory 124 and write a program to the program memory 129, such as the programs described herein (e.g., the programs of Fig. 4 and Fig. 9), to cause the neural network unit 121 to perform operations on the data and weights at a very fast rate and thereby accomplish the artificial neural network. In one embodiment, the framework program directly controls the neural network unit 121 rather than writing a program to the program memory 129.
In one embodiment, rather than specifying two source buffers (as specified by fields 1404 and 1406), the MTNN instruction 1400 specifies a starting source buffer and a number of source buffers, Q. An MTNN instruction 1400 of this form instructs the processor 100 to write the media cache 118 specified as the starting source buffer and the next Q-1 sequential media caches 118 to the neural network unit 121, that is, to the specified data random access memory 122 or weight random access memory 124. For a preferred embodiment, the instruction translator 104 translates the MTNN instruction 1400 into the required number of microcommands to write all Q specified media caches 118. For example, in one embodiment, when an MTNN instruction 1400 specifies buffer MR4 as the starting source buffer and Q is 8, the instruction translator 104 translates the MTNN instruction 1400 into four microcommands, of which the first writes buffers MR4 and MR5, the second writes buffers MR6 and MR7, the third writes buffers MR8 and MR9, and the fourth writes buffers MR10 and MR11. In another embodiment, in which the data path from the media caches 118 to the neural network unit 121 is 1024 bits rather than 512 bits, the instruction translator 104 translates the MTNN instruction 1400 into two microcommands, of which the first writes buffers MR4 through MR7 and the second writes buffers MR8 through MR11. The present invention may also be applied to an embodiment in which the MFNN instruction 1500 specifies a starting destination buffer and a number of destination buffers, enabling each MFNN instruction 1500 to read from a column of the data random access memory 122 or weight random access memory 124 a block of data larger than a single media cache 118.
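A brief sketch (Python, illustrative names) of the microcommand expansion described above, assuming a 512-bit data path and 256-bit media caches, so that each microcommand carries two caches:

    def expand_mtnn(start_reg, q, path_bits=512, reg_bits=256):
        """Return the list of media-cache groups, one group per microcommand."""
        regs_per_uop = path_bits // reg_bits          # 2 for a 512-bit path, 4 for 1024 bits
        regs = [f"MR{start_reg + i}" for i in range(q)]
        return [regs[i:i + regs_per_uop] for i in range(0, q, regs_per_uop)]

    # MR4 as the starting source buffer with Q = 8 and a 512-bit path:
    # [['MR4', 'MR5'], ['MR6', 'MR7'], ['MR8', 'MR9'], ['MR10', 'MR11']]
    # the same instruction with a 1024-bit path:
    # [['MR4', 'MR5', 'MR6', 'MR7'], ['MR8', 'MR9', 'MR10', 'MR11']]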
Figure 15 is a block schematic diagram showing a move from neural network (MFNN) framework instruction 1500 and its operation with respect to portions of the neural network unit 121 of Fig. 1. The MFNN instruction 1500 includes an execute code field 1502, a dst field 1504, a gpr field 1508 and an immediate field 1512. The MFNN instruction is a framework instruction, i.e., it is included in the instruction set architecture of the processor 100. For a preferred embodiment, the instruction set architecture uses a predetermined value of the execute code field 1502 to distinguish the MFNN instruction 1500 from the other instructions in the instruction set architecture. The execute code 1502 of the MFNN instruction 1500 may or may not include a prefix, such as is common in the x86 architecture.
The immediate field 1512 provides a value that specifies a function 1532 to the control logic 1434 of the neural network unit 121. For a preferred embodiment, this function 1532 is provided as an immediate operand of the microcommands 105 of Fig. 1. The functions 1532 that can be performed by the neural network unit 121 include, but are not limited to, reading the data random access memory 122, reading the weight random access memory 124, reading the program memory 129, and reading the status buffer 127. The example of Figure 15 shows the function of reading the data random access memory 122.
The gpr field 1508 specifies a general caching device in the general caching device archives 116. The general caching device archives 116 provides the value of the selected general caching device to the neural network unit 121, as shown, and the neural network unit 121 uses this value as an address 1522, operating in a manner similar to the address 1422 of Figure 14, to select a column of the memory specified in the function 1532. In the case of the data random access memory 122 or weight random access memory 124, the address 1522 additionally selects a data block within the selected column whose size is the size in bits of a media cache (e.g., 256 bits). For a preferred embodiment, this location is on a 256-bit boundary.
The dst field 1504 specifies a media cache in the media cache archives 118. As shown, the media cache archives 118 receives the data (e.g., 256 bits) from the data random access memory 122 (or weight random access memory 124 or program memory 129) into the selected media cache, the data being read from the selected column 1528 specified by the address 1522 and from the location within the selected column 1528 specified by the address 1522.
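A minimal sketch (Python, with assumed bit widths) of how the address value taken from the general caching device could be split into a column number and a block offset, as described for addresses 1422 and 1522; the exact encoding (low-order bits selecting the block) is an assumption for illustration only.

    def decode_nnu_address(addr, row_bits=8192, block_bits=256):
        """Split a GPR-supplied address into (column, block) indices.

        row_bits:   width of one column of the data/weight RAM (8192 bits here)
        block_bits: width of one transfer block (256 bits for MFNN, 512 for MTNN)
        """
        blocks_per_row = row_bits // block_bits    # 32 blocks of 256 bits, or 16 of 512
        column = addr // blocks_per_row            # selects column 1428/1528
        block = addr % blocks_per_row              # selects the block within the column
        return column, block

    # e.g. with 256-bit blocks, address 35 -> column 1, block 3 (bits [1023:768] of that column)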
Port configuration of the neural network unit internal random access memories
Figure 16 is a block schematic diagram showing an embodiment of the data random access memory 122 of Fig. 1. The data random access memory 122 includes a memory array 1606, a read port 1602 and a write port 1604. The memory array 1606 holds the data literals, which for a preferred embodiment are arranged as an array of D columns of N texts, as described above. In one embodiment, the memory array 1606 comprises an array of 64 horizontally arranged static random access memory cells, each of which is 128 bits wide and 64 tall, providing a 64KB data random access memory 122 that is 8192 bits wide and has 64 columns, and the die area used by the data random access memory 122 is approximately 0.2 square millimeters. However, the present invention is not limited thereto.
For a preferred embodiment, the read port 1602 is coupled, in a multiplexed fashion, to the neural processing units 126 and to the media caches 118. (More precisely, the media caches 118 may be coupled to the read port through result buses that also provide data to the reorder buffer and/or the result forwarding buses to the other execution units 112.) The neural processing units 126 and the media caches 118 share the read port 1602 to read the data random access memory 122. Also, for a preferred embodiment, the write port 1604 is likewise coupled, in a multiplexed fashion, to the neural processing units 126 and to the media caches 118. The neural processing units 126 and the media caches 118 share the write port 1604 to write the data random access memory 122. In this way, the media caches 118 can write to the data random access memory 122 while the neural processing units 126 are reading from it, and the neural processing units 126 can write to the data random access memory 122 while the media caches 118 are reading from it. This arrangement can improve performance. For example, the neural processing units 126 can read the data random access memory 122 (e.g., to continue performing computations) while the media caches 118 write more data literals to the data random access memory 122. In another example, the neural processing units 126 can write computation results to the data random access memory 122 while the media caches 118 read computation results from it. In one embodiment, a neural processing unit 126 can write a column of computation results to the data random access memory 122 while also reading a column of data literals from it. In one embodiment, the memory array 1606 is configured as memory blocks (banks). When the neural processing units 126 access the data random access memory 122, all of the memory blocks are activated to access a full column of the memory array 1606; whereas when the media caches 118 access the data random access memory 122, only the specified memory blocks are activated. In one embodiment, each memory block is 128 bits wide and the media caches 118 are 256 bits wide, so, for example, two memory blocks are activated for each media cache 118 access. In one embodiment, one of the ports 1602/1604 is a read/write port. In one embodiment, both of the ports 1602/1604 are read/write ports.
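An illustrative Python sketch of the bank-activation rule described above, assuming 128-bit memory blocks, a 256-bit media cache access, and an 8192-bit column; the helper names are hypothetical.

    ROW_BITS = 8192
    BANK_BITS = 128
    NUM_BANKS = ROW_BITS // BANK_BITS            # 64 memory blocks per column

    def banks_to_activate(accessor, bit_offset=0, access_bits=256):
        """Return the set of memory-block indices that must be activated."""
        if accessor == "npu":
            # the neural processing units always access a full column
            return set(range(NUM_BANKS))
        # a media cache access touches only the blocks covering its 256-bit window
        first = bit_offset // BANK_BITS
        last = (bit_offset + access_bits - 1) // BANK_BITS
        return set(range(first, last + 1))

    # banks_to_activate("media", bit_offset=512) -> {4, 5}: two blocks, as in the text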
An advantage of giving the neural processing units 126 the rotator capability described herein is that it helps reduce the number of columns of the memory array 1606 of the data random access memory 122, and thus its size, compared to the memory array that would otherwise be needed to ensure the neural processing units 126 are highly utilized, which would require the framework program (via the media caches 118) to keep providing data to the data random access memory 122 and fetching results from it while the neural processing units 126 are performing computations.
Internal random access memory buffer
Figure 17 is a block schematic diagram showing an embodiment of the weight random access memory 124 and buffer 1704 of Fig. 1. The weight random access memory 124 includes a memory array 1706 and a port 1702. The memory array 1706 holds the weight texts, which for a preferred embodiment are arranged as an array of W columns of N texts, as described above. In one embodiment, the memory array 1706 comprises an array of 128 horizontally arranged static random access memory cells, each of which is 64 bits wide and 2048 tall, providing a 2MB weight random access memory 124 that is 8192 bits wide and has 2048 columns, and the die area used by the weight random access memory 124 is approximately 2.4 square millimeters. However, the present invention is not limited thereto.
For a preferred embodiment, the port 1702 is coupled, in a multiplexed fashion, to the neural processing units 126 and to the buffer 1704. The neural processing units 126 and the buffer 1704 read and write the weight random access memory 124 through the port 1702. The buffer 1704 is further coupled to the media caches 118 of Fig. 1, so that the media caches 118 read and write the weight random access memory 124 through the buffer 1704. The advantage of this arrangement is that, while the neural processing units 126 are reading or writing the weight random access memory 124, the media caches 118 can write to or read from the buffer 1704 (although, if the neural processing units 126 are executing, they are preferably stalled to avoid accessing the weight random access memory 124 while the buffer 1704 is accessing the weight random access memory 124). This arrangement can improve performance, particularly because the reads and writes of the weight random access memory 124 by the media caches 118 are relatively much smaller than the reads and writes of the weight random access memory 124 by the neural processing units 126. For example, in one embodiment, a neural processing unit 126 reads/writes 8192 bits (one column) at a time, whereas the media caches 118 are 256 bits wide and each MTNN instruction 1400 writes only two media caches 118, i.e., 512 bits. Thus, in the case where the framework program executes sixteen MTNN instructions 1400 to fill the buffer 1704, a conflict between the neural processing units 126 and the framework program for access to the weight random access memory 124 occurs less than approximately six percent of the time. In another embodiment, the instruction translator 104 translates an MTNN instruction 1400 into two microcommands 105, each of which writes a single media cache 118 to the buffer 1704, further reducing the frequency of conflicts between the neural processing units 126 and the framework program for access to the weight random access memory 124.
In an embodiment comprising the buffer 1704, writing the weight random access memory 124 using a framework program requires multiple MTNN instructions 1400. One or more MTNN instructions 1400 specify a function 1432 to write specified data blocks of the buffer 1704, and then an MTNN instruction 1400 specifies a function 1432 that instructs the neural network unit 121 to write the contents of the buffer 1704 to a selected column of the weight random access memory 124. The size of a single data block is twice the number of bits of a media cache 118, and the data blocks naturally align within the buffer 1704. In one embodiment, each MTNN instruction 1400 that specifies a function 1432 to write specified data blocks of the buffer 1704 includes a bit mask (bitmask) having a bit corresponding to each data block of the buffer 1704. The data from the two specified source buffers 118 is written to each data block of the buffer 1704 whose corresponding bit in the bitmask is set. This embodiment is useful when a column of the weight random access memory 124 holds repeated data values. For example, in order to zero out the buffer 1704 (and a subsequent column of the weight random access memory 124), the programmer can load zero into the source buffers and set all bits of the bitmask. In addition, the bitmask enables the programmer to write only selected data blocks of the buffer 1704 while leaving the other data blocks with their previous data values.
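A small Python sketch of the bitmask-controlled buffer fill and commit described above, assuming 512-bit data blocks (twice a 256-bit media cache) and an 8192-bit buffer; the names are illustrative only.

    NUM_BLOCKS = 16                        # 8192-bit buffer / 512-bit data blocks
    buffer_1704 = [b"\x00" * 64 for _ in range(NUM_BLOCKS)]   # 64 bytes = 512 bits per block

    def mtnn_write_buffer(bitmask, block_data):
        """Write block_data (512 bits) to every buffer block whose bitmask bit is set."""
        for i in range(NUM_BLOCKS):
            if bitmask & (1 << i):
                buffer_1704[i] = block_data

    def mtnn_commit_buffer(weight_ram, column):
        """Write the whole buffer 1704 to one selected column of the weight RAM."""
        weight_ram[column] = b"".join(buffer_1704)

    # zeroing a weight RAM column: one buffer write with all mask bits set, then a commit
    # mtnn_write_buffer(0xFFFF, b"\x00" * 64); mtnn_commit_buffer(weight_ram, 5)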
In an embodiment comprising the buffer 1704, reading the weight random access memory 124 using a framework program requires multiple MFNN instructions 1500. An initial MFNN instruction 1500 specifies a function 1532 to load the buffer 1704 from a specified column of the weight random access memory 124, and then one or more MFNN instructions 1500 specify a function 1532 to read a specified data block of the buffer 1704 into a destination buffer. The size of a single data block is the number of bits of a media cache 118, and the data blocks naturally align within the buffer 1704. The technical features of the present invention are equally applicable to other embodiments in which the weight random access memory 124 has multiple buffers 1704, increasing the accesses the framework program can make while the neural processing units 126 are executing, to further reduce the conflicts between the neural processing units 126 and the framework program caused by accesses to the weight random access memory 124, and to increase the likelihood that the buffer 1704 accesses can be performed during time-frequency periods in which the neural processing units 126 do not need to access the weight random access memory 124.
Figure 16 describes a dual-port data random access memory 122, but the present invention is not limited thereto. The technical features of the present invention are equally applicable to other embodiments in which the weight random access memory 124 also has a dual-port design. In addition, Figure 17 describes a buffer used with the weight random access memory 124, but the present invention is not limited thereto. The technical features of the present invention are equally applicable to embodiments in which the data random access memory 122 has a corresponding buffer similar to the buffer 1704.
Dynamically configurable neural processing unit
Figure 18 is a block schematic diagram showing a dynamically configurable neural processing unit 126 of Fig. 1. The neural processing unit 126 of Figure 18 is similar to the neural processing unit 126 of Fig. 2. However, the neural processing unit 126 of Figure 18 is dynamically configurable to operate in one of two different configurations. In the first configuration, the neural processing unit 126 of Figure 18 operates similarly to the neural processing unit 126 of Fig. 2. That is, in the first configuration, denoted here as the "wide" configuration or "single" configuration, the arithmetic logic unit 204 of the neural processing unit 126 performs operations on a single wide data literal and a single wide weight text (e.g., 16 bits) to generate a single wide result. In contrast, in the second configuration, denoted here as the "narrow" configuration or "dual" configuration, the neural processing unit 126 performs operations on two narrow data literals and two narrow weight texts (e.g., 8 bits) to generate two respective narrow results. In one embodiment, the configuration (wide or narrow) of the neural processing unit 126 is established by the initialize neural processing unit instruction (e.g., at address 0 of Figure 20 described below). Alternatively, the configuration can be established by an MTNN instruction having a function 1432 that specifies setting the neural processing unit to the configuration (wide or narrow). For a preferred embodiment, the program memory 129 instruction or the MTNN instruction that determines the configuration (wide or narrow) fills a configuration buffer. For example, the configuration buffer output is provided to the arithmetic logic unit 204, the run function unit 212, and the logic that generates the multitask buffer control signal 213. Generally speaking, the components of the neural processing unit 126 of Figure 18 that bear the same numbering as components of Fig. 2 perform similar functions, and reference may be made thereto for an understanding of the embodiment of Figure 18. The embodiment of Figure 18 and its differences from Fig. 2 are described below.
The neural processing unit 126 of Figure 18 includes two buffers 205A and 205B, two three-input multitask buffers 208A and 208B, an arithmetic logic unit 204, two accumulators 202A and 202B, and two run function units 212A and 212B. The buffers 205A/205B each have half the width (e.g., 8 bits) of the buffer 205 of Fig. 2. The buffers 205A/205B each receive a corresponding narrow weight text 206A/206B (e.g., 8 bits) from the weight random access memory 124 and provide their outputs 203A/203B in a subsequent time-frequency period to the operand selection logic 1898 of the arithmetic logic unit 204. When the neural processing unit 126 is in the wide configuration, the buffers 205A/205B effectively operate together to receive a wide weight text 206A/206B (e.g., 16 bits) from the weight random access memory 124, similar to the buffer 205 in the embodiment of Fig. 2; when the neural processing unit 126 is in the narrow configuration, the buffers 205A/205B effectively operate individually, each receiving a narrow weight text 206A/206B (e.g., 8 bits) from the weight random access memory 124, such that the neural processing unit 126 is effectively two separate narrow neural processing units. Nevertheless, regardless of the configuration of the neural processing unit 126, the same output bits of the weight random access memory 124 are coupled to and provided to the buffers 205A/205B. For example, the buffer 205A of neural processing unit 0 receives byte 0, the buffer 205B of neural processing unit 0 receives byte 1, the buffer 205A of neural processing unit 1 receives byte 2, the buffer 205B of neural processing unit 1 receives byte 3, and so on, and the buffer 205B of neural processing unit 511 receives byte 1023.
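A short sketch (Python) of the fixed byte-to-buffer mapping described above, which is the same in both configurations; only the interpretation of the two bytes changes. The function names and the little-endian pairing of the two bytes in the wide case are assumptions for illustration.

    def weight_bytes_for_npu(j, weight_row):
        """Return the two bytes of a weight RAM column routed to NPU j's buffers 205A/205B."""
        byte_a = weight_row[2 * j]        # buffer 205A of NPU j, e.g. byte 0 for NPU 0
        byte_b = weight_row[2 * j + 1]    # buffer 205B of NPU j, e.g. byte 1023 for NPU 511
        return byte_a, byte_b

    def weight_operands(j, weight_row, narrow):
        a, b = weight_bytes_for_npu(j, weight_row)
        if narrow:
            return a, b                   # two independent 8-bit narrow weight texts
        return (b << 8) | a               # one 16-bit wide weight text (assumed byte order)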
The multitask buffers 208A/208B each have half the width (e.g., 8 bits) of the buffer 208 of Fig. 2. The multitask buffer 208A selects one of its inputs 207A, 211A and 1811A to store in its buffer and provide on output 209A in a subsequent time-frequency period, and the multitask buffer 208B selects one of its inputs 207B, 211B and 1811B to store in its buffer and provide on output 209B to the operand selection logic 1898 in a subsequent time-frequency period. The input 207A receives a narrow data literal (e.g., 8 bits) from the data random access memory 122, and the input 207B likewise receives a narrow data literal from the data random access memory 122. When the neural processing unit 126 is in the wide configuration, the multitask buffers 208A/208B effectively operate together to receive a wide data literal 207A/207B (e.g., 16 bits) from the data random access memory 122, similar to the multitask buffer 208 in the embodiment of Fig. 2; when the neural processing unit 126 is in the narrow configuration, the multitask buffers 208A/208B effectively operate individually, each receiving a narrow data literal 207A/207B (e.g., 8 bits) from the data random access memory 122, such that the neural processing unit 126 is effectively two separate narrow neural processing units. Nevertheless, regardless of the configuration of the neural processing unit 126, the same output bits of the data random access memory 122 are coupled to and provided to the multitask buffers 208A/208B. For example, the multitask buffer 208A of neural processing unit 0 receives byte 0, the multitask buffer 208B of neural processing unit 0 receives byte 1, the multitask buffer 208A of neural processing unit 1 receives byte 2, the multitask buffer 208B of neural processing unit 1 receives byte 3, and so on, and the multitask buffer 208B of neural processing unit 511 receives byte 1023.
Input 211A receives the output 209A of the multitask buffer 208A of neighbouring neural processing unit 126, input
211B receives the output 209B of the multitask buffer 208B of neighbouring neural processing unit 126.It is neighbouring to input 1811A reception
The output 209B of the multitask buffer 208B of neural processing unit 126, and input 1811B and receive neighbouring neural processing unit
The output 209A of 126 multitask buffer 208A.Nerve processing unit 126 shown in Figure 18 belongs to N number of mind shown in FIG. 1
Through one of processing unit 126 and it is denoted as neural processing unit J.That is, nerve processing unit J is this N number of mind
One through processing unit represents example.For a preferred embodiment, the multitask buffer 208A of neural processing unit J is defeated
The multitask buffer 208A output 209A of neural processing unit 126 of example J-1 can be received by entering 211A, and nerve processing is single
The multitask buffer 208A input 1811A of first J can receive the multitask buffer of the neural processing unit 126 of example J-1
208B exports 209B, and the multitask buffer 208A output 209A of neural processing unit J can be provided to example J+1 simultaneously
The multitask of the neural processing unit 126 of the multitask buffer 208A input 211A and example J of neural processing unit 126 is slow
Storage 208B inputs 211B;The input 211B of the multitask buffer 208B of neural processing unit J can receive the mind of example J-1
Multitask buffer 208B through processing unit 126 exports 209B, and the multitask buffer 208B's of neural processing unit J is defeated
The multitask buffer 208A output 209A of neural processing unit 126 of example J can be received by entering 1811B, also, nerve processing is single
The output 209B of the multitask buffer 208B of first J can be provided to the multitask of the neural processing unit 126 of example J+1 simultaneously
The multitask buffer 208B that buffer 208A inputs the neural processing unit 126 of 1811A and example J+1 inputs 211B.
The control input 213 controls each of the multitask buffers 208A/208B to select one of its three inputs to store in its corresponding buffer and subsequently provide on its corresponding output 209A/209B. When the neural processing unit 126 is instructed to load a column from the data random access memory 122 (such as by the multiply-accumulate instruction at address 1 of Figure 20, described below), regardless of whether the neural processing unit 126 is in the wide or narrow configuration, the control input 213 controls each of the multitask buffers 208A/208B to select its corresponding narrow data literal 207A/207B (e.g., 8 bits) from the corresponding narrow text of the selected column of the data random access memory 122.
When the neural processing unit 126 receives an instruction indicating that it needs to rotate the previously received data column values (such as the multiply-accumulate rotate instruction at address 2 of Figure 20, described below), if the neural processing unit 126 is in the narrow configuration, the control input 213 controls each of the multitask buffers 208A/208B to select its corresponding input 1811A/1811B. In this case, the multitask buffers 208A/208B effectively operate individually, so that the neural processing unit 126 effectively functions as two separate narrow neural processing units. In this way, the multitask buffers 208A and 208B of the N neural processing units 126 cooperate as a rotator of 2N narrow texts, as described in more detail below with respect to Figure 19.
When the neural processing unit 126 receives an instruction indicating that it needs to rotate the previously received data column values, if the neural processing unit 126 is in the wide configuration, the control input 213 controls each of the multitask buffers 208A/208B to select its corresponding input 211A/211B. In this case, the multitask buffers 208A/208B cooperate and effectively behave as if the neural processing unit 126 were a single wide neural processing unit 126. In this way, the multitask buffers 208A and 208B of the N neural processing units 126 cooperate as a rotator of N wide texts, in a manner similar to that described with respect to Fig. 3.
The arithmetic logic unit 204 includes the operand selection logic 1898, a wide multiplier 242A, a narrow multiplier 242B, a wide dual-input multiplexer 1896A, a narrow dual-input multiplexer 1896B, a wide adder 244A and a narrow adder 244B. Effectively, the arithmetic logic unit 204 can be regarded as comprising the operand selection logic, a wide arithmetic logic unit 204A (comprising the wide multiplier 242A, the wide multiplexer 1896A and the wide adder 244A) and a narrow arithmetic logic unit 204B (comprising the narrow multiplier 242B, the narrow multiplexer 1896B and the narrow adder 244B). For a preferred embodiment, the wide multiplier 242A multiplies two wide texts and is similar to the multiplier 242 of Fig. 2, e.g., a 16-bit by 16-bit multiplier. The narrow multiplier 242B multiplies two narrow texts, e.g., an 8-bit by 8-bit multiplier that generates a 16-bit result. When the neural processing unit 126 is in the narrow configuration, with the assistance of the operand selection logic 1898, the wide multiplier 242A can be fully utilized as a narrow multiplier to multiply two narrow texts, so that the neural processing unit 126 effectively functions as two narrow neural processing units. For a preferred embodiment, the wide adder 244A adds the output of the wide multiplexer 1896A and the wide accumulator 202A output 217A to generate a sum 215A for the wide accumulator 202A, operating similarly to the adder 244 of Fig. 2. The narrow adder 244B adds the output of the narrow multiplexer 1896B and the narrow accumulator 202B output 217B to generate a sum 215B for the narrow accumulator 202B. In one embodiment, the narrow accumulator 202B is 28 bits wide, to avoid loss of precision when accumulating up to 1024 products of 16 bits. When the neural processing unit 126 is in the wide configuration, the narrow multiplier 242B, narrow accumulator 202B and narrow run function unit 212B are preferably inactive to reduce power consumption.
The operand selection logic 1898 selects operands from 209A, 209B, 203A and 203B to provide to the other components of the arithmetic logic unit 204, as described in more detail below. For a preferred embodiment, the operand selection logic 1898 also has other functions, such as performing sign extension of signed data literals and weight texts. For example, if the neural processing unit 126 is in the narrow configuration, the operand selection logic 1898 sign-extends the narrow data literal and weight text to the width of a wide text before providing them to the wide multiplier 242A. Similarly, if the arithmetic logic unit 204 is instructed to pass through a narrow data/weight text (skipping the wide multiplier 242A via the wide multiplexer 1896A), the operand selection logic 1898 sign-extends the narrow data literal and weight text to the width of a wide text before providing it to the wide adder 244A. For a preferred embodiment, this logic that performs the sign extension function is also present within the arithmetic logic unit 204 of the neural processing unit 126 of Fig. 2.
The wide multiplexer 1896A receives the output of the wide multiplier 242A and an operand from the operand selection logic 1898, and selects one of these inputs to provide to the wide adder 244A; the narrow multiplexer 1896B receives the output of the narrow multiplier 242B and an operand from the operand selection logic 1898, and selects one of these inputs to provide to the narrow adder 244B.
The operand selection logic 1898 provides operands according to the configuration of the neural processing unit 126 and the arithmetic and/or logical operation to be performed by the arithmetic logic unit 204, which is determined by the function specified by the instruction being executed by the neural processing unit 126. For example, if the instruction instructs the arithmetic logic unit 204 to perform a multiply-accumulate operation and the neural processing unit 126 is in the wide configuration, the operand selection logic 1898 provides a wide text formed by concatenating outputs 209A and 209B to one input of the wide multiplier 242A and a wide text formed by concatenating outputs 203A and 203B to the other input, while the narrow multiplier 242B is inactive; the neural processing unit 126 then operates as a single wide neural processing unit 126 similar to the neural processing unit 126 of Fig. 2. However, if the instruction instructs the arithmetic logic unit 204 to perform a multiply-accumulate operation and the neural processing unit 126 is in the narrow configuration, the operand selection logic 1898 provides an extended, or widened, version of the narrow data literal 209A to one input of the wide multiplier 242A and an extended version of the narrow weight text 203A to the other input; in addition, the operand selection logic 1898 provides the narrow data literal 209B to one input of the narrow multiplier 242B and the narrow weight text 203B to the other input. To extend, or widen, a narrow text as described above, if the narrow text is signed, the operand selection logic 1898 sign-extends the narrow text; if the narrow text is unsigned, the operand selection logic 1898 pads the upper bits of the narrow text with zeros.
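A minimal Python sketch of the extend/widen step described above, for 8-bit narrow texts widened to 16 bits; purely illustrative.

    def widen_narrow_text(value8, signed):
        """Extend an 8-bit narrow text to a 16-bit wide text."""
        value8 &= 0xFF
        if signed and (value8 & 0x80):
            return value8 | 0xFF00      # sign-extend: copy the sign bit into the upper byte
        return value8                   # unsigned (or non-negative): pad upper bits with zeros

    # widen_narrow_text(0x85, signed=True)  -> 0xFF85  (-123 stays -123)
    # widen_narrow_text(0x85, signed=False) -> 0x0085  (133 stays 133)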
In another example, if neural processing unit 126 is in wide configuration and instructs instruction arithmetic logic unit 204
The accumulating operation of a weight text is executed, wide multiplier 242A will be skipped, and operand selection logic 1898 will will be defeated
203A is concatenated with 203B out is provided to wide multiplexer 1896A to be supplied to wide adder 244A.But, if neural processing unit
126 in the narrow accumulating operation configured and instruction arithmetic logic unit 204 is instructed to execute a weight text, wide multiplier 242A
It will be skipped, and the output 203A of version after an extension will be provided to wide multiplexer by operand selection logic 1898
1896A is to be supplied to wide adder 244A;In addition, narrow multiplier 242B can be skipped, operand selection logic 1898 can will prolong
The output 203B of version is provided to narrow multiplexer 1896B to be supplied to narrow adder 244B after exhibition.
In another example, if neural processing unit 126 is in wide configuration and instructs instruction arithmetic logic unit 204
The accumulating operation of a data literal is executed, wide multiplier 242A will be skipped, and operand selection logic 1898 will will be defeated
209A is concatenated with 209B out is provided to wide multiplexer 1896A to be supplied to wide adder 244A.But, if neural processing unit
126 in the narrow accumulating operation configured and instruction arithmetic logic unit 204 is instructed to execute a data literal, wide multiplier 242A
It will be skipped, and the output 209A of version after an extension will be provided to wide multiplexer by operand selection logic 1898
1896A is to be supplied to wide adder 244A;In addition, narrow multiplier 242B can be skipped, operand selection logic 1898 can will prolong
The output 209B of version is provided to narrow multiplexer 1896B to be supplied to narrow adder 244B after exhibition.Weight/data literal is cumulative
Calculating facilitates average calculating operation, the common source that the available certain artificial neural networks as including image processing of average calculating operation are applied
(pooling) layer.
For a preferred embodiment, the neural processing unit 126 also includes a second wide multiplexer (not shown) for bypassing the wide adder 244A, to facilitate loading the wide accumulator 202A with a wide data/weight text in the wide configuration or an extended narrow data/weight text in the narrow configuration, and a second narrow multiplexer (not shown) for bypassing the narrow adder 244B, to facilitate loading the narrow accumulator 202B with a narrow data/weight text in the narrow configuration. For a preferred embodiment, the arithmetic logic unit 204 also includes wide and narrow comparator/multiplexer combinations (not shown) that receive the corresponding accumulator value 217A/217B and the corresponding multiplexer 1896A/1896B output, to select the maximum value between the accumulator value 217A/217B and a data/weight text 209A/209B/203A/203B, an operation used by the pooling layers of certain artificial neural network applications, as described in more detail below, e.g., with respect to Figures 27 and 28. In addition, the operand selection logic 1898 provides operands of value zero (for addition with zero or for clearing the accumulator) and operands of value one (for multiplication by one).
The narrow run function unit 212B receives the output 217B of the narrow accumulator 202B and performs a run function on it to generate a narrow result 133B, and the wide run function unit 212A receives the output 217A of the wide accumulator 202A and performs a run function on it to generate a wide result 133A. When the neural processing unit 126 is in the narrow configuration, the wide run function unit 212A considers the output 217A of the accumulator 202A according to this configuration and performs a run function on it to generate a narrow result (e.g., 8 bits), as described in more detail below with respect to Figures 29A through 30.
As described above, when in the narrow configuration, a single neural processing unit 126 effectively functions as two narrow neural processing units and therefore, for smaller texts, generally provides up to twice the processing capacity of the wide configuration. For example, assume a neural network layer has 1024 neurons, and each neuron receives 1024 narrow inputs from the previous layer (and has narrow weight texts), resulting in one million connections. Compared to a neural network unit 121 with 512 neural processing units 126 in the wide configuration, the same neural network unit 121 in the narrow configuration (equivalent to 1024 narrow neural processing units), although processing narrow texts rather than wide texts, can handle four times the number of connections (one million connections versus 256K connections) in approximately twice the time (about 1026 time-frequency periods versus 514 time-frequency periods).
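A tiny Python check of the arithmetic behind this comparison, assuming roughly two time-frequency periods of fixed overhead per pass; the overhead figure is an assumption used only to reproduce the quoted counts, not a value from the patent.

    NEURONS, INPUTS = 1024, 1024
    NPUS = 512
    OVERHEAD = 2                                  # assumed per-pass overhead

    # narrow configuration: 1024 narrow NPUs, one neuron per narrow NPU
    narrow_connections = NEURONS * INPUTS         # 1,048,576 (the "one million" connections)
    narrow_clocks = INPUTS + OVERHEAD             # about 1026 time-frequency periods

    # wide configuration: 512 wide NPUs handling 512 neurons x 512 wide inputs per pass
    wide_connections = NPUS * 512                 # 262,144 (the 256K connections)
    wide_clocks = 512 + OVERHEAD                  # about 514 time-frequency periods

    assert narrow_connections == 4 * wide_connections
    assert abs(narrow_clocks / wide_clocks - 2) < 0.01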
In one embodiment, the dynamically configurable neural processing unit 126 of Figure 18 includes three-input multitask buffers similar to the multitask buffers 208A and 208B in place of the buffers 205A and 205B, to form a rotator for the column of weight texts received from the weight random access memory 124, somewhat in the manner described for the embodiment of Fig. 7 but applied to the dynamic configuration described for Figure 18.
Figure 19 is a block schematic diagram showing, according to the embodiment of Figure 18, the 2N multitask buffers 208A/208B of the N neural processing units 126 of the neural network unit 121 of Fig. 1, operating as a rotator for a column of data literals 207 received from the data random access memory 122 of Fig. 1. In the embodiment of Figure 19, N is 512, and the neural network unit 121 has 1024 multitask buffers 208A/208B, denoted 0 through 511, corresponding respectively to the 512 neural processing units 126 and effectively to 1024 narrow neural processing units. The two narrow neural processing units within a neural processing unit 126 are denoted A and B, and each multitask buffer 208 is also labeled with its corresponding narrow neural processing unit. More specifically, the multitask buffer 208A of the neural processing unit 126 denoted 0 is denoted 0-A, the multitask buffer 208B of the neural processing unit 126 denoted 0 is denoted 0-B, the multitask buffer 208A of the neural processing unit 126 denoted 1 is denoted 1-A, the multitask buffer 208B of the neural processing unit 126 denoted 1 is denoted 1-B, the multitask buffer 208A of the neural processing unit 126 denoted 511 is denoted 511-A, and the multitask buffer 208B of the neural processing unit 126 denoted 511 is denoted 511-B; these designations also correspond to the narrow neural processing units described in Figure 21 below.
Each multitask buffer 208A receives its corresponding narrow data text 207A from one of the D columns of the data random access memory 122, and each multitask buffer 208B receives its corresponding narrow data text 207B from that same column. That is, multitask buffer 0-A receives narrow data text 0 of the data random access memory 122 column, multitask buffer 0-B receives narrow data text 1, multitask buffer 1-A receives narrow data text 2, multitask buffer 1-B receives narrow data text 3, and so on, up to multitask buffer 511-A, which receives narrow data text 1022, and multitask buffer 511-B, which receives narrow data text 1023. In addition, multitask buffer 1-A receives the output 209A of multitask buffer 0-A as its input 211A, multitask buffer 1-B receives the output 209B of multitask buffer 0-B as its input 211B, and so on up to multitask buffer 511-A, which receives the output 209A of multitask buffer 510-A as its input 211A, and multitask buffer 511-B, which receives the output 209B of multitask buffer 510-B as its input 211B; multitask buffer 0-A receives the output 209A of multitask buffer 511-A as its input 211A, and multitask buffer 0-B receives the output 209B of multitask buffer 511-B as its input 211B. Each multitask buffer 208A/208B receives the control input 213, which controls whether it selects the data text 207A/207B, the rotated input 211A/211B (rotation by one wide text, i.e., two narrow texts), or the rotated input 1811A/1811B (rotation by one narrow text). Finally, multitask buffer 1-A receives the output 209B of multitask buffer 0-B as its input 1811A, multitask buffer 1-B receives the output 209A of multitask buffer 1-A as its input 1811B, and so on up to multitask buffer 511-A, which receives the output 209B of multitask buffer 510-B as its input 1811A, and multitask buffer 511-B, which receives the output 209A of multitask buffer 511-A as its input 1811B; multitask buffer 0-A receives the output 209B of multitask buffer 511-B as its input 1811A, and multitask buffer 0-B receives the output 209A of multitask buffer 0-A as its input 1811B. In one mode of operation, during the first time-frequency period the control input 213 causes each multitask buffer 208A/208B to select the data text 207A/207B for storage in the buffer and subsequent provision to the arithmetic logic unit 204; during the subsequent time-frequency periods (e.g., the M-1 time-frequency periods mentioned above), the control input 213 causes each multitask buffer 208A/208B to select the rotated input 1811A/1811B for storage in the buffer and subsequent provision to the arithmetic logic unit 204. This is described in more detail in the following sections.
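The following behavioral sketch (Python, illustrative only, not register-transfer-level hardware) models how the 2N multitask buffers of Figure 19 act as a single 1024-text rotator when the control input 213 selects the rotated input 1811A/1811B.

N = 512
LANES = 2 * N  # buffers 0-A, 0-B, 1-A, 1-B, ..., 511-A, 511-B

def load_column(regs, column):
    """Control input 213 selects data text 207A/207B: load one data random
    access memory 122 column into the 1024 multitask buffers."""
    regs[:] = column[:LANES]

def rotate_by_one_narrow_text(regs):
    """Control input 213 selects the rotated input 1811A/1811B: each buffer
    takes the value of its lower-numbered neighbour, and buffer 0-A wraps
    around from buffer 511-B."""
    last = regs[-1]
    regs[1:] = regs[:-1]
    regs[0] = last

regs = [0] * LANES
load_column(regs, list(range(1024)))   # e.g. column 17 of the data RAM
rotate_by_one_narrow_text(regs)
assert regs[0] == 1023 and regs[1] == 0  # 0-A now holds text 1023, 0-B holds text 0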
Figure 20 is a table showing a program stored in the program memory 129 of the neural network unit 121 of Figure 1 and executed by that neural network unit 121, where the neural network unit 121 has neural processing units 126 as shown in the embodiment of Figure 18. The example program of Figure 20 is similar to the program of Figure 4; the differences are described below. The initialize neural processing unit instruction at address 0 specifies that the neural processing units 126 enter the narrow configuration. In addition, as shown in the figure, the multiply-accumulate rotate instruction at address 2 specifies a count value of 1023 and requires 1023 time-frequency periods. This is because the example of Figure 20 assumes a layer of effectively 1024 narrow (e.g., 8-bit) neurons (i.e., narrow neural processing units), each with 1024 connection inputs from the 1024 neurons of the previous layer, for a total of 1024K connections. Each neuron receives an 8-bit data value from each connection input and multiplies that 8-bit data value by an appropriate 8-bit weight value.
Figure 21 is a timing diagram showing the neural network unit 121 executing the program of Figure 20, with the neural processing units 126 of Figure 18 operating in the narrow configuration. The timing diagram of Figure 21 is similar to the timing diagram of Figure 5; the differences are described below.
In the timing diagram of Figure 21, the neural processing units 126 are in the narrow configuration because the initialize neural processing unit instruction at address 0 initializes them with the narrow configuration. Consequently, the 512 neural processing units 126 effectively operate as 1024 narrow neural processing units (or neurons), which are designated in the figure as neural processing unit 0-A and neural processing unit 0-B (the two narrow neural processing units of neural processing unit 126 number 0), neural processing unit 1-A and neural processing unit 1-B (the two narrow neural processing units of neural processing unit 126 number 1), and so on up to neural processing unit 511-A and neural processing unit 511-B (the two narrow neural processing units of neural processing unit 126 number 511). To simplify the illustration, only the operations of narrow neural processing units 0-A, 0-B and 511-B are shown. Because the multiply-accumulate rotate instruction at address 2 has a count value of 1023 and therefore requires 1023 time-frequency periods, the timing diagram of Figure 21 spans as many as 1026 time-frequency periods.
During time-frequency period 0, each of the 1024 narrow neural processing units executes the initialization instruction (as in Figure 4), i.e., the operation shown in Figure 5 of assigning the value zero to the accumulator 202.
During time-frequency period 1, each of the 1024 narrow neural processing units executes the multiply-accumulate instruction at address 1 of Figure 20. As shown in the figure, narrow neural processing unit 0-A adds to the accumulator 202A value (i.e., zero) the product of column 17 narrow text 0 of the data random access memory 122 and column 0 narrow text 0 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value (i.e., zero) the product of column 17 narrow text 1 of the data random access memory 122 and column 0 narrow text 1 of the weight random access memory 124; and so on, up to narrow neural processing unit 511-B, which adds to the accumulator 202B value (i.e., zero) the product of column 17 narrow text 1023 of the data random access memory 122 and column 0 narrow text 1023 of the weight random access memory 124.
During time-frequency period 2, each of the 1024 narrow neural processing units executes the first iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown in the figure, narrow neural processing unit 0-A adds to the accumulator 202A value 217A the product of the rotated narrow data text 1811A received from the multitask buffer 208B output 209B of narrow neural processing unit 511-B (namely narrow data text 1023 as received from the data random access memory 122) and column 1 narrow text 0 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value 217B the product of the rotated narrow data text 1811B received from the multitask buffer 208A output 209A of narrow neural processing unit 0-A (namely narrow data text 0 as received from the data random access memory 122) and column 1 narrow text 1 of the weight random access memory 124; and so on, up to narrow neural processing unit 511-B, which adds to the accumulator 202B value 217B the product of the rotated narrow data text 1811B received from the multitask buffer 208A output 209A of narrow neural processing unit 511-A (namely narrow data text 1022 as received from the data random access memory 122) and column 1 narrow text 1023 of the weight random access memory 124.
During time-frequency period 3, each of the 1024 narrow neural processing units executes the second iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown in the figure, narrow neural processing unit 0-A adds to the accumulator 202A value 217A the product of the rotated narrow data text 1811A received from the multitask buffer 208B output 209B of narrow neural processing unit 511-B (namely narrow data text 1022 as received from the data random access memory 122) and column 2 narrow text 0 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value 217B the product of the rotated narrow data text 1811B received from the multitask buffer 208A output 209A of narrow neural processing unit 0-A (namely narrow data text 1023 as received from the data random access memory 122) and column 2 narrow text 1 of the weight random access memory 124; and so on, up to narrow neural processing unit 511-B, which adds to the accumulator 202B value 217B the product of the rotated narrow data text 1811B received from the multitask buffer 208A output 209A of narrow neural processing unit 511-A (namely narrow data text 1021 as received from the data random access memory 122) and column 2 narrow text 1023 of the weight random access memory 124. As shown in Figure 21, this operation continues for the following 1021 time-frequency periods, up to time-frequency period 1024, described below.
During time-frequency period 1024, each of the 1024 narrow neural processing units executes the 1023rd iteration of the multiply-accumulate rotate instruction at address 2 of Figure 20. As shown in the figure, narrow neural processing unit 0-A adds to the accumulator 202A value 217A the product of the rotated narrow data text 1811A received from the multitask buffer 208B output 209B of narrow neural processing unit 511-B (namely narrow data text 1 as received from the data random access memory 122) and column 1023 narrow text 0 of the weight random access memory 124; narrow neural processing unit 0-B adds to the accumulator 202B value 217B the product of the rotated narrow data text 1811B received from the multitask buffer 208A output 209A of narrow neural processing unit 0-A (namely narrow data text 2 as received from the data random access memory 122) and column 1023 narrow text 1 of the weight random access memory 124; and so on, up to narrow neural processing unit 511-B, which adds to the accumulator 202B value 217B the product of the rotated narrow data text 1811B received from the multitask buffer 208A output 209A of narrow neural processing unit 511-A (namely narrow data text 0 as received from the data random access memory 122) and column 1023 narrow text 1023 of the weight random access memory 124.
During time-frequency period 1025, the run function units 212A/212B of each of the 1024 narrow neural processing units execute the run function instruction at address 3 of Figure 20. Finally, during time-frequency period 1026, each of the 1024 narrow neural processing units writes its narrow result 133A/133B back to its corresponding narrow text in column 16 of the data random access memory 122, executing the write run function unit output instruction at address 4 of Figure 20. That is, the narrow result 133A of neural processing unit 0-A is written to narrow text 0 of the data random access memory 122, the narrow result 133B of neural processing unit 0-B is written to narrow text 1 of the data random access memory 122, and so on, up to the narrow result 133B of neural processing unit 511-B, which is written to narrow text 1023 of the data random access memory 122. Figure 22 shows this operation in block diagram form.
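A compact functional model of the computation traced in the timing diagram of Figure 21 is sketched below (Python, for illustration only; the lane and column indexing follows the description above and is otherwise assumed).

def narrow_layer(data_column, weights, n_lanes=1024):
    # data_column: 1024 narrow data texts (one data RAM column, e.g. column 17)
    # weights:     weights[c][lane] for columns 0..1023 of the weight RAM
    acc = [0] * n_lanes                      # accumulators 202A/202B, cleared at address 0
    regs = list(data_column)                 # multitask buffers 208A/208B
    for c in range(n_lanes):                 # address 1 plus 1023 rotate iterations
        for lane in range(n_lanes):
            acc[lane] += regs[lane] * weights[c][lane]
        regs = [regs[-1]] + regs[:-1]        # rotate by one narrow text
    return acc                               # then passed to the run function units (address 3)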
Figure 22 is a block schematic diagram showing the neural network unit 121 of Figure 1 with the neural processing units 126 of Figure 18 executing the program of Figure 20. The neural network unit 121 includes the 512 neural processing units 126, i.e., 1024 narrow neural processing units, the data random access memory 122, which receives its address input 123, and the weight random access memory 124, which receives its address input 125. Although not shown in the figure, during time-frequency period 0 all 1024 narrow neural processing units execute the initialization instruction of Figure 20. As shown in the figure, during time-frequency period 1 the 1024 8-bit data texts of column 17 are read out of the data random access memory 122 and provided to the 1024 narrow neural processing units. During time-frequency periods 1 through 1024, the 1024 8-bit weight texts of columns 0 through 1023, respectively, are read out of the weight random access memory 124 and provided to the 1024 narrow neural processing units. Although not shown in the figure, during time-frequency period 1 the 1024 narrow neural processing units perform their respective multiply-accumulate operations on the loaded data texts and weight texts. During time-frequency periods 2 through 1024, the multitask buffers 208A/208B of the 1024 narrow neural processing units operate as a rotator of 1024 8-bit texts, rotating the previously loaded data texts of column 17 of the data random access memory 122 to the adjacent narrow neural processing units, and the narrow neural processing units perform multiply-accumulate operations on the rotated data texts and the corresponding narrow weight texts loaded from the weight random access memory 124. Although not shown in the figure, during time-frequency period 1025 the 1024 narrow run function units 212A/212B execute the run function instruction. During time-frequency period 1026, the 1024 narrow neural processing units write their respective 1024 8-bit results 133A/133B back to column 16 of the data random access memory 122.
It can thus be seen that, compared to the embodiment of Figure 2, the embodiment of Figure 18 gives the programmer the flexibility to perform computations using either wide data and weight texts (e.g., 16 bits) or narrow data and weight texts (e.g., 8 bits), depending on the accuracy required by a particular application. From one perspective, for narrow-data applications the embodiment of Figure 18 provides twice the performance of the embodiment of Figure 2, at the cost of the additional narrow components (e.g., multitask buffer 208B, buffer 205B, narrow arithmetic logic unit 204B, narrow accumulator 202B, narrow run function unit 212B), which increase the area of the neural processing unit 126 by roughly 50%.
Tri-mode neural processing units
Figure 23 is a block schematic diagram showing another embodiment of the dynamically configurable neural processing unit 126 of Figure 1. The neural processing unit 126 of Figure 23 supports not only the wide and narrow configurations but also a third configuration, referred to here as the "funnel" configuration. The neural processing unit 126 of Figure 23 is similar to the neural processing unit 126 of Figure 18, except that the wide adder 244A of Figure 18 is replaced in the neural processing unit 126 of Figure 23 by a three-input wide adder 2344A, which receives a third addend 2399 that is an extended version of the output of the narrow multiplexer 1896B. A program executed by a neural network unit having the neural processing units of Figure 23 is similar to the program of Figure 20, except that the initialize neural processing unit instruction at address 0 initializes the neural processing units 126 into the funnel configuration rather than the narrow configuration, and the count value of the multiply-accumulate rotate instruction at address 2 is 511 rather than 1023.
When in the funnel configuration, the neural processing unit 126 operates much as in the narrow configuration. When executing a multiply-accumulate instruction such as that at address 1 of Figure 20, the neural processing unit 126 receives two narrow data texts 207A/207B and two narrow weight texts 206A/206B; the wide multiplier 242A multiplies data text 209A by weight text 203A to produce the product 246A, which is selected by the wide multiplexer 1896A; and the narrow multiplier 242B multiplies data text 209B by weight text 203B to produce the product 246B, which is selected by the narrow multiplexer 1896B. However, the wide adder 2344A adds both the product 246A (selected by the wide multiplexer 1896A) and the product 246B/2399 (selected by the wide multiplexer 1896B) to the wide accumulator 202A output 217A, while the narrow adder 244B and the narrow accumulator 202B are inactive. Furthermore, when in the funnel configuration and executing a multiply-accumulate rotate instruction such as that at address 2 of Figure 20, the control input 213 causes the multitask buffers 208A/208B to rotate by two narrow texts (e.g., 16 bits); that is, the multitask buffers 208A/208B select their respective inputs 211A/211B, just as in the wide configuration. The wide multiplier 242A again multiplies data text 209A by weight text 203A to produce the product 246A selected by the wide multiplexer 1896A, the narrow multiplier 242B multiplies data text 209B by weight text 203B to produce the product 246B selected by the narrow multiplexer 1896B, and the wide adder 2344A adds both the product 246A (selected by the wide multiplexer 1896A) and the product 246B/2399 (selected by the wide multiplexer 1896B) to the wide accumulator 202A output 217A, while, as before, the narrow adder 244B and the narrow accumulator 202B are inactive. Finally, when in the funnel configuration and executing a run function instruction such as that at address 3 of Figure 20, the wide run function unit 212A executes the run function on the resulting sum 215A to generate a narrow result 133A, while the narrow run function unit 212B is inactive. Thus, only the narrow neural processing units denoted A generate a narrow result 133A, and the narrow results 133B that would be produced by the narrow neural processing units denoted B are invalid. Therefore, the column of written-back results (e.g., column 16, as indicated by the instruction at address 4 of Figure 20) contains holes, since only the narrow results 133A are valid and the narrow results 133B are invalid. Conceptually, in each time-frequency period each neuron (each neural processing unit of Figure 23) processes two connection data inputs, i.e., multiplies two narrow data texts by their corresponding weights and adds the two products, whereas the embodiments of Figures 2 and 18 process only one connection data input per time-frequency period.
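The following one-line model (Python, illustrative only) captures the funnel-configuration accumulation described above: two narrow products are folded into the single wide accumulator 202A each time-frequency period.

def funnel_mac(acc_a, data_a, data_b, weight_a, weight_b):
    # The wide multiplier 242A and the narrow multiplier 242B each form a
    # narrow product; the three-input wide adder 2344A sums both products
    # into the wide accumulator 202A, while the narrow adder 244B and the
    # narrow accumulator 202B stay idle.
    return acc_a + data_a * weight_a + data_b * weight_b

acc = 0
acc = funnel_mac(acc, 3, 5, 2, 4)   # two connection inputs folded into one accumulation
print(acc)                          # 3*2 + 5*4 = 26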
It may be observed that in the embodiment of Figure 23 the number of result texts (neuron outputs) generated and written back to the data random access memory 122 or the weight random access memory 124 is half the square root of the number of data inputs (connections) received, and the written-back column of results has holes, i.e., every other narrow text result is invalid; more precisely, the results of the narrow neural processing units denoted B are meaningless. The embodiment of Figure 23 is therefore particularly efficient for neural networks with two consecutive layers in which the first layer has twice as many neurons as the second (e.g., a first layer of 1024 neurons fully connected to a second layer of 512 neurons). Moreover, if necessary, another execution unit (e.g., a media unit such as an x86 advanced vector extension unit) can perform a pack operation on a dispersed result column (i.e., one with holes) to make it compact (without holes). The neural network unit 121 can then use the packed data column in computations associated with other columns of the data random access memory 122 and/or the weight random access memory 124.
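As a simple illustration of the pack operation suggested above (the even/odd layout is assumed from the A/B labeling; this is not an actual media-unit instruction), the following sketch compacts a result column with holes.

def pack_results(column_with_holes):
    # Results written back by funnel-configured units: the texts at even
    # positions (the "A" results) are valid, the odd ones ("B") are holes.
    return column_with_holes[0::2]

packed = pack_results(list(range(1024)))
assert len(packed) == 512             # 512 valid neuron outputs, holes removed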
Hybrid neural network unit operation: convolution and common source (pooling) capability
An advantage of the neural network unit 121 described in embodiments of the present invention is that it can simultaneously operate in a manner resembling a coprocessor, executing its own internal program, and in a manner resembling a processor execution unit, executing architectural instructions issued to it (or microinstructions translated from architectural instructions). The architectural instructions belong to a framework program executed by the processor having the neural network unit 121. In this way the neural network unit 121 operates in a hybrid fashion, and high utilization of the neural network unit 121 can be maintained. For example, Figures 24 through 26 show the neural network unit 121 performing a convolution operation in which the neural network unit is fully utilized, and Figures 27 through 28 show the neural network unit 121 performing a common source (pooling) operation. Convolutional layers, common source layers and other numeric-data applications, such as image processing (e.g., edge detection, sharpening, blurring, recognition/classification), require these operations. However, the hybrid operation of the neural network unit 121 is not limited to performing convolution or common source operations; this hybrid capability can also be used to perform other operations, such as the conventional neural-network multiply-accumulate and run function operations described with respect to Figures 4 through 13. That is, the processor 100 (more precisely, the reservation station 108) issues MTNN instructions 1400 and MFNN instructions 1500 to the neural network unit 121; in response to these issued instructions, data is written into the memories 122/124/129 and results are read out of the memories 122/124 that the neural network unit 121 has written, while at the same time the neural network unit 121, executing the program that the processor 100 has written (via MTNN instructions 1400) into the program memory 129, reads and writes the memories 122/124/129.
Figure 24 is a block schematic diagram showing an example of the data structures used by the neural network unit 121 of Figure 1 to perform a convolution operation. The block diagram includes a convolution kernel 2402, a data array 2404, and the data random access memory 122 and weight random access memory 124 of Figure 1. In a preferred embodiment, the data array 2404 (e.g., of image pixels) is held in system memory (not shown) attached to the processor 100 and is loaded by the processor 100 into the weight random access memory 124 of the neural network unit 121 by executing MTNN instructions 1400. A convolution operation convolves a first array with a second array, the second array being the convolution kernel described here. As described herein, a convolution kernel is a matrix of coefficients, which may also be referred to as weights, parameters, elements or values. In a preferred embodiment, the convolution kernel 2402 is static data of the framework program executed by the processor 100.
The data array 2404 is a two-dimensional array of data values, and each data value (e.g., an image pixel value) has the size of a text of the data random access memory 122 or the weight random access memory 124 (e.g., 16 bits or 8 bits). In this example, the data values are 16-bit texts and the neural network unit 121 is configured with 512 wide-configuration neural processing units 126. In addition, in this embodiment the neural processing units 126 include multitask buffers for receiving the weight texts 206 from the weight random access memory 124, such as the multitask buffer 705 of Figure 7, in order to perform a collective rotator operation on a column of data values received from the weight random access memory 124; this is described in more detail below. In this example, the data array 2404 is a pixel array of 2560 by 1600. As shown in the figure, when the framework program convolves the data array 2404 with the convolution kernel 2402, the data array 2404 is divided into 20 data blocks, each of which is a 512x400 data matrix 2406.
In this example, the convolution kernel 2402 is a 3x3 array of coefficients, weights, parameters or elements. The first row of coefficients is denoted C0,0; C0,1; and C0,2; the second row of coefficients is denoted C1,0; C1,1; and C1,2; and the third row of coefficients is denoted C2,0; C2,1; and C2,2. For example, a convolution kernel with the coefficients 0, 1, 0, 1, -4, 1, 0, 1, 0 may be used to perform edge detection. In another embodiment, a convolution kernel with the coefficients 1, 2, 1, 2, 4, 2, 1, 2, 1 may be used to perform a Gaussian blur operation. In this case a division is typically also performed on the final accumulated value, where the divisor is the sum of the absolute values of the elements of the convolution kernel 2402, which is 16 in this example. In another example, the divisor is the number of elements of the convolution kernel 2402. In yet another example, the divisor is a value used to compress the convolution result into a desired target range of values, determined from the element values of the convolution kernel 2402, the target range, and the range of the input values of the array being convolved.
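The three divisor choices just described can be illustrated with the following sketch (Python; the function and variable names are illustrative, not from the embodiment).

kernel = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]                  # the Gaussian-blur example from the text

# Divisor choice 1: sum of the absolute values of the kernel elements.
div_abs_sum = sum(abs(c) for row in kernel for c in row)        # 16 here

# Divisor choice 2: the number of kernel elements.
div_count = sum(1 for row in kernel for _ in row)               # 9 here

# Divisor choice 3: compress the worst-case accumulated value into a target range.
def divisor_for_range(kernel, input_max_abs, target_max):
    worst = sum(abs(c) for row in kernel for c in row) * input_max_abs
    return max(1, worst // target_max)

print(div_abs_sum, div_count, divisor_for_range(kernel, 255, 255))  # 16 9 16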
Referring now to Figures 24 and 25, where the details are shown, the framework program writes the coefficients of the convolution kernel 2402 into the data random access memory 122. In a preferred embodiment, all the texts of each of nine consecutive columns of the data random access memory 122 (nine being the number of elements in the convolution kernel 2402) are written with a different element of the convolution kernel 2402 in row-major order. That is, as shown in the figure, each text of one column is written with the first coefficient C0,0; the next column is written with the second coefficient C0,1; the next column with the third coefficient C0,2; the next column with the fourth coefficient C1,0; and so on, until each text of the ninth column is written with the ninth coefficient C2,2. To convolve the data matrices 2406 of the data blocks partitioned out of the data array 2404, the neural processing units 126 repeatedly read, in order, the nine columns of the data random access memory 122 holding the convolution kernel 2402 coefficients; this is described in more detail below, particularly with respect to Figure 26A.
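A sketch of this coefficient layout follows (Python, illustrative only): each of the nine consecutive data random access memory 122 columns is filled, across all of its texts, with one kernel element in the order C0,0 through C2,2.

def layout_kernel_columns(kernel, n_texts=512):
    # One data RAM column per kernel element, each column holding that single
    # coefficient replicated across all 512 texts, in the order C0,0 .. C2,2.
    columns = []
    for i in range(3):
        for j in range(3):
            columns.append([kernel[i][j]] * n_texts)
    return columns                       # nine columns to write into the data RAM

columns = layout_kernel_columns([[0, 1, 0], [1, -4, 1], [0, 1, 0]])  # edge-detect kernel
assert len(columns) == 9 and columns[4][123] == -4                   # fifth column holds C1,1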
Referring again to Figures 24 and 25, the framework program writes the values of a data matrix 2406 into the weight random access memory 124. When the neural network unit program performs the convolution, it writes the result array back to the weight random access memory 124. In a preferred embodiment, the framework program writes a first data matrix 2406 into the weight random access memory 124 and starts the neural network unit 121; while the neural network unit 121 is convolving the first data matrix 2406 with the convolution kernel 2402, the framework program writes a second data matrix 2406 into the weight random access memory 124, so that as soon as the neural network unit 121 completes the convolution of the first data matrix 2406 it can begin convolving the second data matrix 2406, as described in more detail below with respect to Figure 25. In this way the framework program alternates back and forth between the two regions of the weight random access memory 124 to keep the neural network unit 121 fully utilized. Accordingly, the example of Figure 24 shows a first data matrix 2406A, corresponding to the first data block and occupying columns 0 through 399 of the weight random access memory 124, and a second data matrix 2406B, corresponding to the second data block and occupying columns 500 through 899 of the weight random access memory 124. Furthermore, as shown in the figure, the neural network unit 121 writes the convolution results back to columns 900-1299 and columns 1300-1699 of the weight random access memory 124, and the framework program subsequently reads these results out of the weight random access memory 124. The data values of a data matrix 2406 loaded into the weight random access memory 124 are denoted "Dx,y", where "x" is the weight random access memory 124 column number and "y" is the text, or row, number of the weight random access memory. For example, the data text 511 in column 399 is denoted D399,511 in Figure 24 and is received by the multitask buffer 705 of neural processing unit 511.
Figure 25 is a flowchart showing the processor 100 of Figure 1 executing a framework program that uses the neural network unit 121 to convolve the convolution kernel 2402 with the data array 2404 of Figure 24. The flow begins at step 2502.
At step 2502, the processor 100, executing the framework program, writes the convolution kernel 2402 of Figure 24 into the data random access memory 122 in the manner shown and described with respect to Figure 24. In addition, the framework program initializes a variable N to the value 1; the variable N denotes the data block of the data array 2404 currently being processed by the neural network unit 121. The framework program also initializes a variable NUM_CHUNKS to the value 20. Flow then proceeds to step 2504.
At step 2504, as shown in Figure 24, the processor 100 writes the data matrix 2406 of data block 1 into the weight random access memory 124 (e.g., data matrix 2406A of data block 1). Flow then proceeds to step 2506.
At step 2506, the processor 100 writes the convolution program into the program memory 129 of the neural network unit 121 using MTNN instructions 1400 that specify a function 1432 for writing the program memory 129. The processor 100 then starts the neural network unit convolution program using an MTNN instruction 1400 that specifies a function 1432 for starting execution of the program. An example of the neural network unit convolution program is described in more detail with respect to Figure 26A. Flow then proceeds to decision step 2508.
At decision step 2508, the framework program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2512; otherwise, flow proceeds to step 2514.
At step 2512, as shown in Figure 24, the processor 100 writes the data matrix 2406 of data block N+1 into the weight random access memory 124 (e.g., data matrix 2406B of data block 2). Thus, while the neural network unit 121 is convolving the current data block, the framework program writes the data matrix 2406 of the next data block into the weight random access memory 124, so that once the convolution of the current data block is complete (i.e., written to the weight random access memory 124), the neural network unit 121 can immediately begin convolving the next data block.
At step 2514, the processor 100 determines whether the currently running neural network unit program (started at step 2506 in the case of data block 1, and at step 2518 in the case of data blocks 2 through 20) has completed execution. In a preferred embodiment, the processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of the neural network unit 121. In an alternative embodiment, the neural network unit 121 generates an interrupt to indicate that it has completed the convolution program. Flow then proceeds to decision step 2516.
At decision step 2516, the framework program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2518; otherwise, flow proceeds to step 2522.
At step 2518, the processor 100 updates the convolution program so that it will operate on data block N+1. More precisely, the processor 100 updates the weight random access memory 124 column value of the initialize neural processing unit instruction at address 0 to the first column of the corresponding data matrix 2406 (e.g., to column 0 for data matrix 2406A or column 500 for data matrix 2406B) and updates the output column (e.g., to 900 or 1300). The processor 100 then starts the updated neural network unit convolution program. Flow then proceeds to step 2522.
At step 2522, the processor 100 reads the results of the neural network unit convolution program for data block N from the weight random access memory 124. Flow then proceeds to decision step 2524.
At decision step 2524, the framework program determines whether the value of variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2526; otherwise, the flow ends.
At step 2526, the framework program increments N by one. Flow then returns to decision step 2508.
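The flow of Figure 25 can be summarized with the following condensed sketch (Python). The stub object merely stands in for the MTNN/MFNN-driven interaction with the neural network unit 121; its method names and the column numbers passed to it are illustrative, not a real programming interface.

NUM_CHUNKS = 20

class NNUStub:
    """Stands in for the neural network unit 121 as seen by the framework program."""
    def start(self, chunk_id, first_column, out_column):
        self.current = (chunk_id, first_column, out_column)   # MTNN 1400: start the program
    def wait_done(self):
        return self.current                                   # MFNN 1500: poll status register 127
    def read_results(self, chunk_id):
        return ("results", chunk_id)                          # MFNN 1500: read the weight RAM

def run_framework_program(nnu):
    results = []
    nnu.start(1, first_column=0, out_column=900)              # steps 2502-2506
    for n in range(1, NUM_CHUNKS + 1):
        # step 2512: while chunk n runs, chunk n+1 would be written into the
        # other region of the weight random access memory 124 (omitted here)
        nnu.wait_done()                                       # step 2514
        if n < NUM_CHUNKS:                                    # steps 2516-2518
            nnu.start(n + 1, first_column=(n % 2) * 500, out_column=900 + (n % 2) * 400)
        results.append(nnu.read_results(n))                   # step 2522
    return results

assert len(run_framework_program(NNUStub())) == NUM_CHUNKS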
Figure 26A is a program listing of a neural network unit program that convolves a data matrix 2406 with the convolution kernel 2402 of Figure 24 and writes the result back to the weight random access memory 124. The program loops a number of times through an instruction loop made up of the instructions at addresses 1 through 9. The initialize neural processing unit instruction at address 0 specifies the number of times each neural processing unit 126 executes the instruction loop; the loop count in the example of Figure 26A is 400, corresponding to the number of columns in a data matrix 2406 of Figure 24, and the loop instruction at the end of the loop (at address 10) decrements the current loop count and, if the result is nonzero, transfers control back to the top of the instruction loop (i.e., back to the instruction at address 1). The initialize neural processing unit instruction also clears the accumulator 202 to zero. In a preferred embodiment, the loop instruction at address 10 also clears the accumulator 202 to zero. Alternatively, as mentioned above, the multiply-accumulate instruction at address 1 can also clear the accumulator 202 to zero.
For each execution of the instruction loop, the 512 neural processing units 126 simultaneously perform 512 convolutions of the 3x3 convolution kernel with 512 corresponding 3x3 submatrices of the data matrix 2406. The convolution is the sum of the nine products of the elements of the convolution kernel 2402 and the corresponding elements of the corresponding submatrix. In the embodiment of Figure 26A, the origin (central element) of each of the 512 corresponding 3x3 submatrices is the data text Dx+1,y+1 of Figure 24, where y (the row number) is the neural processing unit 126 number and x (the column number) is the weight random access memory 124 column number currently read by the multiply-accumulate instruction at address 1 of the program of Figure 26A (this column number is initialized by the initialize neural processing unit instruction at address 0, incremented when the multiply-accumulate instructions at addresses 3 and 5 are executed, and updated by the decrement instruction at address 9). Thus, for each pass of the loop, the 512 neural processing units 126 compute 512 convolutions and write the 512 convolution results back to the specified column of the weight random access memory 124. Edge handling is omitted here to simplify the description, although it should be noted that using the collective rotation feature of the neural processing units 126 causes a wrap, for two of the rows of data of the data matrix 2406 (the image data matrix, in the case of image processing), from one vertical edge to the other (e.g., from the left edge to the right edge, or vice versa). The instruction loop is now described.
Address 1 holds a multiply-accumulate instruction that specifies column 0 of the data random access memory 122 and implicitly uses the current weight random access memory 124 column, which is preferably held in the sequencer 128 (and is initialized to zero by the instruction at address 0 for the first pass through the instruction loop). That is, the instruction at address 1 causes each neural processing unit 126 to read its corresponding text from column 0 of the data random access memory 122, read its corresponding text from the current weight random access memory 124 column, and perform a multiply-accumulate operation on the two texts. Thus, for example, neural processing unit 5 multiplies C0,0 by Dx,5 (where "x" is the current weight random access memory 124 column), adds the result to the accumulator 202 value 217, and writes the sum back to the accumulator 202.
Address 2 holds a multiply-accumulate instruction that specifies that the data random access memory 122 column be incremented (i.e., to 1) and that the column then be read from the incremented address of the data random access memory 122. The instruction also specifies that the value in the multitask buffer 705 of each neural processing unit 126 be rotated to the adjacent neural processing unit 126, which in this example is the column of data matrix 2406 values read from the weight random access memory 124 in response to the instruction at address 1. In the embodiment of Figures 24 through 26, the neural processing units 126 rotate the multitask buffer 705 values to the left, i.e., from neural processing unit J toward neural processing unit J-1, rather than from neural processing unit J toward neural processing unit J+1 as in Figures 3, 7 and 19. It is worth noting that in an embodiment in which the neural processing units 126 rotate to the right, the framework program may write the convolution kernel 2402 coefficient values into the data random access memory 122 in a different order (e.g., rotated about its center column) in order to achieve a similar convolution result. Furthermore, where needed, the framework program may perform additional convolution kernel preprocessing (e.g., transposition). In addition, the instruction specifies a count value of 2. Therefore, the instruction at address 2 causes each neural processing unit 126 to read its corresponding text from column 1 of the data random access memory 122, receive the rotated text into the multitask buffer 705, and perform a multiply-accumulate operation on the two texts. Because the count value is 2, the instruction also causes each neural processing unit 126 to repeat this operation; that is, the sequencer 128 increments the data random access memory 122 column address 123 (i.e., to 2), and each neural processing unit 126 reads its corresponding text from column 2 of the data random access memory 122, receives the rotated text into the multitask buffer 705, and performs a multiply-accumulate operation on the two texts. Thus, for example, assuming the current weight random access memory 124 column is 27, after executing the instruction at address 2, neural processing unit 5 has accumulated into its accumulator 202 the product of C0,1 and D27,6 and the product of C0,2 and D27,7. After completing the instructions at addresses 1 and 2, the product of C0,0 and D27,5, the product of C0,1 and D27,6 and the product of C0,2 and D27,7 have thus been accumulated into the accumulator 202, along with the accumulated values from all the previous passes through the instruction loop.
The operations performed by the instructions at addresses 3 and 4 are similar to those of the instructions at addresses 1 and 2, except that, by virtue of the weight random access memory 124 column increment indicator, they operate on the next column of the weight random access memory 124, and they operate on the next three columns of the data random access memory 122, i.e., columns 3 through 5. That is, taking neural processing unit 5 as an example, after completing the instructions at addresses 1 through 4, the products of C0,0 and D27,5, C0,1 and D27,6, C0,2 and D27,7, C1,0 and D28,5, C1,1 and D28,6, and C1,2 and D28,7 have been accumulated into the accumulator 202, along with the accumulated values from all the previous passes through the instruction loop.
The operations performed by the instructions at addresses 5 and 6 are similar to those of the instructions at addresses 3 and 4, except that they operate on the next column of the weight random access memory 124 and the next three columns of the data random access memory 122, i.e., columns 6 through 8. That is, taking neural processing unit 5 as an example, after completing the instructions at addresses 1 through 6, the products of C0,0 and D27,5, C0,1 and D27,6, C0,2 and D27,7, C1,0 and D28,5, C1,1 and D28,6, C1,2 and D28,7, C2,0 and D29,5, C2,1 and D29,6, and C2,2 and D29,7 have been accumulated into the accumulator 202, along with the accumulated values from all the previous passes through the instruction loop. That is, after completing the instructions at addresses 1 through 6, and assuming the weight random access memory 124 column was 27 at the start of the instruction loop, neural processing unit 5, for example, has convolved the convolution kernel 2402 with the following 3x3 submatrix:
D27,5 D27,6 D27,7
D28,5 D28,6 D28,7
D29,5 D29,6 D29,7
More generally, after completing the instructions at addresses 1 through 6, each of the 512 neural processing units 126 has convolved the convolution kernel 2402 with the following 3x3 submatrix:
Dr,n Dr,n+1 Dr,n+2
Dr+1,n Dr+1,n+1 Dr+1,n+2
Dr+2,n Dr+2,n+1 Dr+2,n+2
where r is the column address value of the weight random access memory 124 at the start of the instruction loop, and n is the number of the neural processing unit 126.
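A pure-software reference model of the value each neural processing unit n has accumulated after the instructions at addresses 1 through 6 is sketched below (Python, illustrative only; the edge wrap-around caused by the collective rotation is ignored).

def conv3x3_at(D, C, r, n):
    # Accumulate C[i][j] * D[r + i][n + j] over the 3x3 neighbourhood, where r
    # is the weight RAM column at the start of the loop pass and n is the
    # neural processing unit number (texts n+1, n+2 arrive via rotation).
    acc = 0
    for i in range(3):
        for j in range(3):
            acc += C[i][j] * D[r + i][n + j]
    return acc

# Example: a 4x4 data patch convolved with the edge-detect kernel at r = 0, n = 0.
D = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
C = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
print(conv3x3_at(D, C, 0, 0))   # 2 + 5 - 24 + 7 + 10 = 0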
The instruction at address 7 passes the accumulator 202 value 217 through the run function unit 212. The pass-through function passes a text whose size (in bits) equals that of the texts read from the data random access memory 122 and the weight random access memory 124 (i.e., 16 bits in this example). In a preferred embodiment, the user may specify the output format, e.g., how many of the output bits are fraction (fractional) bits, as described in more detail below. Alternatively, rather than specifying a pass-through run function, the instruction may specify a division run function that divides the accumulator 202 value 217 by a divisor, as described with respect to Figures 29A and 30, e.g., using one of the "dividers" 3014/3016 of Figure 30. For example, for a convolution kernel 2402 whose coefficients are sixteenths, such as the Gaussian blur kernel mentioned above, the instruction at address 7 may specify a division run function (e.g., divide by 16) rather than a pass-through function. Alternatively, the framework program may divide the convolution kernel 2402 coefficients by 16 before writing them into the data random access memory 122 and adjust the location of the binary point of the convolution kernel 2402 values accordingly, e.g., using the data binary point 2922 of Figure 29 described below.
The instruction at address 8 writes the output of the run function unit 212 into the column of the weight random access memory 124 specified by the current value of the output column buffer. The current value is initialized by the instruction at address 0 and is incremented on each pass through the loop by the increment indicator in the instruction.
As the example of Figures 24 through 26 with a 3x3 convolution kernel 2402 illustrates, the neural processing units 126 read the weight random access memory 124 approximately once every three time-frequency periods to read a column of the data matrix 2406, and write the convolution result matrix to the weight random access memory 124 approximately once every twelve time-frequency periods. Furthermore, assuming an embodiment with write and read buffers such as the buffer 1704 of Figure 17, the processor 100 can read and write the weight random access memory 124 concurrently with the neural processing unit 126 reads and writes, with the buffer 1704 performing approximately one read and one write of the weight random access memory every sixteen time-frequency periods, to read a data matrix and to write a convolution result matrix, respectively. Thus approximately half the bandwidth of the weight random access memory 124 is consumed by the convolution operation performed by the neural network unit 121 in hybrid fashion. This example uses a 3x3 convolution kernel 2402, but the invention is not limited thereto; convolution kernels of other sizes, such as 2x2, 4x4, 5x5, 6x6, 7x7 or 8x8, are equally applicable with different neural network unit programs. With a larger convolution kernel, because the rotating versions of the multiply-accumulate instruction (such as the instructions at addresses 2, 4 and 6 of Figure 26A, which a larger convolution kernel would require) have a larger count value, the fraction of the time the neural processing units 126 spend reading the weight random access memory 124 decreases, and the fraction of the weight random access memory 124 bandwidth consumed therefore also decreases.
In addition, rather than writing the convolution results back to different columns of the weight random access memory 124 (e.g., columns 900-1299 and 1300-1699), the framework program can configure the neural network unit program to overwrite columns of the input data matrix 2406 that are no longer needed. For example, for a 3x3 convolution kernel, the framework program can write the data matrix 2406 into columns 2-401 of the weight random access memory 124 rather than columns 0-399, and the neural network unit program can then write the convolution results starting at column 0 of the weight random access memory 124, incrementing the column number on each pass through the instruction loop. In this way the neural network unit program overwrites only columns that are no longer needed. For example, after the first pass through the instruction loop (or, more precisely, after executing the instruction at address 1, which loads column 0 of the weight random access memory 124), the data of column 0 can be overwritten, but the data of columns 1-3 are needed for the second pass through the instruction loop and must not be overwritten; likewise, after the second pass through the instruction loop, the data of column 1 can be overwritten, but the data of columns 2-4 are needed for the third pass through the instruction loop and must not be overwritten; and so on. In this embodiment, the height of each data matrix 2406 (data block) can be increased (e.g., to 800 columns), so fewer data blocks are needed.
Alternatively, rather than writing the convolution results back to the weight random access memory 124, the framework program can configure the neural network unit program to write the convolution results back to columns of the data random access memory 122 above the convolution kernel 2402 (e.g., above column 8), and the framework program reads the results from the data random access memory 122 as the neural network unit 121 writes them (e.g., using the most-recently-written data random access memory 122 column address field 2606 of Figure 26B). This arrangement is suitable for embodiments with a single-port weight random access memory 124 and a dual-port data random access memory 122.
From the operation of the neural network unit 121 in the embodiment of Figures 24 through 26A it can be seen that each execution of the program of Figure 26A takes approximately 5000 time-frequency periods, so the convolution of the entire 2560x1600 data array 2404 of Figure 24 takes approximately 100,000 time-frequency periods, considerably fewer than are needed to perform the same task by conventional means.
Figure 26B is a block schematic diagram showing an embodiment of certain fields of the control and status register 127 of the neural network unit 121 of Figure 1. The status register 127 includes: a field 2602 indicating the address of the column of the weight random access memory 124 most recently written by the neural processing units 126; a field 2606 indicating the address of the column of the data random access memory 122 most recently written by the neural processing units 126; a field 2604 indicating the address of the column of the weight random access memory 124 most recently read by the neural processing units 126; and a field 2608 indicating the address of the column of the data random access memory 122 most recently read by the neural processing units 126. This enables the framework program executing on the processor 100 to determine the progress of the neural network unit 121 as it reads and/or writes data in the data random access memory 122 and/or the weight random access memory 124. Using this capability, together with the choice of overwriting the input data matrix (or of writing the results to the data random access memory 122) described above, the data array 2404 of Figure 24 can, as in the following example, be processed as 5 data blocks of 512x1600 rather than 20 data blocks of 512x400. The processor 100 writes the first 512x1600 data block into the weight random access memory 124 starting at column 2 and starts the neural network unit program (which has a loop count of 1600 and initializes the weight random access memory 124 output column to 0). As the neural network unit 121 executes the neural network unit program, the processor 100 monitors the output position/address of the weight random access memory 124 in order to (1) read (using MFNN instructions 1500) the columns of the weight random access memory 124 that contain valid convolution results written by the neural network unit 121 (beginning at column 0), and (2) overwrite the valid convolution results that have already been read with the second 512x1600 data matrix 2406 (starting at column 2), so that when the neural network unit 121 completes the neural network unit program for the first 512x1600 data block, the processor 100 can immediately update the neural network unit program if necessary and start it again to process the second 512x1600 data block. This procedure is repeated three more times for the remaining three 512x1600 data blocks, keeping the neural network unit 121 fully utilized.
In one embodiment, the run function unit 212 has the ability to perform an efficient effective division of the accumulator 202 value 217; this is described in more detail below, particularly with respect to Figures 29A, 29B and 30. For example, a run function neural network unit instruction that divides the accumulator 202 value by 16 can be used for the Gaussian blur matrix mentioned above.
The convolution kernel 2402 used in the example of Figure 24 is a small static convolution kernel applied to the entire data array 2404; however, the invention is not limited thereto, and the convolution kernel may also be a large matrix with weights specific to the different data values of the data array 2404, as is common with convolution kernels in convolutional neural networks. When the neural network unit 121 is used in this manner, the framework program may swap the locations of the data matrix and the convolution kernel, i.e., place the data matrix in the data random access memory 122 and the convolution kernel in the weight random access memory 124, and the number of columns processed by a given execution of the neural network unit program may be correspondingly smaller.
Figure 27 is a block schematic diagram showing an example of the weight random access memory 124 of Fig. 1 filled with input data on which the neural network unit 121 of Fig. 1 performs a common source operation (pooling operation). A common source operation is performed by a common source (pooling) layer of an artificial neural network, which takes subregions, or submatrices, of an input matrix and computes the maximum value or average value of each submatrix as a result matrix, i.e., a common source matrix, in order to reduce the size (dimensions) of the input data matrix (such as an image, or an image after convolution). In the examples of Figure 27 and Figure 28, the common source operation computes the maximum value of each submatrix. Common source operations are particularly useful for artificial neural networks that perform, for example, object classification or detection. In general, a common source operation reduces the number of elements of the input matrix by the factor of the number of elements of the examined submatrix; in particular, it reduces each dimension of the input matrix by the number of elements in the corresponding dimension of the submatrix. In the example of Figure 27, the input data is a 512x1600 matrix of wide texts (e.g., 16 bits) stored in columns 0 to 1599 of the weight random access memory 124. In Figure 27, the texts are labeled with their column and row locations; e.g., the text at column 0, row 0 is denoted D0,0; the text at column 0, row 1 is denoted D0,1; the text at column 0, row 2 is denoted D0,2; and so on, such that the text at column 0, row 511 is denoted D0,511. Similarly, the text at column 1, row 0 is denoted D1,0; the text at column 1, row 1 is denoted D1,1; the text at column 1, row 2 is denoted D1,2; and so on, such that the text at column 1, row 511 is denoted D1,511; and so on through column 1599, whose text at row 0 is denoted D1599,0, whose text at row 1 is denoted D1599,1, whose text at row 2 is denoted D1599,2, and so on, such that the text at column 1599, row 511 is denoted D1599,511.
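To make that addressing concrete, the following is a minimal sketch in C (the array and function names are hypothetical illustrations, not part of the described hardware) of how a text Dx,y of Figure 27 maps onto weight RAM 124 column x and row/neural-processing-unit index y.
```c
#include <stdint.h>

#define NPUS      512            /* rows 0..511, one per neural processing unit  */
#define DATA_COLS 1600           /* weight RAM 124 columns 0..1599 of input data */

/* Hypothetical software model of the relevant region of weight RAM 124:
 * wram[x][y] holds the wide (e.g., 16-bit) text Dx,y of Figure 27. */
static int16_t wram[DATA_COLS][NPUS];

/* The text at weight RAM column x, row y, exactly as labeled in Figure 27. */
static int16_t D(int x, int y) { return wram[x][y]; }
```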
Figure 28 is a program listing of a neural network unit program that performs the common source operation on the input data matrix of Figure 27 and writes the result back to the weight random access memory 124. In the example of Figure 28, the common source operation computes the maximum value of each 4x4 submatrix of the input data matrix. The program executes the instruction loop made up of instructions 1 through 10 a number of times. The initialize neural processing unit instruction at address 0 specifies the number of times each neural processing unit 126 executes the instruction loop, which in the example of Figure 28 has a loop count value of 400, and the loop instruction at the end of the loop (at address 11) decrements the current loop count value and, if the result is nonzero, branches back to the top of the instruction loop (i.e., to the instruction at address 1). The neural network unit program effectively treats the input data matrix in the weight random access memory 124 as 400 mutually exclusive groups of four adjacent columns, namely columns 0-3, columns 4-7, columns 8-11, and so on, through columns 1596-1599. Each group of four adjacent columns includes 128 4x4 submatrices, namely the 4x4 submatrices formed at the intersections of the four columns of the group and four adjacent rows, namely rows 0-3, rows 4-7, rows 8-11, and so on, through rows 508-511. Of the 512 neural processing units 126, every fourth neural processing unit 126 (128 in total) performs a common source operation on its corresponding 4x4 submatrix, while the other three out of every four neural processing units 126 are unused. More precisely, neural processing units 0, 4, 8, and so on through neural processing unit 508, each perform a common source operation on their respective 4x4 submatrix, whose leftmost row number corresponds to the neural processing unit's number and whose lower column corresponds to the current weight random access memory 124 column value, which is initialized to zero by the initialization instruction at address 0 and incremented by 4 upon each repetition of the instruction loop, as described in more detail below. The 400 repetitions of the instruction loop correspond to the number of 4x4 submatrix groups in the input data matrix of Figure 27 (i.e., the 1600 columns of the input data matrix divided by 4). The initialize neural processing unit instruction also clears the accumulator 202 to zero. Preferably, the loop instruction at address 11 also clears the accumulator 202 to zero. Alternatively, the maxwacc instruction at address 1 may specify that the accumulator 202 be cleared to zero.
On each execution of the instruction loop of the program, the 128 neural processing units 126 in use simultaneously perform 128 common source operations on the 128 respective 4x4 submatrices of the current four-column group of the input data matrix. More specifically, the common source operation determines the maximum-valued element among the 16 elements of the 4x4 submatrix. In the embodiment of Figure 28, for each of the 128 neural processing units 126 in use, denoted neural processing unit y, the lower-left element of its 4x4 submatrix is element Dx,y of Figure 27, where x is the column number of the current weight random access memory 124 column at the start of the instruction loop, whose data is read by the maxwacc instruction at address 1 of the program of Figure 28 (this column number is also initialized by the initialize neural processing unit instruction at address 0 and incremented each time the maxwacc instructions at addresses 3, 5 and 7 are executed). Thus, for each iteration of the loop, the 128 neural processing units 126 in use write back the maximum-valued elements of the corresponding 128 4x4 submatrices of the current column group to the specified column of the weight random access memory 124. The instruction loop is described below.
The maxwacc instruction at address 1 implicitly uses the current weight random access memory 124 column, whose value is preferably held in the sequencer 128 (and which is initialized to zero by the instruction at address 0 for the first pass through the instruction loop). The instruction at address 1 causes each neural processing unit 126 to read its corresponding text from the current column of the weight random access memory 124, compare that text with the accumulator 202 value 217, and store the maximum of the two values in the accumulator 202. Thus, for example, neural processing unit 8 determines the maximum of the accumulator 202 value 217 and data text Dx,8 (where "x" is the current weight random access memory 124 column) and writes it back to the accumulator 202.
At address 2 is a maxwacc instruction that specifies rotating the values in the multitask buffer 705 of each neural processing unit 126 to the adjacent neural processing unit 126; these are the values of the input data column just read from the weight random access memory 124 in response to the instruction at address 1. In the embodiment of Figures 27 and 28, the neural processing units 126 rotate the multiplexer 705 values to the left, namely from neural processing unit J to neural processing unit J-1, as described above in the sections corresponding to Figures 24 through 26. Additionally, the instruction specifies a count value of 3. Thus, the instruction at address 2 causes each neural processing unit 126 to receive the rotated text into its multitask buffer 705, determine the maximum of the rotated text and the accumulator 202 value, and then repeat this operation two more times. That is, each neural processing unit 126 performs three times the operation of receiving the rotated text into its multitask buffer 705 and determining the maximum of the rotated text and the accumulator 202 value. Thus, for example, assuming the current weight random access memory 124 column is 36 at the start of the instruction loop, and taking neural processing unit 8 as an example, after executing the instructions at addresses 1 and 2, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the start of the loop and the four weight random access memory 124 texts D36,8, D36,9, D36,10 and D36,11.
The operations performed by the maxwacc instructions at addresses 3 and 4 are similar to those of the instructions at addresses 1 and 2, except that, by virtue of the weight random access memory 124 column increment indicator, they operate on the next column of the weight random access memory 124. That is, assuming the current weight random access memory 124 column at the start of the instruction loop is 36, and taking neural processing unit 8 as an example, after completing the instructions at addresses 1 through 4, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the start of the loop and the eight weight random access memory 124 texts D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10 and D37,11.
The operations performed by the maxwacc instructions at addresses 5 through 8 are similar to those of the instructions at addresses 1 through 4, except that they operate on the next two columns of the weight random access memory 124. That is, assuming the current weight random access memory 124 column at the start of the instruction loop is 36, and taking neural processing unit 8 as an example, after completing the instructions at addresses 1 through 8, neural processing unit 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the start of the loop and the sixteen weight random access memory 124 texts D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10, D37,11, D38,8, D38,9, D38,10, D38,11, D39,8, D39,9, D39,10 and D39,11. That is, assuming the current weight random access memory 124 column at the start of the instruction loop is 36, and taking neural processing unit 8 as an example, after completing the instructions at addresses 1 through 8, neural processing unit 8 will have determined the maximum value of the following 4x4 submatrix:
D36,8 D36,9 D36,10 D36,11
D37,8 D37,9 D37,10 D37,11
D38,8 D38,9 D38,10 D38,11
D39,8 D39,9 D39,10 D39,11
More generally, after completing the instructions at addresses 1 through 8, each of the 128 neural processing units 126 in use will have determined the maximum value of the following 4x4 submatrix:
Dr, n Dr, n+1 Dr, n+2 Dr, n+3
Dr+1, n Dr+1, n+1 Dr+1, n+2 Dr+1, n+3
Dr+2, n Dr+2, n+1 Dr+2, n+2 Dr+2, n+3
Dr+3, n Dr+3, n+1 Dr+3, n+2 Dr+3, n+3
where r is the column address value of the current weight random access memory 124 at the start of the instruction loop and n is the number of the neural processing unit 126.
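As a software cross-check of the behavior just described, the following hedged C reference model (reusing the hypothetical wram array from the earlier sketch; it is not the hardware itself) computes the value that an in-use neural processing unit n accumulates over one pass of the loop beginning at weight RAM column r.
```c
#include <stdint.h>

extern int16_t wram[1600][512];   /* hypothetical model: wram[x][y] holds Dx,y */

/* Maximum of the 4x4 submatrix whose lower-left element is D(r, n), mirroring
 * the maxwacc instructions at addresses 1 through 8 of Figure 28. */
static int16_t pool4x4_max(int r, int n)
{
    int32_t acc = 0;              /* accumulator 202, cleared to zero at loop start */
    for (int col = r; col < r + 4; col++)        /* columns r .. r+3 (addresses 1-8) */
        for (int row = n; row < n + 4; row++)    /* rows n .. n+3 (the rotations)    */
            if (wram[col][row] > acc)
                acc = wram[col][row];
    return (int16_t)acc;          /* written back by the address-10 instruction      */
}
```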
The instruction at address 9 passes the accumulator 202 value 217 through the run function unit 212. The pass-through function passes a text whose size (in bits) is equal to that of the texts read from the weight random access memory 124 (in this example, 16 bits). In a preferred embodiment, the user may specify the output format, e.g., how many of the output bits are fractional (decimal) bits, as described in more detail below.
The instruction at address 10 writes the accumulator 202 value 217 to the column of the weight random access memory 124 specified by the current value of the output column buffer, which is initialized by the instruction at address 0 and incremented after each pass through the loop by virtue of the increment indicator in the instruction. More specifically, the instruction at address 10 writes one wide text (e.g., 16 bits) of the accumulator 202 to the weight random access memory 124. In a preferred embodiment, the instruction writes the 16 bits as specified by the output binary point 2916, as described in more detail below with respect to Figures 29A and 29B.
As noted above, each iteration of the instruction loop writes back a column of the weight random access memory 124 that contains holes, i.e., invalid values. That is, wide texts 1 to 3, 5 to 7, 9 to 11, and so on through wide texts 509 to 511 of the result 133 are invalid, or unused. In one embodiment, the run function unit 212 includes a multiplexer that enables merging of results into adjacent texts of a result column buffer, such as the column buffer 1104 of Figure 11, for writing back to the output column of the weight random access memory 124. In a preferred embodiment, the run function instruction specifies the number of texts in each hole, and this number controls how the multiplexer merges the results. In one embodiment, the number of holes may be specified as a value of 2 to 6 in order to merge the outputs of common source operations on 3x3, 4x4, 5x5, 6x6 or 7x7 submatrices. Alternatively, the framework program executing on the processor 100 reads the resulting sparse (i.e., containing holes) result columns from the weight random access memory 124 and performs the merging function using other execution units 112, such as a media unit executing framework merge instructions, e.g., x86 single-instruction multiple-data (SIMD) streaming extensions (SSE) instructions. In a manner similar to the concurrent approaches described above, and taking advantage of the hybrid nature of the neural network unit 121, the framework program executing on the processor 100 may read the status buffer 127 to monitor the most recently written column of the weight random access memory 124 (e.g., field 2602 of Figure 26B), read a generated sparse result column, merge it, and write it back to the same column of the weight random access memory 124, so that it is ready to be used as an input data matrix for the next layer of the neural network, such as a convolutional layer or a traditional neural network layer (i.e., a multiply-accumulate layer). Furthermore, although the embodiment described herein performs the common source operation on 4x4 submatrices, the present invention is not limited thereto, and the neural network unit program of Figure 28 may be adapted to perform common source operations on submatrices of other sizes, such as 3x3, 5x5, 6x6 or 7x7.
As can be seen from the foregoing, the number of result columns written to the weight random access memory 124 is one quarter of the number of columns of the input data matrix. Finally, in this example the data random access memory 122 is not used. However, the data random access memory 122, rather than the weight random access memory 124, may alternatively be used to perform the common source operation.
In the embodiment of Figures 27 and 28, the common source operation computes the maximum value of each subregion. However, the program of Figure 28 may be adapted to compute the average value of each subregion, for example by replacing the maxwacc instructions with sumwacc instructions (which sum the weight text with the accumulator 202 value 217) and modifying the run function instruction at address 9 to divide the accumulated result by the number of elements of each subregion (preferably via reciprocal multiplication, as described below), which is 16 in this example.
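The average-pooling variant just described can be sketched the same way (a hedged C reference model, again using the hypothetical wram array; the division by the 16 elements is written as the reciprocal multiplication the text prefers):
```c
#include <stdint.h>

extern int16_t wram[1600][512];   /* hypothetical model: wram[x][y] holds Dx,y */

/* Average of the 4x4 submatrix with lower-left element D(r, n), as the
 * sumwacc / modified run-function variant of Figure 28 would compute it. */
static int16_t pool4x4_avg(int r, int n)
{
    int32_t acc = 0;                              /* accumulator 202, cleared to zero    */
    for (int col = r; col < r + 4; col++)
        for (int row = n; row < n + 4; row++)
            acc += wram[col][row];                /* sumwacc: weight text + accumulator  */

    /* Divide by the 16 elements via a fixed-point reciprocal:
     * 1/16 with 16 fractional bits is 65536/16 = 4096. */
    return (int16_t)(((int64_t)acc * 4096) >> 16);
}
```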
From the operation of the neural network unit 121 according to Figures 27 and 28, it can be seen that each execution of the program of Figure 28 takes approximately 6000 time-frequency periods to perform a common source operation on the entire 512x1600 data matrix of Figure 27, which is considerably fewer time-frequency periods than conventional approaches require for a similar task.
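As a rough sanity check of that figure, under the assumption that each max-accumulate step costs on the order of one time-frequency period: each loop iteration performs 16 max-accumulate steps at addresses 1 through 8 (four column reads, each followed by a rotate with a count of 3), and the loop runs 400 times, so the max-accumulate work alone is about 400 x 16 = 6400 time-frequency periods, of the same order as the approximately 6000 periods stated above.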
Alternatively, the framework program may have the neural network unit program write the results of the common source operation back to columns of the data random access memory 122, rather than back to the weight random access memory 124, in which case the framework program reads the results from the data random access memory 122 as the neural network unit 121 writes them there (e.g., using the data random access memory 122 most recently written column 2606 address of Figure 26B). This configuration is applicable to an embodiment with a single-ported weight random access memory 124 and a dual-ported data random access memory 122.
Fixed-point arithmetic with user-supplied binary points, full-precision fixed-point accumulation, user-specified reciprocal value, random rounding of the accumulator value, and selectable run/output functions
In general, the hardware units that perform arithmetic operations in a digital computing system are commonly divided into "integer" units and "floating-point" units, according to whether the objects of the arithmetic operations they perform are integers or floating-point numbers. A floating-point number has a magnitude (or mantissa) and an exponent, and typically also a sign. The exponent is an indication of the location of the radix point (typically the binary point) relative to the magnitude. In contrast, an integer has no exponent, only a magnitude, and typically also a sign. A floating-point unit allows a programmer to work with numbers taken from a very large range of different values, and the hardware takes care of adjusting the exponent values of the numbers as needed, without the programmer having to do so. For example, assume the two floating-point numbers 0.111 x 10^29 and 0.81 x 10^31 are multiplied. (Although floating-point units typically operate on base-2 floating-point numbers, decimal, i.e., base-10, floating-point numbers are used in this example.) The floating-point unit automatically takes care of multiplying the mantissas, adding the exponents, and then normalizing the result to the value .8991 x 10^59. As another example, assume the same two floating-point numbers are added. The floating-point unit automatically takes care of aligning the binary points of the mantissas before the addition to generate a sum with the value .81111 x 10^31.
However, it is well known that such complexity leads to an increase in the size of the floating-point unit, increased power consumption, an increased number of time-frequency periods per instruction and/or lengthened cycle times. For this reason, many devices (e.g., embedded processors, microcontrollers and relatively low-cost and/or low-power microprocessors) do not include a floating-point unit. As may be observed from the examples above, the complex structure of a floating-point unit includes logic that performs the exponent calculations associated with floating-point addition and multiplication/division (i.e., adders that perform add/subtract operations on the exponents of the operands to produce the resulting exponent value of a floating-point multiplication/division, and subtractors that subtract the operand exponents to determine the binary point alignment shift amount for a floating-point addition), shifters that accomplish the binary point alignment of the mantissas for floating-point addition, and shifters that normalize floating-point results. Additional logic is typically also required to perform rounding of floating-point results, to convert between integer data formats and floating-point formats and between different floating-point formats (e.g., extended precision, double precision, single precision, half precision), together with leading-zero and leading-one detectors and logic to handle special floating-point numbers, such as denormal numbers, not-a-number values and infinity.
Additionally, correctness verification of a floating-point unit greatly increases in complexity because of the larger numerical space that must be verified in the design, which can lengthen the product development cycle and time to market. Furthermore, as described above, floating-point arithmetic implies the separate storage and use of a mantissa field and an exponent field for each floating-point number involved in the computation, which can increase the amount of storage required and/or reduce precision given an equal amount of storage used to store integers. Many of these disadvantages are avoided by performing arithmetic operations with integer units.
Programmers frequently need to write programs that process fractional values, i.e., numbers that are not whole numbers. Such programs may need to run on processors that do not have a floating-point unit or, even if the processor has one, the integer instructions executed by the processor's integer units may be faster. To take advantage of the performance of integer processing, the programmer employs what is commonly known as fixed-point arithmetic on fixed-point numbers. Such a program includes instructions that execute on integer units to process integer data. The software knows that the data is fractional and includes instructions that perform operations on the integer data to deal with the fact that the data is actually fractional, e.g., alignment shifts. Essentially, the fixed-point software manually performs some or all of the functions that a floating-point unit performs.
As used herein, a "fixed-point" number (or value or operand or input or output) is a number whose bits of storage are understood to include bits that represent a fractional portion of the number, referred to herein as "fractional bits." The bits of storage of the fixed-point number are contained in a memory or buffer, e.g., as an 8-bit or 16-bit text in a memory or buffer. Furthermore, the bits of storage of the fixed-point number are all used to express a magnitude, and in some cases one of the bits is used to express a sign, but none of the bits of storage of the fixed-point number are used to express an exponent of the number. Furthermore, the number of fractional bits, or binary point location, of the fixed-point number is specified in storage that is distinct from the bits of storage of the fixed-point number, and it indicates, in a shared or global fashion, the number of fractional bits, or binary point location, for a set of fixed-point numbers to which the fixed-point number belongs, such as the set of input operands, accumulated values or output results of an array of processing units.
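As an illustration of this convention (a hedged C sketch with hypothetical names; the hardware holds the shared indicator in a control field, not in a struct), a whole set of 16-bit fixed-point texts can share one fractional-bit count kept separately from the texts themselves:
```c
#include <stdint.h>
#include <stdio.h>

/* One fractional-bit count shared by an entire set of fixed-point texts,
 * stored separately from the texts themselves. */
struct fx_set {
    int      frac_bits;   /* shared binary point location */
    int16_t *words;       /* stored bits: magnitude and (optionally) sign only */
    int      count;
};

/* Interpret one stored text of the set as a real value. */
static double fx_to_double(const struct fx_set *s, int i)
{
    return (double)s->words[i] / (double)(1 << s->frac_bits);
}

int main(void)
{
    int16_t data[2] = { 0x0050, (int16_t)0xFFB0 };            /* raw stored bits */
    struct fx_set set = { .frac_bits = 5, .words = data, .count = 2 };
    for (int i = 0; i < set.count; i++)
        printf("word %d = %f\n", i, fx_to_double(&set, i));   /* 2.5 and -2.5    */
    return 0;
}
```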
In the embodiments described herein, the arithmetic logic units are integer units, but the run function units include floating-point arithmetic hardware assist, or acceleration. This allows the arithmetic logic unit portions to be smaller and faster, which facilitates the use of more arithmetic logic units on a given die space. This implies more neurons per unit of die space, which is particularly advantageous in a neural network unit.
Furthermore, in contrast to floating-point numbers, which require exponent storage bits for each number, the fixed-point numbers of the embodiments described herein are expressed with an indicator of the number of bits of storage that are fractional bits for an entire set of numbers; however, the indicator resides in a single, shared storage that globally indicates the number of fractional bits for all the numbers of the entire set, e.g., the set of inputs to a series of operations, the set of accumulated values of the series of operations, or the set of outputs. Preferably, the user of the neural network unit is able to specify the number of fractional storage bits for the set of numbers. Thus, it should be understood that although in many contexts (e.g., in mathematics) an "integer" refers to a signed whole number, i.e., a number without a fractional portion, an "integer" in the present context may refer to a number that has a fractional portion. Furthermore, an "integer" in the present context is intended to distinguish from floating-point numbers, for which a portion of the bits of their individual storage is used to express the exponent of the floating-point number. Similarly, an integer arithmetic operation, such as an integer multiply, add or compare performed by an integer unit, assumes that the operands do not have an exponent; therefore, the integer elements of an integer unit, e.g., an integer multiplier, integer adder or integer comparator, do not include logic to deal with exponents, e.g., they do not shift mantissas to align binary points for addition or comparison operations, and they do not add exponents for multiply operations.
Additionally, the embodiments described herein include a large hardware integer accumulator that accumulates a large series of integer operations (e.g., on the order of 1000 multiply-accumulate operations) without loss of precision. This enables the neural network unit to avoid dealing with floating-point numbers while at the same time retaining full precision in the accumulated values, without saturating them or producing inaccurate results due to overflows. Once the series of integer operations has accumulated a result into the full-precision accumulator, the fixed-point hardware assist performs the necessary scaling and saturation to convert the full-precision accumulated value to an output value, using the user-specified indications of the number of fractional bits of the accumulated value and of the desired number of fractional bits of the output value, as described in more detail below.
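The overall flow can be sketched as follows (a simplified C model under assumed 16-bit texts and a 64-bit software accumulator; it illustrates full-precision integer accumulation followed by user-directed scaling and saturation, not the actual datapath, and it assumes an arithmetic right shift of signed values):
```c
#include <stdint.h>

/* Full-precision accumulation of n signed products, then conversion to a
 * 16-bit output with a user-chosen number of fractional bits. */
static int16_t mac_and_convert(const int16_t *data, const int16_t *weight, int n,
                               int acc_frac_bits,   /* data frac + weight frac   */
                               int out_frac_bits)   /* user-specified for output */
{
    int64_t acc = 0;                           /* wide enough: no overflow, no rounding */
    for (int i = 0; i < n; i++)
        acc += (int32_t)data[i] * (int32_t)weight[i];

    /* Scale to the output binary point, then saturate to 16 bits. */
    int shift = acc_frac_bits - out_frac_bits;
    int64_t scaled = (shift >= 0) ? (acc >> shift) : (acc << -shift);
    if (scaled > INT16_MAX) scaled = INT16_MAX;
    if (scaled < INT16_MIN) scaled = INT16_MIN;
    return (int16_t)scaled;
}
```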
When the accumulated value needs to be compressed from its full-precision form for use as an input to a run function, or to be passed through, preferably the run function unit selectively performs a random rounding operation on the accumulated value, as described in more detail below. Finally, the neural processing units may selectively accept direction to apply different run functions and/or to output any of a number of different forms of the accumulated value, as dictated by the differing requirements of a given layer of the neural network.
Figure 29A is a block schematic diagram showing an embodiment of the control buffer 127 of Fig. 1. The control buffer 127 may include a plurality of control buffers 127. As shown, the control buffer 127 includes the following fields: configuration 2902, signed data 2912, signed weight 2914, data binary point 2922, weight binary point 2924, arithmetic logic unit function 2926, rounding control 2932, run function 2934, reciprocal 2942, offset 2944, output random access memory 2952, output binary point 2954 and output order 2956. The control buffer 127 values may be written both by an MTNN instruction 1400 and by an instruction of an NNU program, such as an initiate instruction.
The configuration 2902 value specifies whether the neural network unit 121 is in a narrow configuration, a wide configuration or a funnel configuration, as described above. The configuration 2902 also implies the size of the input texts received from the data random access memory 122 and the weight random access memory 124. In the narrow and funnel configurations the size of the input texts is narrow (e.g., 8 or 9 bits), whereas in the wide configuration the size of the input texts is wide (e.g., 12 or 16 bits). The configuration 2902 also implies the size of the output result 133, which is the same as the input text size.
If the signed data value 2912 is true, the data texts received from the data random access memory 122 are signed values; if false, they are unsigned values. If the signed weight value 2914 is true, the weight texts received from the weight random access memory 124 are signed values; if false, they are unsigned values.
The data binary point 2922 value indicates the binary point location for the data texts received from the data random access memory 122. Preferably, the data binary point 2922 value indicates the number of bit positions from the right side at which the binary point lies. In other words, the data binary point 2922 indicates how many of the least significant bits of a data text are fractional bits, i.e., lie to the right of the binary point. Similarly, the weight binary point 2924 value indicates the binary point location for the weight texts received from the weight random access memory 124. Preferably, when the arithmetic logic unit function 2926 is a multiply-accumulate or an output-accumulate, the neural processing unit 126 determines the number of bits to the right of the binary point of the value held in the accumulator 202 as the sum of the data binary point 2922 and the weight binary point 2924. Thus, for example, if the data binary point 2922 value is 5 and the weight binary point 2924 value is 3, the value in the accumulator 202 has 8 bits to the right of its binary point. When the arithmetic logic unit function 2926 is a sum/maximum of accumulator and data/weight text, or a pass-through of data/weight text, the neural processing unit 126 determines the number of bits to the right of the binary point of the value held in the accumulator 202 as the data/weight binary point 2922/2924, respectively. In another embodiment, described below with respect to Figure 29B, a single accumulator binary point 2923 is specified instead of specifying an individual data binary point 2922 and weight binary point 2924.
The arithmetic logic unit function 2926 specifies the function performed by the arithmetic logic unit 204 of the neural processing unit 126. As described above, the arithmetic logic unit functions 2926 may include, but are not limited to: multiply the data text 209 and the weight text 203 and accumulate the product with the accumulator 202; sum the accumulator 202 and the weight text 203; sum the accumulator 202 and the data text 209; maximum of the accumulator 202 and the data text 209; maximum of the accumulator 202 and the weight text 203; output the accumulator 202; pass through the data text 209; pass through the weight text 203; output zero. In one embodiment, the arithmetic logic unit function 2926 is specified by a neural network unit initialization instruction and used by the arithmetic logic unit 204 in response to an execute instruction (not shown). In one embodiment, the arithmetic logic unit function 2926 is specified by individual neural network unit instructions, such as the multiply-accumulate and maxwacc instructions described above.
The rounding control 2932 specifies the form of rounding used by the rounder 3004 (of Figure 30). In one embodiment, the rounding modes that may be specified include, but are not limited to: no rounding, round to nearest, and random rounding. Preferably, the processor 100 includes a random bit source 3003 (see Figure 30) that generates random bits 3005 that are sampled and used to perform the random rounding in order to reduce the likelihood of a rounding bias. In one embodiment, when the round bit is one and the sticky bit is zero, the neural processing unit 126 rounds up if the sampled random bit 3005 is true and does not round up if the sampled random bit 3005 is false. In one embodiment, the random bit source 3003 generates the random bits 3005 by sampling random electrical characteristics of the processor 100, such as thermal noise across a semiconductor diode or resistor, although the present invention is not limited thereto.
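To make the round-bit/sticky-bit/random-bit interaction concrete, here is a minimal C sketch (the random source is a placeholder for the hardware's thermal-noise-derived bits, and the value is assumed non-negative, since the positive form conversion described below with respect to Figure 30 occurs before rounding):
```c
#include <stdint.h>
#include <stdlib.h>

/* Drop 'frac' (>= 1) low-order bits of a non-negative value with random
 * rounding: round bit = most significant dropped bit, sticky = OR of the rest. */
static uint64_t random_round_shift(uint64_t value, int frac)
{
    uint64_t result  = value >> frac;
    uint64_t dropped = value & ((1ULL << frac) - 1);
    int round  = (int)((dropped >> (frac - 1)) & 1);
    int sticky = (dropped & ((1ULL << (frac - 1)) - 1)) != 0;

    if (round && sticky)
        result += 1;                 /* clearly past the halfway point: round up */
    else if (round && !sticky)
        result += (rand() & 1);      /* exactly halfway: round up with prob. 1/2 */
    return result;
}
```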
The run function 2934 specifies the function applied to the accumulator 202 value 217 to generate the output 133 of the neural processing unit 126. As described herein, the run functions 2934 include, but are not limited to: sigmoid; hyperbolic tangent; softplus; correction (rectification); divide by a specified power of two; multiply by a user-specified reciprocal value to accomplish an effective division; pass through the full accumulator; and pass through the accumulator as a standard size, as described in more detail below. In one embodiment, the run function is specified by a neural network unit run function instruction. Alternatively, the run function is specified by an initialization instruction and applied in response to an output instruction, e.g., the run function unit output instruction at address 4 of Fig. 4, in which case the run function instruction at address 3 of Fig. 4 is subsumed into the output instruction.
The reciprocal 2942 value specifies a value that is multiplied by the accumulator 202 value 217 to accomplish a division of the accumulator 202 value 217. That is, the user specifies the reciprocal 2942 value as the reciprocal of the divisor actually desired. This is useful, for example, in conjunction with a convolution or common source operation as described herein. Preferably, the user specifies the reciprocal 2942 value in two parts, as described in more detail below with respect to Figure 29C. In one embodiment, the control buffer 127 includes a field (not shown) that enables the user to specify division by one of a plurality of built-in divisor values whose sizes correspond to those of commonly used convolution kernels, e.g., 9, 25, 36 or 49. In such an embodiment, the run function unit 212 may store the reciprocals of the built-in divisors for multiplication by the accumulator 202 value 217.
The offset 2944 specifies the number of bits by which a shifter of the run function unit 212 right-shifts the accumulator 202 value 217 to accomplish a division by a power of two. This is useful, for example, in conjunction with convolution kernels whose size is a power of two.
The output random access memory 2952 value specifies which one of the data random access memory 122 and the weight random access memory 124 is to receive the output result 133.
The output binary point 2954 value indicates the location of the binary point of the output result 133. Preferably, the output binary point 2954 value indicates the number of bit positions from the right side at which the binary point of the output result 133 lies. In other words, the output binary point 2954 indicates how many of the least significant bits of the output result 133 are fractional bits, i.e., lie to the right of the binary point. The run function unit 212 performs rounding, compression, saturation and size conversion based on the value of the output binary point 2954 (and, in most cases, also based on the value of the data binary point 2922, the weight binary point 2924, the run function 2934 and/or the configuration 2902).
The output order 2956 controls various aspects of the output result 133. In one embodiment, the run function unit 212 employs the notion of a standard size, which is twice the width size (in bits) specified by the configuration 2902. Thus, for example, if the configuration 2902 implies that the size of the input texts received from the data random access memory 122 and the weight random access memory 124 is 8 bits, the standard size is 16 bits; in another example, if the configuration 2902 implies that the size of the received input texts is 16 bits, the standard size is 32 bits. As described herein, the size of the accumulator 202 is larger (e.g., the narrow accumulator 202B is 28 bits and the wide accumulator 202A is 41 bits) in order to maintain full precision of the intermediate computations, e.g., 1024 or 512 neural network unit multiply-accumulate instructions. Consequently, the accumulator 202 value 217 is larger (in bits) than the standard size, and for most values of the run function 2934 (except pass-through of the full accumulator) the run function unit 212 (e.g., the standard size compressor 3008 described below with respect to Figure 30) compresses the accumulator 202 value 217 down to the standard size. A first default value of the output order 2956 instructs the run function unit 212 to perform the specified run function 2934 to produce an internal result and to output the internal result as the output result 133, where the internal result is the size of the original input texts, i.e., half the standard size. A second default value of the output order 2956 instructs the run function unit 212 to perform the specified run function 2934 to produce an internal result that is twice the size of the original input texts, i.e., the standard size, and to output the lower half of the internal result as the output result 133; and a third default value of the output order 2956 instructs the run function unit 212 to output the upper half of the standard-size internal result as the output result 133. A fourth default value of the output order 2956 instructs the run function unit 212 to output the raw least-significant text of the accumulator 202 as the output result 133; a fifth default value instructs the run function unit 212 to output the raw middle text of the accumulator 202 as the output result 133; and a sixth default value instructs the run function unit 212 to output the raw most-significant text of the accumulator 202 (whose width is specified by the configuration 2902) as the output result 133, as described in more detail above in the sections relating to Figures 8 through 10. As mentioned above, outputting the full accumulator 202 size or the standard-size internal result enables other execution units 112 of the processor 100 to perform run functions, such as the softmax run function.
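A hedged C sketch of those six choices, under an assumed wide configuration (16-bit texts, 32-bit standard size, with the accumulator held here in an int64_t); the enum names are hypothetical labels for the six default values, in order:
```c
#include <stdint.h>

enum out_order { OUT_WORD, OUT_STD_LOWER, OUT_STD_UPPER,
                 OUT_ACC_LOW, OUT_ACC_MID, OUT_ACC_HIGH };

/* afu_result: the run function unit's internal result (word-size for the first
 * default value, standard-size for the second/third); acc: raw accumulator 202. */
static uint16_t select_output(enum out_order ord, int32_t afu_result, int64_t acc)
{
    switch (ord) {
    case OUT_WORD:      return (uint16_t)afu_result;         /* word-size internal result   */
    case OUT_STD_LOWER: return (uint16_t)afu_result;         /* lower half of standard size */
    case OUT_STD_UPPER: return (uint16_t)(afu_result >> 16); /* upper half of standard size */
    case OUT_ACC_LOW:   return (uint16_t)acc;                /* raw accumulator, low text   */
    case OUT_ACC_MID:   return (uint16_t)(acc >> 16);        /* raw accumulator, mid text   */
    default:            return (uint16_t)(acc >> 32);        /* raw accumulator, high text  */
    }
}
```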
The fields described with respect to Figure 29A (and Figures 29B and 29C) are shown residing in the control buffer 127; however, the present invention is not limited thereto, and one or more of the fields may reside in other parts of the neural network unit 121. Preferably, many of the fields may be included within the neural network unit instructions themselves and decoded by the sequencer 128 to generate micro-instructions 3416 (see Figure 34) that control the arithmetic logic units 204 and/or the run function units 212. Additionally, the fields may be included in micro-operations 3414 (see Figure 34) stored in the media buffer 118 that control the arithmetic logic units 204 and/or the run function units 212. Such embodiments can reduce the need for an initialize neural network unit instruction, and in other embodiments the initialize neural network unit instruction may be eliminated.
As described above, a neural network unit instruction may specify that an arithmetic logic instruction operation be performed on memory operands (e.g., texts from the data random access memory 122 and/or the weight random access memory 124) or on a rotated operand (e.g., from the multitask buffers 208/705). In one embodiment, a neural network unit instruction may also specify an operand as the buffered output of a run function (e.g., the output of the buffer 3038 of Figure 30). Additionally, as described above, a neural network unit instruction may specify that a current column address of the data random access memory 122 or the weight random access memory 124 be incremented. In one embodiment, the neural network unit instruction may specify an immediate signed integer difference that is added to the current column to accomplish incrementing or decrementing by a value other than one.
Figure 29B is a block schematic diagram showing another embodiment of the control buffer 127 of Fig. 1. The control buffer 127 of Figure 29B is similar to the control buffer 127 of Figure 29A; however, the control buffer 127 of Figure 29B includes an accumulator binary point 2923. The accumulator binary point 2923 indicates the binary point location of the accumulator 202. Preferably, the accumulator binary point 2923 value indicates the number of bit positions from the right side at which the binary point lies. In other words, the accumulator binary point 2923 indicates how many of the least significant bits of the accumulator 202 are fractional bits, i.e., lie to the right of the binary point. In this embodiment, the accumulator binary point 2923 is specified explicitly, rather than being determined implicitly as in the embodiment of Figure 29A.
Figure 29C is a block schematic diagram showing an embodiment in which the reciprocal 2942 of Figure 29A is stored in two parts. The first part 2962 is a shift value that indicates the number of suppressed leading zeroes 2962 in the true reciprocal value that the user desires to multiply by the accumulator 202 value 217. The number of leading zeroes is the number of consecutive zeroes immediately to the right of the binary point. The second part 2964 is the leading-zero-suppressed reciprocal value, i.e., the true reciprocal value with all of its leading zeroes removed. In one embodiment, the suppressed leading zero count 2962 is stored with 4 bits and the leading-zero-suppressed reciprocal value 2964 is stored as an 8-bit unsigned value.
To illustrate by example, assume the user desires to multiply the accumulator 202 value 217 by the reciprocal of the value 49. The reciprocal of 49 represented in binary with 13 fractional bits is 0.0000010100111, which has five leading zeroes. Accordingly, the user populates the suppressed leading zero count 2962 with the value 5 and the leading-zero-suppressed reciprocal value 2964 with the value 10100111. After the reciprocal multiplier "divider A" 3014 (see Figure 30) multiplies the accumulator 202 value 217 by the leading-zero-suppressed reciprocal value 2964, it right-shifts the resulting product according to the suppressed leading zero count 2962. Such an embodiment advantageously accomplishes high precision while expressing the reciprocal 2942 value with a relatively small number of bits.
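The 1/49 example above can be checked with a short piece of C (a hedged illustration of the encoding, not the hardware multiplier itself); the leading-zero-suppressed mantissa 10100111 is 0xA7, and the shift applied to the product is the 8 stored fraction bits plus the 5 suppressed leading zeroes.
```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint32_t mantissa   = 0xA7; /* 10100111b: 1/49 with leading zeroes removed */
    const unsigned lead_zeros = 5;    /* suppressed leading zero count 2962           */

    int64_t acc = 4900;               /* example accumulator 202 value                */

    /* Multiply by the suppressed reciprocal, then shift right by the 8 stored
     * fraction bits plus the suppressed leading zeroes (13 in total). */
    int64_t quotient = (acc * (int64_t)mantissa) >> (8 + lead_zeros);

    printf("%lld / 49 ~= %lld\n", (long long)acc, (long long)quotient); /* prints 99; exact answer 100 */
    return 0;
}
```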
Figure 30 is a block schematic diagram showing an embodiment of the run function unit 212 of Fig. 2. The run function unit 212 includes the control buffer 127 of Fig. 1; a positive type converter (PFC) and output binary point aligner (OBPA) 3002 that receives the accumulator 202 value 217; a rounder 3004 that receives the accumulator 202 value 217 and an indication of the number of bits shifted out by the output binary point aligner 3002; a random bit source 3003, as described above, that generates random bits 3005; a first multiplexer 3006 that receives the output of the positive type converter and output binary point aligner 3002 and the output of the rounder 3004; a standard size compressor (CCS) and saturator 3008 that receives the output of the first multiplexer 3006; a bit selector and saturator 3012 that receives the output of the standard size compressor and saturator 3008; a corrector 3018 that receives the output of the standard size compressor and saturator 3008; a reciprocal multiplier 3014 that receives the output of the standard size compressor and saturator 3008; a right shifter 3016 that receives the output of the standard size compressor and saturator 3008; a hyperbolic tangent (tanh) module 3022 that receives the output of the bit selector and saturator 3012; a sigmoid module 3024 that receives the output of the bit selector and saturator 3012; a softplus module 3026 that receives the output of the bit selector and saturator 3012; a second multiplexer 3032 that receives the outputs of the tanh module 3022, the sigmoid module 3024, the softplus module 3026, the corrector 3018, the reciprocal multiplier 3014 and the right shifter 3016, as well as the passed-through standard size output 3028 of the standard size compressor and saturator 3008; a sign restorer 3034 that receives the output of the second multiplexer 3032; a size converter and saturator 3036 that receives the output of the sign restorer 3034; a third multiplexer 3037 that receives the output of the size converter and saturator 3036 and the accumulator output 217; and an output buffer 3038 that receives the output of the multiplexer 3037 and whose output is the result 133 of Fig. 1.
The positive type converter and output binary point aligner 3002 receives the accumulator 202 value 217. Preferably, as described above, the accumulator 202 value 217 is a full-precision value. That is, the accumulator 202 has a sufficient number of storage bits to hold an accumulated value that is the sum of a series of products generated by the integer multipliers 242 and added by the integer adders 244, without discarding any of the bits of the individual products of the multipliers 242 or of the individual sums of the adders, so that precision is maintained. Preferably, the accumulator 202 has at least a sufficient number of bits to hold the maximum number of product accumulations that the neural network unit 121 is programmable to perform. For example, referring to the program of Fig. 4, in a wide configuration the maximum number of product accumulations the neural network unit 121 is programmable to perform is 512, and the accumulator 202 bit width is 41. As another example, referring to the program of Figure 20, in a narrow configuration the maximum number of product accumulations the neural network unit 121 is programmable to perform is 1024, and the accumulator 202 bit width is 28. In general, the full-precision accumulator 202 includes at least Q bits, where Q is the sum of M and log2(P), where M is the bit width of the integer product of the multiplier 242 (e.g., 16 bits for a narrow multiplier 242 or 32 bits for a wide multiplier 242) and P is the maximum permissible number of products that may be accumulated into the accumulator 202. Preferably, the maximum number of product accumulations is specified via the programming specification to the programmer of the neural network unit 121. In one embodiment, the sequencer 128 enforces a maximum count value of, e.g., 511 for a multiply-accumulate neural network unit instruction (e.g., the instruction at address 2 of Fig. 4), on the assumption that a previous multiply-accumulate instruction has loaded a column of data/weight texts 206/207 from the data/weight random access memory 122/124 (e.g., the instruction at address 1 of Fig. 4).
Using an accumulator 202 with a bit width large enough to accumulate, at full precision, the maximum permissible number of accumulations simplifies the design of the arithmetic logic unit 204 of the neural processing unit 126. In particular, it alleviates the need for logic to saturate sums generated by the integer adder 244, since a smaller accumulator would overflow and would require keeping track of the binary point location of the accumulator in order to determine whether an overflow occurred and whether saturation was required. To illustrate by example the problem with a design that includes a non-full-precision accumulator and instead includes saturating logic to handle overflows of the non-full-precision accumulator, assume the following.
(1) The range of the data text values is between 0 and 1 and all of the storage bits are used to store fractional bits. The range of the weight text values is between -8 and +8 and all but three of the storage bits are used to store fractional bits. And the range of the accumulated values for input to a hyperbolic tangent run function is between -8 and +8 and all but three of the storage bits are used to store fractional bits.
(2) The bit width of the accumulator is non-full-precision (e.g., only the bit width of the products).
(3) Assuming the accumulator were full precision, the final accumulated value would be somewhere between -8 and +8 (e.g., +4.2); however, the products earlier in the series, before "point A", tend relatively frequently to be positive, whereas the products after point A tend relatively frequently to be negative.
In such a case, an inaccurate result (i.e., a result other than +4.2) might be obtained. This is because, at some point before point A, the accumulator would need to reach a value larger than its saturation maximum of +8, e.g., +8.2, so the additional 0.2 would be lost. The accumulator could even remain at the saturated value for more of the remaining product accumulations, resulting in the loss of more positive value. Thus, the final value of the accumulator could be a smaller value (i.e., less than +4.2) than it would have been had the accumulator had a full-precision bit width.
The positive type converter 3002 converts the accumulator 202 value 217 to a positive form when it is negative, and generates an additional bit that indicates whether the original value was positive or negative, which is passed down the run function unit 212 pipeline along with the value. Converting to a positive form simplifies subsequent operations of the run function unit 212. For example, it enables only positive values to be input to the tanh module 3022 and the sigmoid module 3024, which simplifies those modules. Additionally, it simplifies the rounder 3004 and the saturator 3008.
The output binary point aligner 3002 shifts right, or scales, the positive-form value to align it with the output binary point 2954 specified in the control buffer 127. Preferably, the output binary point aligner 3002 computes as the shift amount the difference between the number of fractional bits of the accumulator 202 value 217 (e.g., as specified by the accumulator binary point 2923, or as the sum of the data binary point 2922 and the weight binary point 2924) and the number of fractional bits of the output (e.g., as specified by the output binary point 2954). Thus, for example, if the accumulator 202 binary point 2923 is 8 (as in the embodiment above) and the output binary point 2954 is 3, the output binary point aligner 3002 right-shifts the positive-form value 5 bits to generate the result provided to the multiplexer 3006 and to the rounder 3004.
The rounder 3004 rounds the accumulator 202 value 217. Preferably, the rounder 3004 generates a rounded version of the positive-form value generated by the positive type converter and output binary point aligner 3002 and provides the rounded version to the multiplexer 3006. The rounder 3004 rounds according to the rounding control 2932 described above, which, as described herein, may include random rounding using the random bit 3005. The multiplexer 3006 selects one of its inputs, namely either the positive-form value from the positive type converter and output binary point aligner 3002 or the rounded version thereof from the rounder 3004, based on the rounding control 2932 (which, as described herein, may include random rounding), and provides the selected value to the standard size compressor and saturator 3008. Preferably, if the rounding control 2932 specifies no rounding, the multiplexer 3006 selects the output of the positive type converter and output binary point aligner 3002, and otherwise it selects the output of the rounder 3004. Other embodiments are contemplated in which the run function unit 212 performs additional rounding. For example, in one embodiment, when the bit selector 3012 compresses the bits of the standard size compressor and saturator 3008 output (described below), it rounds based on the lost low-order bits. As another example, the product of the reciprocal multiplier 3014 (described below) is rounded. As yet another example, the size converter 3036 rounds when it converts to the proper output size (described below), which may involve losing low-order bits that are used in the rounding determination.
The standard size compressor 3008 compresses the multiplexer 3006 output value to the standard size. Thus, for example, if the neural processing unit 126 is in a narrow or funnel configuration 2902, the standard size compressor 3008 compresses the 28-bit multiplexer 3006 output value to 16 bits; whereas if the neural processing unit 126 is in a wide configuration 2902, the standard size compressor 3008 compresses the 41-bit multiplexer 3006 output value to 32 bits. However, before compressing to the standard size, if the pre-compressed value is larger than the maximum value expressible in the standard form, the saturator 3008 saturates the pre-compressed value to the maximum value expressible in the standard form. For example, if any of the bits of the pre-compressed value to the left of the most significant compressed-value bit is a 1, the saturator 3008 saturates, i.e., fills the compressed value to the maximum value (e.g., to all 1s).
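A compact way to see the compress-and-saturate step (a hedged C sketch with assumed widths for the wide configuration: a 41-bit positive-form input and a 32-bit standard size; not the actual circuit):
```c
#include <stdint.h>

/* Compress a non-negative 41-bit value to the 32-bit standard size,
 * saturating to the maximum expressible value if any higher bit is set. */
static uint32_t compress_to_standard(uint64_t value41)
{
    const uint64_t max_std = 0xFFFFFFFFull;      /* largest 32-bit standard value */
    if (value41 >> 32)                           /* any bit left of the kept bits? */
        return (uint32_t)max_std;                /* saturate: fill with all 1s     */
    return (uint32_t)value41;                    /* fits: pass the low 32 bits     */
}
```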
Preferably, the tanh module 3022, the sigmoid module 3024 and the softplus module 3026 each comprise a lookup table, e.g., a programmable logic array (PLA), read-only memory (ROM) or combinational logic gates. In one embodiment, in order to simplify and reduce the size of these modules 3022/3024/3026, the input value provided to them has the form 3.4, i.e., three whole bits and four fractional bits; that is, the input value has four bits to the right of the binary point and three bits to the left of the binary point. These values are chosen because at the extremes of the 3.4-form input value range (-8, +8) the output values asymptotically approach their minimum/maximum values. However, the present invention is not limited thereto, and other embodiments are contemplated that place the binary point at a different location, e.g., in a 4.3 form or a 2.5 form. The bit selector 3012 selects the bits of the standard size compressor and saturator 3008 output that satisfy the 3.4-form criterion, which involves compression, i.e., some bits are lost, since the standard form has a larger number of bits. However, before selecting/compressing the standard size compressor and saturator 3008 output value, if the pre-compressed value is greater than the maximum value expressible in the 3.4 form, the saturator 3012 saturates the pre-compressed value to the maximum value expressible in the 3.4 form. For example, if any of the bits of the pre-compressed value to the left of the most significant 3.4-form bit is a 1, the saturator 3012 saturates, i.e., fills to the maximum value (e.g., to all 1s).
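The bit selection and saturation into the 3.4 form performed by the bit selector and saturator 3012 can be sketched as follows (hedged C, assuming a non-negative standard-form input with frac >= 4 fractional bits; the names are illustrative):
```c
#include <stdint.h>

/* Select the 3.4-form bits (3 whole bits, 4 fractional bits) from a
 * non-negative standard-form value that has 'frac' fractional bits,
 * saturating if anything above the 3 whole bits is set. */
static uint8_t to_3dot4(uint32_t std_value, int frac)
{
    uint32_t kept  = (std_value >> (frac - 4)) & 0x7F;  /* 7 bits: xxx.yyyy            */
    uint32_t above = std_value >> (frac + 3);            /* bits left of the whole bits */
    return above ? 0x7F : (uint8_t)kept;                 /* saturate to 111.1111        */
}
```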
The tanh module 3022, the sigmoid module 3024 and the softplus module 3026 perform their respective run functions (described above) on the 3.4-form value derived from the standard size compressor and saturator 3008 output to generate a result. Preferably, the results of the tanh module 3022 and the sigmoid module 3024 are 7-bit results in a 0.7 form, i.e., zero whole bits and seven fractional bits; that is, the output value has seven bits to the right of the binary point. Preferably, the result of the softplus module 3026 is a 7-bit result in a 3.4 form, i.e., in the same form as the input to the module 3026. Preferably, the outputs of the tanh module 3022, the sigmoid module 3024 and the softplus module 3026 are extended to the standard form (e.g., with leading zeroes added as necessary) and aligned so as to have the binary point specified by the output binary point 2954 value.
The corrector 3018 produces a rectified version of the output value of the standard size compressor and saturator 3008. That is, if the output value of the standard size compressor and saturator 3008 (whose sign is piped down as described above) is negative, the corrector 3018 outputs zero; otherwise, the corrector 3018 outputs its input value unchanged. In a preferred embodiment, the output of the corrector 3018 is in standard format and has the binary point specified by the output binary point 2954 value.
The reciprocal multiplier 3014 multiplies the output of the standard size compressor and saturator 3008 by the user-specified reciprocal value held in reciprocal value 2942 to produce a standard-size product, which is effectively the quotient of the standard size compressor and saturator 3008 output and the divisor whose reciprocal the value 2942 holds. In a preferred embodiment, the output of the reciprocal multiplier 3014 is in standard format and has the binary point specified by the output binary point 2954 value.
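The effect of the reciprocal multiplier 3014 may be illustrated as follows; plain floating-point values are used purely for exposition, whereas the hardware operates on fixed-point values with the programmed binary points, and the averaging use case is only an illustrative assumption.

```python
# Illustrative only: dividing by multiplying with a user-supplied reciprocal,
# e.g. to average an accumulated sum over a pooling window.
def divide_by_reciprocal(accumulated: float, divisor: float) -> float:
    reciprocal = 1.0 / divisor       # programmed once by the user (reciprocal value 2942)
    return accumulated * reciprocal  # one multiply replaces a hardware divider

print(divide_by_reciprocal(123.5, 49))   # same result as 123.5 / 49
```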
The right shifter 3016 shifts the output of the standard size compressor and saturator 3008 right by the user-specified number of bits held in the offset value 2944 to produce a standard-size quotient. In a preferred embodiment, the output of the right shifter 3016 is in standard format and has the binary point specified by the output binary point 2954 value.
The multiplexer 3032 selects the appropriate input specified by the run function 2934 value and provides it to the sign restorer 3034. If the original accumulator 202 value 217 is negative, the sign restorer 3034 converts the positive-form value output by the multiplexer 3032 to negative form, for example to two's complement form.
The size converter 3036 converts the output of the sign restorer 3034 to the appropriate size according to the value of the output command 2956 described with respect to Figure 29A. In a preferred embodiment, the output of the sign restorer 3034 has a binary point specified by the output binary point 2954 value. In a preferred embodiment, for the first default value of the output command, the size converter 3036 discards the upper half of the sign restorer 3034 output. Furthermore, if the output of the sign restorer 3034 is positive and exceeds the maximum value expressible in the word size specified by configuration 2902, or is negative and is less than the minimum value expressible in that word size, the saturator 3036 saturates its output to the expressible maximum/minimum value, respectively. For the second and third default values, the size converter 3036 passes the sign restorer 3034 output through.
The multiplexer 3037 selects either the size converter and saturator 3036 output or the accumulator 202 output 217, according to the output command 2956, to provide to the output register 3038. More specifically, for the first and second default values of the output command 2956, the multiplexer 3037 selects the lower word (whose size is specified by configuration 2902) of the size converter and saturator 3036 output. For the third default value, the multiplexer 3037 selects the upper word of the size converter and saturator 3036 output. For the fourth default value, the multiplexer 3037 selects the lower word of the raw accumulator 202 value 217; for the fifth default value, the multiplexer 3037 selects the middle word of the raw accumulator 202 value 217; and for the sixth default value, the multiplexer 3037 selects the upper word of the raw accumulator 202 value 217. As described above, in a preferred embodiment, the run function unit 212 appends zero-valued upper bits to the upper word of the raw accumulator 202 value 217.
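For illustration, the selection of the lower, middle and upper words of the raw accumulator 202 value 217 may be modeled as below; the bit widths correspond to the wide configuration of the Figure 33 example below, and the function is an illustrative sketch rather than the multiplexer 3037 itself.

```python
# Illustrative model of the fourth/fifth/sixth output command cases: the raw
# 41-bit accumulator value is handed out one 16-bit word at a time, with the
# upper word zero-padded as noted above.
def raw_accumulator_words(acc: int, word_bits: int = 16):
    mask = (1 << word_bits) - 1
    lower = acc & mask
    middle = (acc >> word_bits) & mask
    upper = (acc >> (2 * word_bits)) & mask   # the top bits are zero padding
    return lower, middle, upper

acc = 0b00100000000000000000110000001101111011110   # the Figure 33 example value
print([format(w, "016b") for w in raw_accumulator_words(acc)])
# ['0001101111011110', '0000000000011000', '0000000001000000']
```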
Figure 31 shows an example of the operation of the run function unit 212 of Figure 30. As shown, the configuration 2902 of the neural processing unit 126 is set to the narrow configuration. In addition, the signed data 2912 and signed weight 2914 values are true. In addition, the data binary point 2922 value indicates that the binary point for the data random access memory 122 words has 7 bits to its right; an example value of the first data word received by the neural processing unit 126 is shown as 0.1001110. In addition, the weight binary point 2924 value indicates that the binary point for the weight random access memory 124 words has 3 bits to its right; an example value of the first weight word received by the neural processing unit 126 is shown as 00001.010.
The 16-bit product of the first data word and the first weight word (which is added to the initialized zero value of the accumulator 202) is shown as 000000.1100001100. Because the data binary point 2922 is 7 and the weight binary point 2924 is 3, the implied accumulator 202 binary point has 10 bits to its right. In the narrow configuration, as in this example, the accumulator 202 is 28 bits wide. As an example, after all the arithmetic logic operations are performed (e.g., all 1024 multiply-accumulates of Figure 20), the accumulator 202 value 217 is 000000000000000001.1101010100.
The output binary point 2954 value indicates that the output binary point has 7 bits to its right. Therefore, after passing through the output binary point aligner 3002 and the standard size compressor 3008, the accumulator 202 value 217 is scaled, rounded and compressed to the standard-format value 000000001.1101011. In this example, the output binary point location indicates 7 fractional bits and the accumulator 202 binary point location indicates 10 fractional bits. Therefore, the output binary point aligner 3002 computes a difference of 3 and scales the accumulator 202 value 217 by shifting it right 3 bits. This is indicated in Figure 31 by the accumulator 202 value 217 losing its 3 least significant bits (binary 100). Further in this example, the rounding control 2932 value indicates the use of random rounding, and the sampled random bit 3005 is assumed to be true. Consequently, as described above, the least significant bit is rounded up, because the round bit of the accumulator 202 value 217 (the most significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) is one and the sticky bit (the Boolean OR of the 2 least significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) is zero.
In this example, the run function 2934 value indicates that the sigmoid function is to be used. Consequently, the digit selector 3012 selects the bits of the standard-format value such that the input to the sigmoid module 3024 has three integer bits and four fractional bits, as described above, i.e., the value 001.1101 shown. The sigmoid module 3024 output value is placed in the standard format, which is the value shown, 000000000.1101110.
The output command 2956 of this example specifies the first default value, i.e., output the word size indicated by configuration 2902, which in this case is a narrow word (8 bits). Consequently, the size converter 3036 converts the standard sigmoid output value to an 8-bit quantity with an implied binary point having 7 bits to its right, producing the output value 01101110, as shown.
Figure 32 shows a second example of the operation of the run function unit 212 of Figure 30. The example of Figure 32 illustrates the operation of the run function unit 212 when the run function 2934 indicates that the accumulator 202 value 217 is to be passed through in the standard size. As shown, the configuration 2902 is set to the narrow configuration of the neural processing units 126.
In this example, the accumulator 202 is 28 bits wide, and the accumulator 202 binary point location has 10 bits to its right (either because the sum of the data binary point 2922 and the weight binary point 2924 is 10 in one embodiment, or because the accumulator binary point 2923 is explicitly specified with a value of 10 in another embodiment, as described above). As an example, after all the arithmetic logic operations are performed, the accumulator 202 value 217 shown in Figure 32 is 000001100000011011.1101111010.
In this example, the output binary point 2954 value indicates that the output has 4 bits to the right of the binary point. Therefore, after passing through the output binary point aligner 3002 and the standard size compressor 3008, the accumulator 202 value 217 is saturated and compressed to the standard-format value 111111111111.1111 shown, which is received by the multiplexer 3032 as the standard-size pass-through value 3028.
Two output commands 2956 are shown in this example. The first specifies the second default value, i.e., output the lower word of the standard-format size. Since the size indicated by configuration 2902 is a narrow word (8 bits), the standard size is 16 bits, and the size converter 3036 selects the lower 8 bits of the standard-size pass-through value 3028 to produce the 8-bit value 11111111 shown in the figure. The second output command 2956 specifies the third default value, i.e., output the upper word of the standard-format size. Consequently, the size converter 3036 selects the upper 8 bits of the standard-size pass-through value 3028 to produce the 8-bit value 11111111 shown in the figure.
Figure 33 shows a third example of the operation of the run function unit 212 of Figure 30. The example of Figure 33 illustrates the operation of the run function unit 212 when the run function 2934 indicates that the entire raw accumulator 202 value 217 is to be passed through. As shown, the configuration 2902 is set to the wide configuration of the neural processing units 126 (e.g., 16-bit input words).
In this example, the accumulator 202 is 41 bits wide, and the accumulator 202 binary point location has 8 bits to its right (either because the sum of the data binary point 2922 and the weight binary point 2924 is 8 in one embodiment, or because the accumulator binary point 2923 is explicitly specified with a value of 8 in another embodiment). As an example, after all the arithmetic logic operations are performed, the accumulator 202 value 217 shown in Figure 33 is 001000000000000000001100000011011.11011110.
Three output commands 2956 are shown in this example. The first specifies the fourth default value, i.e., output the lower word of the raw accumulator 202 value; the second specifies the fifth default value, i.e., output the middle word of the raw accumulator 202 value; and the third specifies the sixth default value, i.e., output the upper word of the raw accumulator 202 value. Since the size indicated by configuration 2902 is a wide word (16 bits), as shown in Figure 33, in response to the first output command 2956 the multiplexer 3037 selects the 16-bit value 0001101111011110; in response to the second output command 2956 the multiplexer 3037 selects the 16-bit value 0000000000011000; and in response to the third output command 2956 the multiplexer 3037 selects the 16-bit value 0000000001000000.
As described above, the neural network unit 121 operates on integer data rather than floating-point data. This helps to simplify each neural processing unit 126, or at least its arithmetic logic unit 204 portion. For example, the arithmetic logic unit 204 does not need to include the adder that a floating-point implementation would require in the multiplier 242 to add the exponents of the multiplicands. Similarly, the arithmetic logic unit 204 does not need to include the shifter that a floating-point implementation would require in the adder 234 to align the binary points of the addends. As one skilled in the art will appreciate, floating-point units are generally very complex; these are only examples of simplifications to the arithmetic logic unit 204, and other simplifications are enjoyed by the present integer embodiments with hardware fixed-point assistance in which the user may specify the relevant binary points. The fact that the arithmetic logic units 204 are integer units may produce a smaller (and faster) neural processing unit 126 than a floating-point embodiment, which further facilitates the integration of a large array of neural processing units 126 into the neural network unit 121. The run function unit 212 portion deals with the scaling and saturating of the accumulator 202 value 217 based on the user-specified, preferably programmable, number of fractional bits desired in the accumulated values and in the output values. Any additional complexity, and the accompanying increase in size, power consumption and/or time, in the fixed-point hardware assist of the run function units 212 may be amortized by sharing the run function units 212 among the arithmetic logic units 204, since the number of run function units 1112 may be reduced in a shared embodiment such as the embodiment of Figure 11.
The embodiments described herein enjoy many of the benefits of reduced hardware complexity associated with using integer arithmetic units (compared to floating-point arithmetic units), while still providing arithmetic on fractional quantities, i.e., numbers with a binary point. An advantage of floating-point arithmetic is that it accommodates data whose individual values may fall anywhere within a very wide range (limited in practice only by the size of the exponent range, which may be very large). That is, each floating-point number has its own potentially unique exponent value. However, the embodiments described herein recognize and exploit the fact that in certain applications the input data are highly parallel and fall within a relatively narrow range, such that all the parallel data can share the same "exponent". Accordingly, these embodiments allow the user to specify the binary point location once for all the input values and/or accumulated values. Similarly, by recognizing and exploiting the similar range characteristics of the parallel outputs, these embodiments allow the user to specify the binary point location once for all the output values. An artificial neural network is one example of such an application, although the embodiments of the present invention may also be employed to perform computations for other applications. By specifying the binary point location once for many inputs rather than for each individual input number, the embodiments use memory space more efficiently (e.g., require less memory) than a floating-point implementation and/or achieve greater precision for a similar amount of memory, since the bits that would serve as an exponent in a floating-point implementation can instead be used to increase numerical precision.
Furthermore, the embodiments recognize that precision may be lost when accumulating a large series of integer operations (e.g., through overflow or loss of the less significant fractional bits), and provide a solution, primarily in the form of an accumulator large enough to avoid loss of precision.
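As a rough, illustrative rule of thumb (an assumption for exposition, not the exact sizing rule of the embodiments), accumulating N products of two B-bit operands requires on the order of 2B plus ceil(log2(N)) bits, which suggests why the accumulator is made considerably wider than a single product:

```python
import math

# Illustrative estimate of the accumulator width needed to avoid overflow.
def accumulator_bits(operand_bits: int, num_products: int) -> int:
    return 2 * operand_bits + math.ceil(math.log2(num_products))

print(accumulator_bits(8, 1024))    # narrow words, 1024 accumulations -> 26
print(accumulator_bits(16, 512))    # wide words, 512 accumulations -> 41
```

The widths actually used in the embodiments (e.g., 28 bits in the narrow configuration) include additional margin beyond this simple estimate.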
Direct execution of neural network unit micro-operations
Figure 34 is a block diagram showing partial details of the processor 100 and neural network unit 121 of Figure 1. The neural network unit 121 includes the pipeline stages 3401 of the neural processing units 126. The pipeline stages 3401, separated by staging registers, include combinational logic that accomplishes the operation of the neural processing units 126 described herein, such as Boolean logic gates, multiplexers, adders, multipliers, comparators and so forth. The pipeline stages 3401 receive a micro-operation 3418 from a multiplexer 3402. The micro-operation 3418 flows down the pipeline stages 3401 and controls their combinational logic. The micro-operation 3418 is a collection of bits. In a preferred embodiment, the micro-operation 3418 includes the bits of the data random access memory 122 memory address 123, the bits of the weight random access memory 124 memory address 125, the bits of the program memory 129 memory address 131, the multitask buffer 208/705 control signals 213/713, and the fields of the control registers (such as the control registers of Figures 29A through 29C, for example). In one embodiment, the micro-operation 3418 comprises approximately 120 bits. The multiplexer 3402 receives micro-operations from three different sources and selects one of them as the micro-operation 3418 provided to the pipeline stages 3401.
One source of micro-operations for the multiplexer 3402 is the sequencer 128 of Figure 1. The sequencer 128 decodes the neural network unit instructions received from the program memory 129 and in response generates a micro-operation 3416 that is provided to the first input of the multiplexer 3402.
A second source of micro-operations for the multiplexer 3402 is a decoder 3404 that receives microinstructions 105 from the reservation station 108 of Figure 1 and receives operands from the general caching devices 116 and media caches 118. In a preferred embodiment, as described above, the microinstructions 105 are generated by the instruction translator 104 in response to the translation of MTNN instructions 1400 and MFNN instructions 1500. A microinstruction 105 may include an immediate field specifying a particular function (as specified by an MTNN instruction 1400 or MFNN instruction 1500), such as starting and stopping execution of a program in the program memory 129, directly executing a micro-operation from a media cache 118, or reading/writing a memory of the neural network unit as described above. The decoder 3404 decodes the microinstruction 105 and in response generates a micro-operation 3412 that is provided to the second input of the multiplexer 3402. In a preferred embodiment, for some functions 1432/1532 of an MTNN instruction 1400/MFNN instruction 1500, the decoder 3404 need not generate a micro-operation 3412 to send down the pipeline 3401, for example, writing the control register 127, starting execution of a program in the program memory 129, pausing execution of a program in the program memory 129, waiting for completion of execution of a program in the program memory 129, reading from the status register 127, and resetting the neural network unit 121.
A third source of micro-operations for the multiplexer 3402 is the media caches 118 themselves. In a preferred embodiment, as described above with respect to Figure 14, an MTNN instruction 1400 may specify a function instructing the neural network unit 121 to directly execute a micro-operation 3414 provided by the media caches 118 to the third input of the multiplexer 3402. Directly executing a micro-operation 3414 provided by the architectural media caches 118 is useful for testing the neural network unit 121, such as built-in self-test (BIST), and for debugging.
In a preferred embodiment, the decoder 3404 generates a mode pointer 3422 that controls the selection of the multiplexer 3402. When an MTNN instruction 1400 specifies a function to start running a program from the program memory 129, the decoder 3404 generates a mode pointer 3422 value that causes the multiplexer 3402 to select the micro-operation 3416 from the sequencer 128, until either an error occurs or the decoder 3404 encounters an MTNN instruction 1400 that specifies a function to stop running the program from the program memory 129. When an MTNN instruction 1400 specifies a function instructing the neural network unit 121 to directly execute a micro-operation 3414 provided by a media cache 118, the decoder 3404 generates a mode pointer 3422 value that causes the multiplexer 3402 to select the micro-operation 3414 from the specified media cache 118. Otherwise, the decoder 3404 generates a mode pointer 3422 value that causes the multiplexer 3402 to select the micro-operation 3412 from the decoder 3404.
Variable rate neural network unit
In many cases, the neural network unit 121, after running a program, sits idle waiting for the processor 100 to deal with things it needs to do before the next program can be run. For example, assume a situation similar to that described with respect to Figures 3 through 6A, in which the neural network unit 121 runs a multiply-accumulate-run-function program (which may also be referred to as a feed forward neural network layer program) two or more consecutive times. The processor 100 takes significantly longer to write the 512KB of weight values into the weight random access memory 124 for use by the next run of the neural network unit program than the neural network unit 121 takes to run the program. In other words, the neural network unit 121 runs the program in a relatively short time and then sits idle while the processor 100 finishes writing the next weight values into the weight random access memory 124 for the next program run. This situation is illustrated in Figure 36A, described in more detail below. In such a situation, it may be advantageous to run the neural network unit 121 at a lower clock rate so that the program run takes longer, thereby spreading the energy consumption needed to run the program over a longer period, which tends to keep the neural network unit 121, and perhaps the entire processor 100, at a lower temperature. This situation is referred to as mitigation mode and is illustrated in Figure 36B, described in more detail below.
Figure 35 is a block diagram showing a processor 100 that includes a variable-rate neural network unit 121. The processor 100 is similar to the processor 100 of Figure 1, and like-numbered elements are similar. The processor 100 of Figure 35 also includes clock generation logic 3502 coupled to the functional units of the processor 100, namely the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106, the reservation stations 108, the neural network unit 121, the other execution units 112, the memory subsystem 114, the general caching devices 116 and the media caches 118. The clock generation logic 3502 includes a clock generator, such as a phase-locked loop (PLL), that generates a clock signal having a primary clock rate, or clock frequency. For example, the primary clock rate may be 1 GHz, 1.5 GHz, 2 GHz and so forth. The clock rate indicates the number of cycles per second, e.g., oscillations of the clock signal between a high state and a low state. Preferably, the clock signal has a balanced duty cycle, i.e., it is high for half the cycle and low for the other half; alternatively, the clock signal may have an unbalanced duty cycle in which it is high longer than it is low, or vice versa. Preferably, the PLL is configurable to generate the primary clock signal at multiple clock rates. Preferably, the processor 100 includes a power management module that automatically adjusts the primary clock rate based on various factors, including the dynamically sensed operating temperature of the processor 100, its utilization, and commands from system software (e.g., the operating system or basic input/output system (BIOS)) indicating desired performance and/or power-saving targets. In one embodiment, the power management module includes microcode of the processor 100.
The clock generation logic 3502 also includes a clock distribution network, or clock tree. The clock tree distributes the primary clock signal to the functional units of the processor 100; as shown in Figure 35, it distributes clock signal 3506-1 to the instruction fetch unit 101, clock signal 3506-2 to the instruction cache 102, clock signal 3506-10 to the instruction translator 104, clock signal 3506-9 to the rename unit 106, clock signal 3506-8 to the reservation stations 108, clock signal 3506-7 to the neural network unit 121, clock signal 3506-4 to the other execution units 112, clock signal 3506-3 to the memory subsystem 114, clock signal 3506-5 to the general caching devices 116, and clock signal 3506-6 to the media caches 118; these signals are referred to collectively as the clock signals 3506. The clock tree comprises nodes, or wires, that convey the primary clock signals 3506 to their respective functional units. Additionally, the clock generation logic 3502 preferably includes clock buffers that regenerate the primary clock signal where a cleaner clock signal is needed and/or where the voltage level of the primary clock signal needs to be boosted, particularly for more distant nodes. Furthermore, each functional unit may have its own sub-clock tree that regenerates and/or boosts, as needed, the respective primary clock signal 3506 it receives.
The neural network unit 121 includes clock reduction logic 3504 that receives a mitigation pointer 3512 and the primary clock signal 3506-7 and in response generates a secondary clock signal. The secondary clock signal has a clock rate that is either equal to the primary clock rate or, when in mitigation mode, reduced relative to the primary clock rate by an amount programmed into the mitigation pointer 3512, in order to reduce heat generation. The clock reduction logic 3504 is similar to the clock generation logic 3502 in that it includes a clock distribution network, or clock tree, that distributes the secondary clock signal to the various functional blocks of the neural network unit 121: it distributes clock signal 3508-1 to the array of neural processing units 126, clock signal 3508-2 to the sequencer 128 and clock signal 3508-3 to the interface logic 3514; these signals are referred to collectively as the secondary clock signals 3508. Preferably, the neural processing units 126 include a plurality of pipeline stages 3401, as shown in Figure 34, which include pipeline staging registers that receive the secondary clock signal 3508-1 from the clock reduction logic 3504.
The neural network unit 121 also includes interface logic 3514 that receives the primary clock signal 3506-7 and the secondary clock signal 3508-3. The interface logic 3514 is coupled between the lower portions of the front end of the processor 100 (e.g., the reservation stations 108, media caches 118 and general caching devices 116) and the various functional blocks of the neural network unit 121, namely the clock reduction logic 3504, the data random access memory 122, the weight random access memory 124, the program memory 129 and the sequencer 128. The interface logic 3514 includes a data random access memory buffer 3522, a weight random access memory buffer 3524, the decoder 3404 of Figure 34, and the mitigation pointer 3512. The mitigation pointer 3512 holds a value that specifies how slowly the array of neural processing units 126 will execute the neural network unit program instructions. Preferably, the mitigation pointer 3512 specifies a divisor value N by which the clock reduction logic 3504 divides the primary clock signal 3506-7 to generate the secondary clock signal 3508, such that the rate of the secondary clock signal is 1/N. Preferably, the value of N may be programmed to any one of a plurality of different predetermined values that cause the clock reduction logic 3504 to generate the secondary clock signal 3508 at a corresponding plurality of different rates, each of which is less than the primary clock rate.
In one embodiment, the clock reduction logic 3504 includes a clock divider circuit that divides the primary clock signal 3506-7 by the mitigation pointer 3512 value. In one embodiment, the clock reduction logic 3504 includes clock gates (e.g., AND gates) that gate the primary clock signal 3506-7 with an enable signal that is true only once every N cycles of the primary clock signal. For example, a circuit including a counter that counts up to N may be used to generate the enable signal. When accompanying logic detects that the counter output matches N, the logic generates a true pulse on the secondary clock signal 3508 and resets the counter. Preferably, the mitigation pointer 3512 value is programmable by an architectural instruction, such as the MTNN instruction 1400 of Figure 14. Preferably, the architectural program running on the processor 100 programs the mitigation value into the mitigation pointer 3512 before it instructs the neural network unit 121 to start running the neural network unit program, as described in more detail below with respect to Figure 37.
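A behavioral sketch of the enable-counter form of the clock reduction logic 3504 described above follows; it is a software illustration of the gating behavior, not register-transfer-level hardware, and the names are illustrative.

```python
# Illustrative model: only one primary clock pulse in every N reaches the
# secondary clock domain, so the secondary rate is 1/N of the primary rate.
def gated_clock(primary_edges, n: int):
    count = 0
    for edge in primary_edges:
        count += 1
        if count == n:        # comparator detects the counter matching N
            count = 0         # reset the counter
            yield edge        # enable signal true: this pulse passes the gate

print(list(gated_clock(range(12), 4)))   # -> [3, 7, 11]: one pulse per 4 primary cycles
```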
The weight random access memory buffer 3524 is coupled between the weight random access memory 124 and the media caches 118 and buffers data transfers between them. Preferably, the weight random access memory buffer 3524 is similar to one or more of the embodiments of the buffer 1704 of Figure 17. Preferably, the portion of the weight random access memory buffer 3524 that receives data from the media caches 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion of the weight random access memory buffer 3524 that receives data from the weight random access memory 124 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced relative to the primary clock rate according to the value programmed into the mitigation pointer 3512, i.e., according to whether the neural network unit 121 is operating in mitigation or normal mode. In one embodiment, the weight random access memory 124 is single-ported, as described above with respect to Figure 17, and is accessed in an arbitrated fashion both by the media caches 118 via the weight random access memory buffer 3524 and by the neural processing units 126 or the column buffer 1104 of Figure 11. In another embodiment, the weight random access memory 124 is dual-ported, as described above with respect to Figure 16, and each port is accessed in a concurrent fashion both by the media caches 118 via the weight random access memory buffer 3524 and by the neural processing units 126 or the column buffer 1104.
Similarly, the data random access memory buffer 3522 is coupled between the data random access memory 122 and the media caches 118 and buffers data transfers between them. Preferably, the data random access memory buffer 3522 is similar to one or more of the embodiments of the buffer 1704 of Figure 17. Preferably, the portion of the data random access memory buffer 3522 that receives data from the media caches 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion of the data random access memory buffer 3522 that receives data from the data random access memory 122 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced relative to the primary clock rate according to the value programmed into the mitigation pointer 3512, i.e., according to whether the neural network unit 121 is operating in mitigation or normal mode. In one embodiment, the data random access memory 122 is single-ported, as described above with respect to Figure 17, and is accessed in an arbitrated fashion both by the media caches 118 via the data random access memory buffer 3522 and by the neural processing units 126 or the column buffer 1104 of Figure 11. In another embodiment, the data random access memory 122 is dual-ported, as described above with respect to Figure 16, and each port is accessed in a concurrent fashion both by the media caches 118 via the data random access memory buffer 3522 and by the neural processing units 126 or the column buffer 1104.
Preferably, regardless of whether the data random access memory 122 and/or the weight random access memory 124 are single-ported or dual-ported, the interface logic 3514 includes the data random access memory buffer 3522 and the weight random access memory buffer 3524 in order to synchronize between the primary clock domain and the secondary clock domain. Preferably, the data random access memory 122, the weight random access memory 124 and the program memory 129 each comprise a static random access memory (SRAM) that includes respective read enable, write enable and memory select enable signals.
As described above, the neural network unit 121 is an execution unit of the processor 100. An execution unit is a functional unit of a processor that executes the microinstructions into which architectural instructions are translated, or that executes architectural instructions themselves, such as the microinstructions 105 into which the architectural instructions 103 of Figure 1 are translated, or the architectural instructions 103 themselves. An execution unit receives operands from registers of the processor, e.g., from the general caching devices 116 and media caches 118. An execution unit that executes a microinstruction or architectural instruction generates a result that may be written to a general register. The MTNN instruction 1400 and MFNN instruction 1500 described with respect to Figures 14 and 15 are examples of architectural instructions 103. Microinstructions implement architectural instructions. More precisely, the collective execution by the execution units of the one or more microinstructions into which an architectural instruction is translated performs the operation specified by the architectural instruction on the inputs specified by the architectural instruction, to produce the result defined by the architectural instruction.
Figure 36A is a timing diagram showing an example of the processor 100 operating with the neural network unit 121 in normal mode, i.e., at the primary clock rate. Time progresses from left to right in the timing diagram. The processor 100 runs architectural programs at the primary clock rate. More specifically, the front end of the processor 100 (e.g., the instruction fetch unit 101, instruction cache 102, instruction translator 104, rename unit 106 and reservation stations 108) fetches, decodes and issues architectural instructions to the neural network unit 121 and the other execution units 112 at the primary clock rate.
Initially, the architectural program executes an architectural instruction (e.g., an MTNN instruction 1400) that the front end of the processor 100 issues to the neural network unit 121 instructing the neural network unit 121 to start running the neural network unit program in its program memory 129. Beforehand, the architectural program executed an architectural instruction to write into the mitigation pointer 3512 a value that specifies the primary clock rate, i.e., to put the neural network unit in normal mode. More specifically, the value programmed into the mitigation pointer 3512 causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at the primary clock rate of the primary clock signal 3506. Preferably, in this case the clock buffers of the clock reduction logic 3504 simply boost the voltage level of the primary clock signal 3506. Also beforehand, the architectural program executed architectural instructions to write the data random access memory 122 and the weight random access memory 124 and to write the neural network unit program into the program memory 129. In response to the MTNN instruction 1400 to start the neural network unit program, the neural network unit 121 starts running the neural network unit program at the primary clock rate, since the mitigation pointer 3512 was programmed with the primary-rate value. After the neural network unit 121 starts running, the architectural program continues to execute architectural instructions at the primary clock rate, including primarily MTNN instructions 1400 to write and/or read the data random access memory 122 and weight random access memory 124 in preparation for the next instance, or invocation, or run, of the neural network unit program.
In the example of Figure 36A, the neural network unit 121 is able to complete the run of the neural network unit program in significantly less time (e.g., one quarter of the time) than the architectural program takes to complete its writes/reads of the data random access memory 122 and weight random access memory 124. For example, running at the primary clock rate, the neural network unit 121 may take approximately 1000 clock cycles to run the neural network unit program, while the architectural program takes approximately 4000 clock cycles. Consequently, the neural network unit 121 sits idle for the remainder of the time, which in this example is a considerable amount of time, e.g., approximately 3000 primary clock cycles. As shown in the example of Figure 36A, this pattern repeats, possibly many consecutive times, depending upon the size and configuration of the neural network. Because the neural network unit 121 is a relatively large and transistor-dense functional unit of the processor 100, its operation may generate a significant amount of heat, particularly when running at the primary clock rate.
Figure 36B is a timing diagram showing an example of the processor 100 operating with the neural network unit 121 in mitigation mode, i.e., at a rate lower than the primary clock rate. The timing diagram of Figure 36B is similar to that of Figure 36A; in Figure 36A, the processor 100 runs the architectural program at the primary clock rate. The example of Figure 36B assumes the same architectural program and neural network unit program as in Figure 36A. However, before starting the neural network unit program, the architectural program executes an MTNN instruction 1400 that programs the mitigation pointer 3512 with a value that causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at a secondary clock rate that is less than the primary clock rate. That is, the architectural program puts the neural network unit 121 in the mitigation mode of Figure 36B rather than the normal mode of Figure 36A. Consequently, the neural processing units 126 run the neural network unit program at the secondary clock rate, which in mitigation mode is less than the primary clock rate. In this example, it is assumed that the mitigation pointer 3512 is programmed with a value that specifies the secondary clock rate to be one quarter of the primary clock rate. As a result, the neural network unit 121 takes four times as long to run the neural network unit program in mitigation mode as it does in normal mode, as may be seen by comparing Figures 36A and 36B, which shows that the amount of time the neural network unit 121 sits idle is significantly shortened. Consequently, the duration over which the neural network unit 121 consumes the energy needed to run the neural network unit program in Figure 36B is approximately four times as long as when the neural network unit 121 runs the program in normal mode in Figure 36A. Accordingly, the rate at which the neural network unit 121 generates heat while running the neural network unit program in Figure 36B is approximately one quarter of that of Figure 36A, with the attendant benefits described herein.
Figure 37 is a flowchart illustrating the operation of the processor 100 of Figure 35. The flowchart describes operation similar to that described above with respect to Figures 35, 36A and 36B. Flow begins at step 3702.
At step 3702, the processor 100 executes MTNN instructions 1400 to write the weights into the weight random access memory 124 and to write the data into the data random access memory 122. Flow proceeds to step 3704.
At step 3704, the processor 100 executes an MTNN instruction 1400 to program the mitigation pointer 3512 with a value that specifies a rate lower than the primary clock rate, i.e., to put the neural network unit 121 in mitigation mode. Flow proceeds to step 3706.
At step 3706, the processor 100 executes an MTNN instruction 1400 instructing the neural network unit 121 to start running a neural network unit program, in the manner illustrated in Figure 36B. Flow proceeds to step 3708.
At step 3708, the neural network unit 121 begins to run the neural network unit program. In parallel, the processor 100 executes MTNN instructions 1400 to write new weights into the weight random access memory 124 (and possibly new data into the data random access memory 122) and/or executes MFNN instructions 1500 to read results from the data random access memory 122 (and possibly results from the weight random access memory 124). Flow proceeds to step 3712.
At step 3712, the processor 100 executes an MFNN instruction 1500 (e.g., reading the status register 127) to detect that the neural network unit 121 has finished running its program. Assuming the architectural program selected a good mitigation pointer 3512 value, the time the neural network unit 121 takes to run the neural network unit program will be about the same as the time the processor 100 takes to execute the portion of the architectural program that accesses the weight random access memory 124 and/or data random access memory 122, as shown in Figure 36B. Flow proceeds to step 3714.
At step 3714, the processor 100 executes an MTNN instruction 1400 to program the mitigation pointer 3512 with a value that specifies the primary clock rate, i.e., to put the neural network unit 121 in normal mode. Flow proceeds to step 3716.
At step 3716, the processor 100 executes an MTNN instruction 1400 instructing the neural network unit 121 to start running a neural network unit program, in the manner illustrated in Figure 36A. Flow proceeds to step 3718.
At step 3718, the neural network unit 121 begins to run the neural network unit program in normal mode. Flow ends at step 3718.
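From the architectural program's point of view, the flow of Figure 37 may be sketched as follows; the wrapper class and method names are purely illustrative stand-ins for the MTNN/MFNN architectural instructions, not an actual programming interface.

```python
class NNUInterface:
    """Hypothetical stand-in for the MTNN/MFNN architectural instructions."""
    def write_weight_ram(self, weights): ...
    def write_data_ram(self, data): ...
    def set_mitigation(self, n): ...              # program the mitigation pointer 3512
    def start_program(self): ...
    def program_done(self) -> bool: return True   # stub: poll via the status register 127

def run_layer(nnu: NNUInterface, weights, data, mitigation_n: int):
    nnu.write_weight_ram(weights)       # step 3702
    nnu.write_data_ram(data)            # step 3702
    nnu.set_mitigation(mitigation_n)    # step 3704: secondary clock = primary / N
    nnu.start_program()                 # step 3706: run in mitigation mode
    while not nnu.program_done():       # step 3712: poll for completion
        pass                            # step 3708: meanwhile move new weights/data, read results
    nnu.set_mitigation(1)               # step 3714: back to normal mode
    nnu.start_program()                 # steps 3716/3718: next run at the primary rate
```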
As described above, running the neural network unit program in mitigation mode spreads out the time over which it runs and avoids high temperatures, relative to running it in normal mode (i.e., at the primary clock rate of the processor). More specifically, when the neural network unit runs a program in mitigation mode, it generates heat at a lower rate because it runs at the lower clock rate, and that heat can be dissipated more gracefully through the neural network unit's package (e.g., the semiconductor devices, metal layers and underlying substrate) and the surrounding cooling mechanisms (e.g., heat sink, fans); consequently, the devices (e.g., transistors, capacitors, wires) in the neural network unit are likely to operate at lower temperatures. Operating in mitigation mode also tends to reduce device temperatures in other parts of the processor die. Lower operating temperatures, in particular lower junction temperatures of the devices, reduce the generation of leakage current. Furthermore, because less current flows per unit time, inductive noise and IR-drop noise are reduced. Additionally, lower temperatures have a positive effect on the negative-bias temperature instability (NBTI) and positive-bias temperature instability (PBTI) of the MOSFETs of the processor, improving reliability and/or the lifetime of the devices and of the processor part. Lower temperatures also reduce Joule heating and electromigration in the metal layers of the processor.
Communication mechanism between the architectural program and the non-architectural program regarding neural network unit shared resources
As described above, in the examples of Figures 24 through 28 and 35 through 37, the resources of the data random access memory 122 and the weight random access memory 124 are shared. The neural processing units 126 and the front end of the processor 100 share the data random access memory 122 and the weight random access memory 124. More specifically, both the neural processing units 126 and the front end of the processor 100, e.g., the media caches 118, read from and write to the data random access memory 122 and the weight random access memory 124. In other words, an architectural program running on the processor 100 and a neural network unit program running on the neural network unit 121 share the data random access memory 122 and the weight random access memory 124, and in some cases, as described above, the flow between the architectural program and the neural network unit program must be controlled. The resources of the program memory 129 are also shared to some degree, since the architectural program writes it and the sequencer 128 reads it. The embodiments described herein provide a high-performance solution for controlling the flow of access to the shared resources between the architectural program and the neural network unit program.
In the embodiments described herein, a neural network unit program is also referred to as a non-architectural program, a neural network unit instruction is also referred to as a non-architectural instruction, and the neural network unit instruction set (also referred to above as the neural processing unit instruction set) is also referred to as the non-architectural instruction set. The non-architectural instruction set is distinct from the architectural instruction set. In embodiments in which the processor 100 includes an instruction translator 104 that translates architectural instructions into microinstructions, the non-architectural instruction set is also distinct from the microinstruction set.
Figure 38 is a block diagram showing the sequencer 128 of the neural network unit 121 in more detail. The sequencer 128 provides a memory address to the program memory 129 to select a non-architectural instruction that is provided to the sequencer 128, as described above. As shown in Figure 38, the memory address is held in a program counter 3802 of the sequencer 128. The sequencer 128 generally increments sequentially through the addresses of the program memory 129 unless it encounters a non-architectural control instruction, such as a loop or branch instruction, in which case the sequencer 128 updates the program counter 3802 to the target address of the control instruction, i.e., to the address of the non-architectural instruction at the target of the control instruction. Thus, the address 131 held in the program counter 3802 specifies the address in the program memory 129 of the non-architectural instruction of the non-architectural program currently being fetched for execution by the neural processing units 126. The program counter 3802 value may be obtained by the architectural program via the neural network unit program counter field 3912 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to decide where to read/write data in the data random access memory 122 and/or weight random access memory 124 based on the progress of the non-architectural program.
The sequencer 128 also includes a loop counter 3804 that operates in conjunction with non-architectural loop instructions, such as the loop-to-1 instruction at address 10 of Figure 26A and the loop-to-1 instruction at address 11 of Figure 28. In the examples of Figures 26A and 28, the loop counter 3804 is loaded with the value specified by the non-architectural initialization instruction at address 0, e.g., the value 400. Each time the sequencer 128 encounters the loop instruction and jumps to the target instruction (e.g., the multiply-accumulate instruction at address 1 of Figure 26A or the maxwacc instruction at address 1 of Figure 28), the sequencer 128 decrements the loop counter 3804. Once the loop counter 3804 reaches zero, the sequencer 128 proceeds to the next sequential non-architectural instruction. In an alternative embodiment, the loop counter is loaded with a loop count value specified in the loop instruction itself the first time the loop instruction is encountered, which obviates the need to initialize the loop counter 3804 with a non-architectural initialization instruction. Thus, the loop counter 3804 value indicates the number of times the loop body of the non-architectural program remains to be executed. The loop counter 3804 value may be obtained by the architectural program via the loop count field 3914 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to decide where to read/write data in the data random access memory 122 and/or weight random access memory 124 based on the progress of the non-architectural program. In one embodiment, the sequencer includes three additional loop counters to accommodate nested loops in the non-architectural program, and the values of these three loop counters are also readable via the status register 127. A bit in the loop instruction indicates which of the four loop counters is used by the current loop instruction.
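The interaction between the program counter 3802 and a loop counter may be illustrated behaviorally as follows; this is an assumption about the control flow made for exposition, not the sequencer's actual implementation.

```python
# Illustrative model: a LOOP instruction branches back to its target until the
# loop counter is exhausted, so the loop body executes `count` times in all.
def sequence(program, count):
    pc, trace = 0, []
    while pc < len(program):
        op, target = program[pc]
        trace.append(pc)
        if op == "LOOP" and count > 1:
            count -= 1
            pc = target              # jump back to the loop target
        else:
            pc += 1                  # fall through when the loop is exhausted
    return trace

prog = [("INITIALIZE", None), ("MULT-ACCUM", None), ("LOOP", 1), ("OUTPUT", None)]
print(sequence(prog, 3))   # -> [0, 1, 2, 1, 2, 1, 2, 3]
```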
The sequencer 128 also includes an iteration counter 3806. The iteration counter 3806 operates in conjunction with non-architectural instructions such as the multiply-accumulate instructions at address 2 of Figures 4, 9, 20 and 26A and the maxwacc instruction at address 2 of Figure 28, which will be referred to hereafter as "execute" instructions. In the examples above, these execute instructions specify iteration counts of 511, 511, 1023, 2 and 3, respectively. When the sequencer 128 encounters an execute instruction that specifies a non-zero iteration count, the sequencer 128 loads the iteration counter 3806 with the specified value. In addition, the sequencer 128 generates an appropriate micro-operation 3418 to control the logic in the pipeline stages 3401 of the neural processing units 126 of Figure 34 for execution and decrements the iteration counter 3806. If the iteration counter 3806 is greater than zero, the sequencer 128 again generates an appropriate micro-operation 3418 to control the logic in the neural processing units 126 and decrements the iteration counter 3806. The sequencer 128 continues to operate in this fashion until the iteration counter 3806 reaches zero. Thus, the iteration counter 3806 value indicates the number of operations specified within the non-architectural execute instruction that remain to be performed (e.g., multiply-accumulate, maximum or accumulate operations performed on the accumulated value and a data/weight word). The iteration counter 3806 value may be obtained by the architectural program via the iteration count field 3916 of the status register 127, as described below with respect to Figure 39. This enables the architectural program to decide where to read/write data in the data random access memory 122 and/or weight random access memory 124 based on the progress of the non-architectural program.
Figure 39 is a block diagram showing certain fields of the control and status register 127 of the neural network unit 121. The fields include the address 2602 of the weight random access memory row most recently written by the neural processing units 126 executing the non-architectural program, the address 2604 of the weight random access memory row most recently read by the neural processing units 126 executing the non-architectural program, the address 2606 of the data random access memory row most recently written by the neural processing units 126 executing the non-architectural program, and the address 2608 of the data random access memory row most recently read by the neural processing units 126 executing the non-architectural program, as described above with respect to Figure 26B. The fields also include a neural network unit program counter field 3912, a loop count field 3914 and an iteration count field 3916. As described above, the architectural program may read the data of the status register 127 into a media cache 118 and/or general caching device 116, e.g., via MFNN instructions 1500 that read the values of the neural network unit program counter 3912, loop count 3914 and iteration count 3916 fields. The value of the program counter field 3912 reflects the value of the program counter 3802 of Figure 38. The value of the loop count field 3914 reflects the value of the loop counter 3804. The value of the iteration count field 3916 reflects the value of the iteration counter 3806. In one embodiment, the sequencer 128 updates the program counter field 3912, loop count field 3914 and iteration count field 3916 values each time it adjusts the program counter 3802, loop counter 3804 or iteration counter 3806, so that the field values are current when the architectural program reads them. In another embodiment, when the neural network unit 121 executes an architectural instruction that reads the status register 127, the neural network unit 121 simply obtains the program counter 3802, loop counter 3804 and iteration counter 3806 values at that moment and provides them back for the architectural instruction (e.g., into a media cache 118 or general caching device 116).
It may thus be seen that the values of the fields of the status register 127 of Figure 39 may be characterized as information indicating the progress made by the non-architectural program during its execution by the neural network unit. Certain specific aspects of the non-architectural program's progress, such as the program counter 3802 value, the loop counter 3804 value, the iteration counter 3806 value, the fields 2602/2604 of the most recently read/written weight random access memory 124 address 125, and the fields 2606/2608 of the most recently read/written data random access memory 122 address 123, were described in the sections above. The architectural program running on the processor 100 may read the non-architectural program progress values of Figure 39 from the status register 127 and use the information to make decisions, e.g., by architectural instructions such as compare and branch instructions. For example, the architectural program decides which rows of the data random access memory 122 and/or weight random access memory 124 to read/write data/weights from/to, in order to control the flow of data into and out of the data random access memory 122 or weight random access memory 124, particularly for overlapped execution over large data sets and/or of different non-architectural instructions. Examples of such decisions made by the architectural program are described in the sections above and below.
For example, as described above with respect to Figure 26A, the framework program configures the nand architecture program to write the results of the convolution operation back to the columns of the data random access memory 122 above the convolution kernel 2402 (e.g., above column 8), and the framework program reads these results from the data random access memory 122 as the neural network unit 121 writes them, using the address 2606 of the most recently written data random access memory 122 column.
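A minimal sketch of this polling pattern follows (the helpers mfnn_read_status and mfnn_read_dram_column are hypothetical wrappers around MFNN instruction 1500 reads, and consume stands for whatever the framework program does with a result column):

    def read_convolution_results(first_result_column, num_result_columns):
        next_column = first_result_column
        last_column = first_result_column + num_result_columns - 1
        while next_column <= last_column:
            written = mfnn_read_status(FIELD_DRAM_WRITTEN_2606)  # most recently written data RAM column
            # Consume every result column the nand architecture program has already produced.
            while next_column <= min(written, last_column):
                consume(mfnn_read_dram_column(next_column))
                next_column += 1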
In another example, as described above with respect to Figure 26B, the framework program uses the status register 127 fields of Figure 38 to determine the progress of the nand architecture program in performing the convolution operation on the data array 2404 of Figure 24 as five 512 x 1600 data blocks. The framework program writes the first 512 x 1600 data block of the 2560 x 1600 data array into the weight random access memory 124 and starts the nand architecture program, with a cycle count of 1600 and the weight random access memory 124 output column initialized to 0. While the neural network unit 121 executes the nand architecture program, the framework program reads the status register 127 to determine the most recently written column 2602 of the weight random access memory 124, so that it can read the valid convolution results written by the nand architecture program and, after reading them, overwrite them with the next 512 x 1600 data block. In this way, as soon as the neural network unit 121 completes execution of the nand architecture program on the first 512 x 1600 data block, the processor 100 can immediately update the nand architecture program if necessary and start it again to process the next 512 x 1600 data block.
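The following simplified sketch outlines that double-buffering pattern (the helper names are hypothetical, and checking for completion of the nand architecture program between blocks is omitted for brevity):

    BLOCK_COLUMNS, NUM_BLOCKS = 1600, 5   # five 512 x 1600 blocks of the 2560 x 1600 data array

    def convolve_all_blocks(blocks):
        write_weight_ram_columns(0, blocks[0])            # load the first input block
        start_nnu_program(CONVOLUTION_PROGRAM)            # cycle count 1600, output column starts at 0
        for b in range(NUM_BLOCKS):
            done = 0
            while done < BLOCK_COLUMNS:
                written = mfnn_read_status(FIELD_WRAM_WRITTEN_2602)
                while done <= written:
                    consume(read_weight_ram_column(done)) # read a finished convolution result column
                    if b + 1 < NUM_BLOCKS:
                        # reuse the column immediately for the next input block (double buffering)
                        write_weight_ram_column(done, blocks[b + 1][done])
                    done += 1
            if b + 1 < NUM_BLOCKS:
                start_nnu_program(CONVOLUTION_PROGRAM)    # restart for the next block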
In another example, assume the framework program has the neural network unit 121 perform a series of typical neural network multiply-accumulate-run function operations, in which the weights are stored in the weight random access memory 124 and the results are written back to the data random access memory 122. In this case, once the nand architecture program has read a column of the weight random access memory 124, it will not read it again. Thus, once the current weights have been read/used by the nand architecture program, the framework program can begin overwriting the weights in the weight random access memory 124 with new weights for the next instance of the nand architecture program (for example, the next neural network layer). In this case, the framework program reads the status register 127 to obtain the address of the most recently read column 2604 of the weight random access memory in order to decide where to write the new weight group into the weight random access memory 124.
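A short sketch of that decision, with the same hypothetical helpers, streaming the next layer's weights in behind the nand architecture program's read pointer:

    def stream_next_layer_weights(new_weights, first_column, num_columns):
        for c in range(first_column, first_column + num_columns):
            # Wait until the nand architecture program has read this weight column...
            while mfnn_read_status(FIELD_WRAM_READ_2604) < c:
                pass
            # ...then it is safe to overwrite it with a weight column for the next layer.
            write_weight_ram_column(c, new_weights[c - first_column])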
In another example, assume the framework program knows that the nand architecture program includes an execute instruction with a large iteration count, such as the nand architecture multiply-accumulate instruction at address 2 of Figure 20. In this case, the framework program needs to know the iteration count 3916 in order to know approximately how many clock cycles it will take to complete the nand architecture instruction, so that the framework program can decide which of two or more courses of action to take next. For example, if the time to completion is long, the framework program may relinquish control to another framework program, such as the operating system. Similarly, assume the framework program knows that the nand architecture program includes a loop group with a sizeable cycle count, such as the nand architecture program of Figure 28. In this case, the framework program may need to know the cycle count 3914 in order to know approximately how many clock cycles it will take to complete the nand architecture program, so that it can decide which of two or more courses of action to take next.
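A sketch of this kind of decision follows (the cycles-per-iteration estimate, the threshold and the helper names are assumptions made only for illustration):

    CYCLES_PER_ITERATION = 1        # assumed: roughly one multiply-accumulate clock per iteration
    YIELD_THRESHOLD = 100_000       # assumed threshold, in clock cycles

    def maybe_yield_to_os():
        remaining_iterations = mfnn_read_status(FIELD_ITERATION_COUNTER_3916)
        remaining_loops = mfnn_read_status(FIELD_CYCLE_COUNTER_3914)
        estimated_cycles = remaining_iterations * max(remaining_loops, 1) * CYCLES_PER_ITERATION
        if estimated_cycles > YIELD_THRESHOLD:
            yield_to_operating_system()   # relinquish control instead of spin-waiting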
In another example, assume the framework program has the neural network unit 121 perform a common source (i.e., pooling) operation similar to that described with respect to Figures 27 and 28, in which the data to be pooled is stored in the weight random access memory 124 and the results are written back to the weight random access memory 124. However, unlike the example of Figures 27 and 28, assume the results of this example are written back to the top 400 columns of the weight random access memory 124, for example columns 1600 through 1999. In this case, once the nand architecture program has finished reading the four columns of weight random access memory 124 data that it pools, it will not read them again. Therefore, once all of the current four columns of data have been read/used by the nand architecture program, the framework program can begin overwriting the data in the weight random access memory 124 with new data (for example, the weights for the next instance of the nand architecture program, such as a nand architecture program that performs typical multiply-accumulate-run function operations on the pooled data). In this case, the framework program reads the status register 127 to obtain the address of the most recently read column 2604 of the weight random access memory in order to decide where to write the new data into the weight random access memory 124.
Time recurrent neural network acceleration
A conventional feedforward neural network has no memory of previous inputs to the network. Feedforward neural networks are generally used to perform tasks in which the multiple inputs presented to the network over time are independent of one another, as are the multiple outputs. In contrast, time recurrent neural networks are generally helpful for performing tasks in which the sequence of inputs presented to the neural network over time matters. (The items of the sequence are commonly referred to as time steps.) Accordingly, a time recurrent neural network includes a notional memory, or internal state, that holds information based on the calculations the network performed in response to previous inputs in the sequence, and the output of the time recurrent neural network is a function of this internal state as well as of the input of the next time step. Tasks such as speech recognition, language modeling, text generation, language translation, image caption generation, and certain forms of handwriting recognition are examples of tasks that time recurrent neural networks perform well.
Three well-known examples of time recurrent neural networks are the Elman time recurrent neural network, the Jordan time recurrent neural network, and the long short-term memory (LSTM) neural network. An Elman time recurrent neural network includes content nodes that remember the hidden layer state of the time recurrent neural network for the current time step, and this state serves as an input to the hidden layer for the next time step. A Jordan time recurrent neural network is similar to an Elman time recurrent neural network, except that its content nodes remember the output layer state of the time recurrent neural network rather than the hidden layer state. An LSTM neural network includes an LSTM layer made up of LSTM cells. Each LSTM cell has a current state and a current output for the current time step, and a new state and a new output for a new, or subsequent, time step. An LSTM cell includes an input gate and an output gate, as well as a forget gate that can cause the neuron to lose the state it has remembered. These three kinds of time recurrent neural network are described in more detail in the following sections.
As described herein, for a time recurrent neural network such as an Elman or Jordan time recurrent neural network, each execution of the neural network unit takes a time step, takes a set of input layer node values, and performs the calculations necessary to propagate them through the time recurrent neural network, in order to generate the output layer node values as well as the hidden layer and content layer node values. Thus, the input layer node values are associated with the time step in which the hidden, output and content layer node values are calculated; and the hidden, output and content layer node values are associated with the time step in which they are generated. The input layer node values are sampled values of the system being modeled by the time recurrent neural network, for example images, speech samples, or snapshots of commercial market data. For an LSTM neural network, each execution of the neural network unit takes a time step, takes a set of memory cell input values, and performs the calculations necessary to generate the memory cell output values (as well as the memory cell state and the input gate, forget gate and output gate values); this may also be understood as propagating the memory cell input values through the LSTM layer cells. Thus, the memory cell input values are associated with the time step in which the memory cell state and the input gate, forget gate and output gate values are calculated; and the memory cell state and the input gate, forget gate and output gate values are associated with the time step in which they are generated.
A content layer node value, also referred to as a state node, is a state value of the neural network, namely a state value that is based on the input layer node values associated with previous time steps, not merely the input layer node value associated with the current time step. The calculations the neural network unit performs for a time step (for example, the hidden layer node value calculations of an Elman or Jordan time recurrent neural network) are a function of the content layer node values generated in the previous time step. Therefore, the network state value at the beginning of a time step (the content node values) influences the output layer node values generated during that time step. Furthermore, the network state value at the end of the time step is influenced by both the input node values of that time step and the network state value at the beginning of the time step. Similarly, for an LSTM cell, the memory cell state value is based on the memory cell input values associated with previous time steps, not merely the memory cell input value associated with the current time step. Because the calculations the neural network unit performs for a time step (for example, the calculation of the next memory cell state) are a function of the memory cell state value generated in the previous time step, the network state value at the beginning of the time step (the memory cell state value) influences the memory cell output values generated during that time step, and the network state value at the end of the time step is influenced by both the memory cell input values of that time step and the previous network state value.
Figure 40 is a block diagram showing an example of an Elman time recurrent neural network. The Elman time recurrent neural network of Figure 40 includes input layer nodes, or neurons, denoted D0, D1 through Dn, referred to collectively as the input layer nodes D and individually, generically, as an input layer node D; hidden layer nodes/neurons, denoted Z0, Z1 through Zn, referred to collectively as the hidden layer nodes Z and individually, generically, as a hidden layer node Z; output layer nodes/neurons, denoted Y0, Y1 through Yn, referred to collectively as the output layer nodes Y and individually, generically, as an output layer node Y; and content layer nodes/neurons, denoted C0, C1 through Cn, referred to collectively as the content layer nodes C and individually, generically, as a content layer node C. In the example Elman time recurrent neural network of Figure 40, each hidden layer node Z has an input linked to the output of each input layer node D and an input linked to the output of each content layer node C; each output layer node Y has an input linked to the output of each hidden layer node Z; and each content layer node C has an input linked to the output of its corresponding hidden layer node Z.
In many respects, an Elman time recurrent neural network operates similarly to a conventional feedforward neural network. That is, for a given node, each input connection of the node has an associated weight; the value the node receives on an input connection is multiplied by the associated weight to generate a product; the node adds the products associated with all of its input connections to generate a sum (a bias term may also be included in this sum); generally, a run function is then performed on the sum to generate the node's output value, which is sometimes referred to as the activation value of the node. For a conventional feedforward network, data always flows in one direction, from the input layer toward the output layer. That is, the input layer provides values to the hidden layer (there are usually multiple hidden layers), the hidden layer generates its output values and provides them to the output layer, and the output layer generates the output that can be taken.
However, unlike a conventional feedforward network, an Elman time recurrent neural network also includes feedback connections, namely the connections in Figure 40 from the hidden layer nodes Z to the content layer nodes C. The Elman time recurrent neural network operates as follows: when the input layer nodes D provide an input value to the hidden layer nodes Z in a new time step, the content nodes C provide to the hidden layer Z the values that the hidden layer nodes Z output in response to the previous input, that is, in the previous time step. In this sense, the content nodes C of an Elman time recurrent neural network are a memory of the input values of previous time steps. Figures 41 and 42 illustrate an embodiment of the operation of the neural network unit 121 in performing the calculations associated with the Elman time recurrent neural network of Figure 40.
For purposes of the present invention, an Elman time recurrent neural network is a time recurrent neural network comprising at least an input node layer, a hidden node layer, an output node layer, and a content node layer. For a given time step, the content node layer stores the results that the hidden node layer generated in the previous time step and fed back to the content node layer. The results fed back to the content layer may be the results of the run function, or they may be the results of the accumulation operations performed by the hidden node layer without a run function being performed.
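Summarizing the above as equations (the weight matrix symbols W_{DZ}, W_{CZ} and W_{ZY} are introduced here only for illustration; f and g denote the hidden and output layer run functions, and f may simply pass the accumulated value through, as in the example of Figure 42 below):

    Z_t = f( W_{DZ} D_t + W_{CZ} C_t )
    Y_t = g( W_{ZY} Z_t )
    C_{t+1} = Z_t ,   C_0 = 0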
Figure 41 is a block diagram showing an example of the layout of data in the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 as the neural network unit 121 performs the calculations associated with the Elman time recurrent neural network of Figure 40. The example of Figure 41 assumes that the Elman time recurrent neural network of Figure 40 has 512 input nodes D, 512 hidden nodes Z, 512 content nodes C, and 512 output nodes Y. It further assumes that the Elman time recurrent neural network is fully connected, i.e., all 512 input nodes D link each hidden node Z as input, all 512 content nodes C link each hidden node Z as input, and all 512 hidden nodes Z link each output node Y as input. In addition, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, for example in a wide configuration. Finally, this example assumes that the weights associated with the connections from the content nodes C to the hidden nodes Z all have the value 1, so these unity weight values need not be stored.
As shown in the figure, the lower 512 columns (columns 0 through 511) of the weight random access memory 124 hold the weight values associated with the connections between the input nodes D and the hidden nodes Z. More precisely, as shown, column 0 holds the weights associated with the input connections from input node D0 to the hidden nodes Z; that is, text 0 holds the weight associated with the connection between input node D0 and hidden node Z0, text 1 holds the weight associated with the connection between input node D0 and hidden node Z1, text 2 holds the weight associated with the connection between input node D0 and hidden node Z2, and so on, through text 511, which holds the weight associated with the connection between input node D0 and hidden node Z511. Column 1 holds the weights associated with the input connections from input node D1 to the hidden nodes Z; that is, text 0 holds the weight associated with the connection between input node D1 and hidden node Z0, text 1 holds the weight associated with the connection between input node D1 and hidden node Z1, text 2 holds the weight associated with the connection between input node D1 and hidden node Z2, and so on, through text 511, which holds the weight associated with the connection between input node D1 and hidden node Z511. This pattern continues through column 511, which holds the weights associated with the input connections from input node D511 to the hidden nodes Z; that is, text 0 holds the weight associated with the connection between input node D511 and hidden node Z0, text 1 holds the weight associated with the connection between input node D511 and hidden node Z1, text 2 holds the weight associated with the connection between input node D511 and hidden node Z2, and so on, through text 511, which holds the weight associated with the connection between input node D511 and hidden node Z511. This arrangement and use is similar to the embodiments described above with respect to Figures 4 through 6A.
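A minimal sketch of this layout (a host-side model only; the numpy arrays and the W_DZ and W_ZY names are illustrative, and the Z-to-Y half is assumed to follow the same per-column pattern as the D-to-Z half, as indicated in the next paragraph):

    import numpy as np

    N = 512
    W_DZ = np.zeros((N, N), dtype=np.float32)   # W_DZ[d, z] = weight of connection Dd -> Zz
    W_ZY = np.zeros((N, N), dtype=np.float32)   # W_ZY[z, y] = weight of connection Zz -> Yy

    # Model the weight RAM as columns of 512 texts each.
    weight_ram = np.zeros((1024, N), dtype=np.float32)

    # Columns 0..511: text z of column d holds the weight of the connection Dd -> Zz.
    for d in range(N):
        weight_ram[d, :] = W_DZ[d, :]

    # Columns 512..1023: text y of column 512+z holds the weight of the connection Zz -> Yy.
    for z in range(N):
        weight_ram[512 + z, :] = W_ZY[z, :]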
As shown in the figure, the subsequent 512 columns (columns 512 through 1023) of the weight random access memory 124 similarly hold the weights associated with the connections between the hidden nodes Z and the output nodes Y.
The data random access memory 122 holds the Elman time recurrent neural network node values for a series of time steps. More specifically, the data random access memory 122 holds the node values for a given time step in a group of three columns. As shown in the figure, taking a data random access memory 122 with 64 columns as an example, the data random access memory 122 can hold the node values used by 20 different time steps. In the example of Figure 41, columns 0 through 2 hold the node values used by time step 0, columns 3 through 5 hold the node values used by time step 1, and so on, through columns 57 through 59, which hold the node values used by time step 19. The first column of each group holds the values of the input nodes D of that time step. The second column of each group holds the values of the hidden nodes Z of that time step. The third column of each group holds the values of the output nodes Y of that time step. As shown in the figure, each row of the data random access memory 122 holds the node values of its corresponding neuron, or neural processing unit 126. That is, row 0 holds the node values associated with nodes D0, Z0 and Y0, whose calculations are performed by neural processing unit 0; row 1 holds the node values associated with nodes D1, Z1 and Y1, whose calculations are performed by neural processing unit 1; and so on, through row 511, which holds the node values associated with nodes D511, Z511 and Y511, whose calculations are performed by neural processing unit 511, as described in more detail below with respect to Figure 42.

As indicated in Figure 41, for a given time step, the hidden node Z values in the second column of the group of three columns are the content node C values of the next time step. That is, the value of node Z that a neural processing unit 126 calculates and writes during a time step becomes the value of node C that the neural processing unit 126 uses (along with the value of the input node D of the next time step) to calculate the value of node Z in the next time step. The initial value of the content nodes C (the value of node C used in time step 0 to calculate the value of node Z in column 1) is assumed to be zero. This is described in more detail in the sections below relating to the nand architecture program of Figure 42.
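As a small sketch of this addressing (the helper name is hypothetical), the data random access memory 122 column that holds a particular node value of a given time step can be computed as follows:

    def elman_dram_column(time_step: int, kind: str) -> int:
        # Three columns per time step: input nodes D, hidden nodes Z, output nodes Y (Figure 41).
        offset = {"D": 0, "Z": 1, "Y": 2}[kind]
        return 3 * time_step + offset

    # Example: the output node Y values of time step 19 are in column 59,
    # and neural processing unit j reads/writes text j of that column.
    assert elman_dram_column(19, "Y") == 59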
Preferably, the input node D values (the values in columns 0, 3, and so on through column 57 in the example of Figure 41) are written/filled into the data random access memory 122 by the framework program executing on the processor 100 via MTNN instructions 1400, and are read/used by the nand architecture program executing on the neural network unit 121, such as the nand architecture program of Figure 42. Conversely, the hidden/output node Z/Y values (the values in columns 1 and 2, 4 and 5, and so on through columns 58 and 59 in the example of Figure 41) are written/filled into the data random access memory 122 by the nand architecture program executing on the neural network unit 121, and are read/used by the framework program executing on the processor 100 via MFNN instructions 1500. The example of Figure 41 assumes that the framework program performs the following steps: (1) for 20 different time steps, fill the data random access memory 122 with the values of the input nodes D (columns 0, 3, and so on through column 57); (2) start the nand architecture program of Figure 42; (3) detect whether the nand architecture program has finished; (4) read the values of the output nodes Y from the data random access memory 122 (columns 2, 5, and so on through column 59); and (5) repeat steps (1) through (4) as many times as needed to complete the task, for example the calculations required to recognize the speech of a mobile phone user.
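A minimal host-side sketch of these five steps (mtnn_write_dram_column, start_nnu_program, nnu_program_done and mfnn_read_dram_column are hypothetical stand-ins for the MTNN/MFNN-based framework routines, and FIGURE_42_PROGRAM stands for the nand architecture program of Figure 42):

    TIME_STEPS = 20

    def run_elman(inputs):
        # inputs: TIME_STEPS vectors of 512 input node D values each.
        for t in range(TIME_STEPS):                          # step (1)
            mtnn_write_dram_column(3 * t, inputs[t])
        start_nnu_program(FIGURE_42_PROGRAM)                 # step (2)
        while not nnu_program_done():                        # step (3)
            pass
        return [mfnn_read_dram_column(3 * t + 2)             # step (4)
                for t in range(TIME_STEPS)]                  # step (5): the caller repeats as needed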
In another approach, the framework program performs the following steps: (1) for a single time step, fill the data random access memory 122 with the values of the input nodes D (e.g., column 0); (2) start the nand architecture program (a modified version of the Figure 42 nand architecture program that does not loop and accesses only a single group of three columns of the data random access memory 122); (3) detect whether the nand architecture program has finished; (4) read the values of the output nodes Y from the data random access memory 122 (e.g., column 2); and (5) repeat steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is better may depend on the manner in which the input values of the time recurrent neural network are sampled. For example, if the task permits sampling the input over multiple time steps (e.g., on the order of 20 time steps) and performing the calculations afterwards, the first approach may be preferable, since it likely brings more computational resource efficiency and/or better performance; whereas if the task only permits sampling in a single time step, the second approach is required.

A third embodiment is similar to the second approach but, rather than using a single group of three columns of the data random access memory 122, the nand architecture program of this approach uses multiple groups of three columns, that is, a different group of three columns for each time step, similar to the first approach. In this third embodiment, preferably, the framework program includes a step before step (2) in which it updates the nand architecture program before starting it, for example by updating the data random access memory 122 column in the instruction at address 1 to point to the next group of three columns.
Figure 42 is a table showing a program stored in the program memory 129 of the neural network unit 121 that is executed by the neural network unit 121 and that uses data and weights according to the arrangement of Figure 41 to implement the Elman time recurrent neural network. Some of the instructions in the nand architecture program of Figure 42 (and of Figures 45, 48, 51, 54 and 57), such as the multiply-accumulate (MULT-ACCUM), loop (LOOP) and initialize (INITIALIZE) instructions, are described in detail above, and the following paragraphs assume they are consistent with that description unless otherwise noted.
The example program of Figure 42 includes 13 nand architecture instructions, at addresses 0 through 12. The instruction at address 0 (INITIALIZE NPU, LOOPCNT=20) clears the accumulator 202 and initializes the cycle counter 3804 to the value 20 in order to perform the loop group (the instructions of addresses 4 through 11) 20 times. Preferably, the initialization instruction also places the neural network unit 121 in a wide configuration, so that the neural network unit 121 is configured as 512 neural processing units 126. As described in the following sections, during the execution of the instructions of addresses 1 through 3 and addresses 7 through 11, the 512 neural processing units 126 operate as the 512 corresponding hidden layer nodes Z, and during the execution of the instructions of addresses 4 through 6, the 512 neural processing units 126 operate as the 512 corresponding output layer nodes Y.

The instructions at addresses 1 through 3 are outside the program loop and execute only once. They calculate the initial values of the hidden layer nodes Z and write them to column 1 of the data random access memory 122 for use by the first execution of the instructions of addresses 4 through 6, which calculate the output layer nodes Y of the first time step (time step 0). In addition, the hidden layer node Z values calculated by the instructions of addresses 1 through 3 and written to column 1 of the data random access memory 122 become the content layer node C values used by the first execution of the instructions of addresses 7 and 8 in calculating the hidden layer node Z values for the second time step (time step 1).
During execution of the instructions at addresses 1 and 2, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in column 0 of the data random access memory 122 by the weights of the row corresponding to the neural processing unit 126 in columns 0 through 511 of the weight random access memory 124, to generate 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126. During execution of the instruction at address 3, the values of the 512 accumulators 202 of the 512 neural processing units are passed through and written to column 1 of the data random access memory 122. That is, the output instruction of address 3 writes the accumulator 202 value of each of the 512 neural processing units, which is the initial hidden layer Z value, to column 1 of the data random access memory 122, and then clears the accumulator 202.

The operations performed by the instructions at addresses 1 through 2 of the nand architecture program of Figure 42 are similar to the operations performed by the instructions at addresses 1 through 2 of the nand architecture program of Fig. 4. More specifically, the instruction at address 1 (MULT_ACCUM DR ROW 0) instructs each of the 512 neural processing units 126 to read the corresponding text of column 0 of the data random access memory 122 into its multitask buffer 208, to read the corresponding text of column 0 of the weight random access memory 124 into its multitask buffer 705, to multiply the data text by the weight text to generate a product, and to add the product to the accumulator 202. The instruction at address 2 (MULT-ACCUM ROTATE, WR ROW+1, COUNT=511) instructs each of the 512 neural processing units 126 to rotate the text from the adjacent neural processing unit 126 into its multitask buffer 208 (using the 512-text rotator formed by the collective operation of the 512 multitask buffers 208 of the neural network unit 121, the buffers into which the instruction at address 1 directed the column of the data random access memory 122 to be read), to read the corresponding text of the next column of the weight random access memory 124 into its multitask buffer 705, to multiply the data text by the weight text to generate a product and add the product to the accumulator 202, and to perform these operations 511 times.
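A minimal software model of what this instruction pair computes (numpy is used only for brevity; the rotation direction chosen for np.roll is an assumption, and the weight ordering of Figure 41 is arranged so that each neural processing unit 126 ends up accumulating the weighted sum for its own hidden layer node Z, as described above):

    import numpy as np

    N = 512

    def mult_accum_with_rotate(data_column, weight_ram, first_weight_column):
        acc = np.zeros(N)                   # one accumulator 202 per neural processing unit
        mux = data_column.copy()            # address 1: read the data RAM column into the multitask buffers 208
        for step in range(N):               # 1 initial step + 511 rotate steps (COUNT=511)
            acc += mux * weight_ram[first_weight_column + step]   # per-unit product added to accumulator 202
            mux = np.roll(mux, 1)           # address 2: rotate texts to the adjacent neural processing unit
        return acc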
In addition, the single nand architecture output instruction at address 3 of Figure 42 (OUTPUT PASSTHRU, DR OUT ROW 1, CLR ACC) merges the operations of the run function instruction and of the write output instruction at addresses 3 and 4 of Fig. 4 (although the program of Figure 42 passes the accumulator 202 value through, whereas the program of Fig. 4 performs a run function on the accumulator 202 value). That is, in the program of Figure 42, the run function performed on the accumulator 202 value, if any, is specified in the output instruction (as it also is in the output instructions of addresses 6 and 11), rather than in a distinct nand architecture run function instruction as in the program of Fig. 4. Another embodiment of the nand architecture program of Fig. 4 (and of Figures 20, 26A and 28), in which the operations of the run function instruction and of the write output instruction are merged into a single nand architecture output instruction as in Figure 42, also falls within the scope of the present invention. The example of Figure 42 assumes that the nodes of the hidden layer (Z) do not perform a run function on the accumulator value. However, embodiments in which the hidden layer (Z) performs a run function on the accumulator value also fall within the scope of the present invention; in those embodiments the instructions at addresses 3 and 11 perform that run function, for example an S-type, tanh or rectify function.
Unlike the instructions at addresses 1 through 3, which execute only once, the instructions at addresses 4 through 11 are inside the program loop and execute the number of times specified by the cycle count (e.g., 20). The first 19 executions of the instructions at addresses 7 through 11 calculate the hidden layer node Z values and write them to the data random access memory 122 for use by the second through twentieth executions of the instructions at addresses 4 through 6, which calculate the output layer nodes Y of the remaining time steps (time steps 1 through 19). (The last, i.e., twentieth, execution of the instructions at addresses 7 through 11 calculates the hidden layer node Z values and writes them to column 61 of the data random access memory 122, but these values are not used.)
During the first execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW+1, WR ROW 512 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), which corresponds to time step 0, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values of column 1 of the data random access memory 122 (which were generated and written by the single execution of the instructions of addresses 1 through 3) by the weights of the row corresponding to the neural processing unit 126 in columns 512 through 1023 of the weight random access memory 124, to generate 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126. During the first execution of the instruction at address 6 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW+1, CLR ACC), a run function (e.g., S-type, tanh, rectify) is performed on the 512 accumulated values to calculate the output layer node Y values, and the results are written to column 2 of the data random access memory 122.

During the second execution of the instructions at addresses 4 and 5, which corresponds to time step 1, each of the 512 neural processing units 126 again performs 512 multiply operations, this time multiplying the 512 hidden node Z values of column 4 of the data random access memory 122 (which were generated and written by the first execution of the instructions of addresses 7 through 11) by the weights of the row corresponding to the neural processing unit 126 in columns 512 through 1023 of the weight random access memory 124 and accumulating the 512 products into the accumulator 202 of the corresponding neural processing unit 126; during the second execution of the instruction at address 6, a run function is performed on the 512 accumulated values to calculate the output layer node Y values, which are written to column 5 of the data random access memory 122. Likewise, during the third execution of the instructions at addresses 4 and 5, which corresponds to time step 2, the 512 hidden node Z values of column 7 of the data random access memory 122 (generated and written by the second execution of the instructions of addresses 7 through 11) are multiplied and accumulated in the same manner, and during the third execution of the instruction at address 6 the run function results are written to column 8 of the data random access memory 122. And so on, until the twentieth execution of the instructions at addresses 4 and 5, which corresponds to time step 19, in which the 512 hidden node Z values of column 58 of the data random access memory 122 (generated and written by the nineteenth execution of the instructions of addresses 7 through 11) are multiplied and accumulated in the same manner, and during the twentieth execution of the instruction at address 6 a run function is performed on the 512 accumulated values to calculate the output layer node Y values, which are written to column 59 of the data random access memory 122.
During the first execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 adds to its accumulator 202 the 512 content node C values of column 1 of the data random access memory 122, which were generated by the single execution of the instructions of addresses 1 through 3. More specifically, the instruction at address 7 (ADD_D_ACC DR ROW+0) instructs each of the 512 neural processing units 126 to read the corresponding text of the current column of the data random access memory 122 (column 1 during the first execution) into its multitask buffer 208 and to add the text to the accumulator 202. The instruction at address 8 (ADD_D_ACC ROTATE, COUNT=511) instructs each of the 512 neural processing units 126 to rotate the text from the adjacent neural processing unit 126 into its multitask buffer 208 (using the 512-text rotator formed by the collective operation of the 512 multitask buffers 208 of the neural network unit 121, the buffers into which the instruction at address 7 directed the column of the data random access memory 122 to be read), to add the text to the accumulator 202, and to perform these operations 511 times.

During the second execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 adds to its accumulator 202 the 512 content node C values of column 4 of the data random access memory 122, which were generated and written by the first execution of the instructions of addresses 9 through 11; during the third execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 adds to its accumulator 202 the 512 content node C values of column 7 of the data random access memory 122, which were generated and written by the second execution of the instructions of addresses 9 through 11; and so on, until the twentieth execution of the instructions at addresses 7 and 8, in which each of the 512 neural processing units 126 adds to its accumulator 202 the 512 content node C values of column 58 of the data random access memory 122, which were generated and written by the nineteenth execution of the instructions of addresses 9 through 11.
As noted above, the example of Figure 42 assumes that the weights associated with the connections from the content nodes C to the hidden layer nodes Z have a value of one. However, in another embodiment, these connections of the Elman time recurrent neural network have non-zero weight values; in that case, the weights are placed in the weight random access memory 124 (e.g., columns 1024 through 1535) before the program of Figure 42 executes, the program instruction at address 7 is MULT-ACCUM DR ROW+0, WR ROW 1024, and the program instruction at address 8 is MULT-ACCUM ROTATE, WR ROW+1, COUNT=511. Preferably, the instruction at address 8 does not access the weight random access memory 124, but instead rotates the values that the instruction at address 7 read from the weight random access memory 124 into the multitask buffers 705. Not accessing the weight random access memory 124 during the 511 clock cycles in which the address 8 instruction executes leaves more bandwidth for the framework program to access the weight random access memory 124.
During the first execution of the instructions at addresses 9 and 10 (MULT-ACCUM DR ROW+2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), which corresponds to time step 1, each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values of column 3 of the data random access memory 122 by the weights of the row corresponding to the neural processing unit 126 in columns 0 through 511 of the weight random access memory 124, to generate 512 products that, together with the accumulation of the 512 content node C values performed by the instructions of addresses 7 and 8, are accumulated into the accumulator 202 of the corresponding neural processing unit 126 to calculate the hidden layer node Z values; during the first execution of the instruction at address 11 (OUTPUT PASSTHRU, DR OUT ROW+2, CLR ACC), the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to column 4 of the data random access memory 122, and the accumulators 202 are cleared. During the second execution of the instructions at addresses 9 and 10, which corresponds to time step 2, the 512 input node D values of column 6 of the data random access memory 122 are multiplied and accumulated in the same manner, together with the accumulation of the 512 content node C values performed by the instructions of addresses 7 and 8, to calculate the hidden layer node Z values; during the second execution of the instruction at address 11, the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to column 7 of the data random access memory 122, and the accumulators 202 are cleared. And so on, until the nineteenth execution of the instructions at addresses 9 and 10, which corresponds to time step 19, in which the 512 input node D values of column 57 of the data random access memory 122 are multiplied and accumulated in the same manner, together with the accumulation of the 512 content node C values performed by the instructions of addresses 7 and 8, to calculate the hidden layer node Z values; and during the nineteenth execution of the instruction at address 11, the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to column 58 of the data random access memory 122, and the accumulators 202 are cleared. As noted above, the hidden layer node Z values generated and written by the twentieth execution of the instructions at addresses 9 and 10 are not used.
The instruction at address 12 (LOOP 4) decrements the cycle counter 3804 and, if the new cycle counter 3804 value is greater than zero, returns to the instruction at address 4.
Figure 43 is a block diagram showing an example of a Jordan time recurrent neural network. The Jordan time recurrent neural network of Figure 43 is similar to the Elman time recurrent neural network of Figure 40 in that it has input layer nodes/neurons D, hidden layer nodes/neurons Z, output layer nodes/neurons Y, and content layer nodes/neurons C. However, in the Jordan time recurrent neural network of Figure 43, the content layer nodes C have their input connections fed back from the outputs of their corresponding output layer nodes Y, rather than from the outputs of the hidden layer nodes Z as in the Elman time recurrent neural network of Figure 40.
For purposes of the present invention, a Jordan time recurrent neural network is a time recurrent neural network comprising at least an input node layer, a hidden node layer, an output node layer, and a content node layer. At the beginning of a given time step, the content node layer stores the results that the output node layer generated in the previous time step and fed back to the content node layer. The results fed back to the content layer may be the results of the run function, or they may be the results of the accumulation operations performed by the output node layer without a run function being performed.
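Summarizing the above as equations (again, the weight matrix symbols are illustrative only, and f and g are the hidden and output layer run functions):

    Z_t = f( W_{DZ} D_t + W_{CZ} C_t )
    Y_t = g( W_{ZY} Z_t )
    C_{t+1} = Y_t ,   C_0 = 0

In the embodiment of Figure 44 described next, the value fed back to C_{t+1} is the accumulated value before the run function g is applied, rather than Y_t itself.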
Figure 44 is a block diagram showing an example of the layout of data in the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 as the neural network unit 121 performs the calculations associated with the Jordan time recurrent neural network of Figure 43. The example of Figure 44 assumes that the Jordan time recurrent neural network of Figure 43 has 512 input nodes D, 512 hidden nodes Z, 512 content nodes C, and 512 output nodes Y. It further assumes that the Jordan time recurrent neural network is fully connected, i.e., all 512 input nodes D link each hidden node Z as input, all 512 content nodes C link each hidden node Z as input, and all 512 hidden nodes Z link each output node Y as input. Although the example Jordan time recurrent neural network of Figure 44 performs a run function on the accumulator 202 values to generate the output layer node Y values, this example assumes that the accumulator 202 values before the run function is performed, rather than the actual output node Y values, are passed to the content layer nodes C. In addition, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, for example in a wide configuration. Finally, this example assumes that the weights associated with the connections from the content nodes C to the hidden nodes Z all have the value 1, so these unity weight values need not be stored.
As in the example of Figure 41, and as shown in the figure, the lower 512 columns (columns 0 through 511) of the weight random access memory 124 hold the weight values associated with the connections between the input nodes D and the hidden nodes Z, and the subsequent 512 columns (columns 512 through 1023) of the weight random access memory 124 hold the weight values associated with the connections between the hidden nodes Z and the output nodes Y.
The data random access memory 122 holds the Jordan time recurrent neural network node values for a series of time steps, similarly to the example of Figure 41; however, in the example of Figure 44, the node values for a given time step are held in a group of four columns. As shown in the figure, in an embodiment of the data random access memory 122 with 64 columns, the data random access memory 122 can hold the node values needed for 15 different time steps. In the example of Figure 44, columns 0 through 3 hold the node values used by time step 0, columns 4 through 7 hold the node values used by time step 1, and so on, through columns 60 through 63, which hold the node values used by time step 15. The first column of each group of four columns holds the values of the input nodes D of that time step. The second column of each group holds the values of the hidden nodes Z of that time step. The third column of each group holds the values of the content nodes C of that time step. The fourth column of each group holds the values of the output nodes Y of that time step. As shown in the figure, each row of the data random access memory 122 holds the node values of its corresponding neuron, or neural processing unit 126. That is, row 0 holds the node values associated with nodes D0, Z0, C0 and Y0, whose calculations are performed by neural processing unit 0; row 1 holds the node values associated with nodes D1, Z1, C1 and Y1, whose calculations are performed by neural processing unit 1; and so on, through row 511, which holds the node values associated with nodes D511, Z511, C511 and Y511, whose calculations are performed by neural processing unit 511. This is described in more detail below with respect to Figure 45.

The content node C values of a given time step in Figure 44 are generated in that time step and serve as inputs for the next time step. That is, the value of node C that a neural processing unit 126 calculates and writes during a time step becomes the value of node C that the neural processing unit 126 uses (along with the value of the input node D of the next time step) to calculate the value of node Z in the next time step. The initial value of the content nodes C (that is, the value of node C used in time step 0 to calculate the value of node Z in column 1) is assumed to be zero. This is described in more detail in the sections below relating to the nand architecture program of Figure 45.
As described above with respect to Figure 41, preferably, the input node D values (the values in columns 0, 4, and so on through column 60 in the example of Figure 44) are written/filled into the data random access memory 122 by the framework program executing on the processor 100 via MTNN instructions 1400, and are read/used by the nand architecture program executing on the neural network unit 121, such as the nand architecture program of Figure 45. Conversely, the hidden node Z/content node C/output node Y values (in the example of Figure 44, the values in columns 1/2/3, 5/6/7, and so on through columns 61/62/63, respectively) are written/filled into the data random access memory 122 by the nand architecture program executing on the neural network unit 121, and are read/used by the framework program executing on the processor 100 via MFNN instructions 1500. The example of Figure 44 assumes that the framework program performs the following steps: (1) for 15 different time steps, fill the data random access memory 122 with the values of the input nodes D (columns 0, 4, and so on through column 60); (2) start the nand architecture program of Figure 45; (3) detect whether the nand architecture program has finished; (4) read the values of the output nodes Y from the data random access memory 122 (columns 3, 7, and so on through column 63); and (5) repeat steps (1) through (4) as many times as needed to complete the task, for example the calculations required to recognize the speech of a mobile phone user.
In another approach, the framework program performs the following steps: (1) for a single time step, fill the data random access memory 122 with the values of the input nodes D (e.g., column 0); (2) start the nand architecture program (a modified version of the Figure 45 nand architecture program that does not loop and accesses only a single group of four columns of the data random access memory 122); (3) detect whether the nand architecture program has finished; (4) read the values of the output nodes Y from the data random access memory 122 (e.g., column 3); and (5) repeat steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is better may depend on the manner in which the input values of the time recurrent neural network are sampled. For example, if the task permits sampling the input over multiple time steps (e.g., on the order of 15 time steps) and performing the calculations afterwards, the first approach may be preferable, since it likely brings more computational resource efficiency and/or better performance; whereas if the task only permits sampling in a single time step, the second approach is required.

A third embodiment is similar to the second approach but, rather than using a single group of four columns of the data random access memory 122, the nand architecture program of this approach uses multiple groups of four columns, that is, a different group of four columns for each time step, similar to the first approach. In this third embodiment, preferably, the framework program includes a step before step (2) in which it updates the nand architecture program before starting it, for example by updating the data random access memory 122 column in the instruction at address 1 to point to the next group of four columns.
Figure 45 is a table showing a program stored in the program memory 129 of the neural network unit 121 that is executed by the neural network unit 121 and that uses data and weights according to the arrangement of Figure 44 to implement the Jordan time recurrent neural network. The nand architecture program of Figure 45 is similar to the nand architecture program of Figure 42; their differences are explained in the relevant sections of this description.

The example program of Figure 45 includes 14 nand architecture instructions, at addresses 0 through 13. The instruction at address 0 is an initialization instruction that clears the accumulator 202 and initializes the cycle counter 3804 to the value 15 in order to perform the loop group (the instructions of addresses 4 through 12) 15 times. Preferably, the initialization instruction also places the neural network unit 121 in a wide configuration so that it is configured as 512 neural processing units 126. As described herein, during the execution of the instructions of addresses 1 through 3 and addresses 8 through 12, the 512 neural processing units 126 operate as the 512 corresponding hidden layer nodes Z, and during the execution of the instructions of addresses 4, 5 and 7, the 512 neural processing units 126 operate as the 512 corresponding output layer nodes Y.
The instructions at addresses 1 through 5 and address 7 are the same as the instructions at addresses 1 through 6 of Figure 42 and have the same function. The instructions at addresses 1 through 3 calculate the initial values of the hidden layer nodes Z and write them to column 1 of the data random access memory 122 for use by the first execution of the instructions of addresses 4, 5 and 7, which calculate the output layer nodes Y of the first time step (time step 0).

During the first execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions of addresses 4 and 5 (which are subsequently used by the output instruction of address 7 to calculate and write the output layer node Y values) are passed through and written to column 2 of the data random access memory 122; these are the content layer node C values generated in the first time step (time step 0) and used in the second time step (time step 1). During the second execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions of addresses 4 and 5 (which are subsequently used by the output instruction of address 7 to calculate and write the output layer node Y values) are passed through and written to column 6 of the data random access memory 122; these are the content layer node C values generated in the second time step (time step 1) and used in the third time step (time step 2). And so on, until the fifteenth execution of the output instruction at address 6, in which the 512 accumulator 202 values accumulated by the instructions of addresses 4 and 5 (which are subsequently used by the output instruction of address 7 to calculate and write the output layer node Y values) are passed through and written to column 58 of the data random access memory 122; these are the content layer node C values generated in the fifteenth time step (time step 14) (they are read by the instruction at address 8, but are not used).
The instructions of addresses 8 to 12 are roughly the same as the instructions of addresses 7 to 11 in Figure 42 and have the same functions, with one difference. The difference is that in Figure 45 the instruction of address 8 (ADD_D_ACC DR ROW+1) increments the data random access memory 122 column number by one, whereas in Figure 42 the instruction of address 7 (ADD_D_ACC DR ROW+0) increments it by zero. This difference is caused by the difference in the data configuration in the data random access memory 122, in particular that the group of four columns in Figure 44 includes a separate column for the content layer node C values (e.g., columns 2, 6, 10, etc.), whereas the group of three columns in Figure 41 has no such separate column and instead lets the content layer node C values and the hidden layer node Z values share the same column (e.g., columns 1, 4, 7, etc.). The fifteen executions of the instructions of addresses 8 to 12 compute the hidden layer node Z values and write them to the data random access memory 122 (columns 5, 9, 13, and so on to column 57) for use by the second through fifteenth executions of the instructions of addresses 4, 5 and 7 in computing the output layer node Y values of the second through fifteenth time steps (time steps 1 to 14). (The last, i.e. fifteenth, execution of the instructions of addresses 8 to 12 computes the hidden layer node Z values and writes them to column 61 of the data random access memory 122, but those values are not used.)
The loop instruction of address 13 decrements the cycle counter 3804 and, if the new cycle counter 3804 value is greater than zero, returns to the instruction of address 4.
In another embodiment, the Jordan time recurrent neural network is designed so that the content nodes C load the run function values of the output nodes Y, that is, the accumulated values after the run function has been applied. In that embodiment, because the values of the output nodes Y are identical to the values of the content nodes C, the nand architecture instruction of address 6 is not included in the nand architecture program. The number of columns used in the data random access memory 122 can thereby be reduced. More precisely, each of the columns of Figure 44 that loads content node C values (e.g., columns 2, 6, 59) is absent in this embodiment. In addition, each time step of this embodiment needs only three columns of the data random access memory 122, so that 20 time steps can be accommodated rather than 15, and the addresses of the instructions of the nand architecture program of Figure 45 are adjusted accordingly.
Shot and long term memory cell
Shot and long term memory cells are a concept well known in the art of time recurrent neural networks. See, for example, Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, Neural Computation, November 15, 1997, Vol. 9, No. 8, pages 1735-1780; and Learning to Forget: Continual Prediction with LSTM, Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins, Neural Computation, October 2000, Vol. 12, No. 10, pages 2451-2471. These documents are available from MIT Press Journals. Shot and long term memory cells may be configured in many different forms. The shot and long term memory cell 4600 of Figure 46, described below, is modeled on the shot and long term memory cell described in the tutorial entitled LSTM Networks for Sentiment Analysis at http://deeplearning.net/tutorial/lstm.html, a copy of which was downloaded on October 19, 2015 (hereinafter the "shot and long term memory tutorial") and submitted in an information disclosure statement for the US application of this case. The shot and long term memory cell 4600 is used to describe generally the ability of the embodiments of the neural network unit 121 described herein to perform efficiently computations associated with shot and long term memory. It is worth noting that these embodiments of the neural network unit 121, including the embodiment of Figure 49, can also perform efficiently computations associated with shot and long term memory cells other than the one described in Figure 46.
Preferably, the neural network unit 121 can be used to perform computations for a time recurrent neural network that has a shot and long term memory cell layer linked to other layers. For example, in the shot and long term memory tutorial, the network includes a mean pooling layer that receives the outputs (H) of the shot and long term memory cells of the shot and long term memory layer, and a logistic regression layer that receives the output of the mean pooling layer.
Figure 46 is a block diagram showing an embodiment of a shot and long term memory cell 4600.
As shown in the figure, the shot and long term memory cell 4600 includes a memory cell input (X), a memory cell output (H), an input lock (I), an output lock (O), a forget lock (F), a memory cell state (C) and a candidate memory cell state (C'). The input lock (I) gates the passage of the memory cell input (X) into the memory cell state (C), and the output lock (O) gates the passage of the memory cell state (C) to the memory cell output (H). The memory cell state (C) computed for a time step is fed back for use in the computation of the next time step. The forget lock (F) gates this fed-back memory cell state (C), which, combined with the gated candidate memory cell state (C'), becomes the memory cell state (C) of the next time step.
The embodiment of Figure 46 calculates aforementioned various different numerical value using following equalities:
(1) I=SIGMOID (Wi*X+Ui*H+Bi)
(2) F=SIGMOID (Wf*X+Uf*H+Bf)
(3) C '=TANH (Wc*X+Uc*H+Bc)
(4) C=I*C '+F*C
(5) O=SIGMOID (Wo*X+Uo*H+Bo)
(6) H=O*TANH (C)
Wi and Ui are weight values associated with the input lock (I), and Bi is a bias value associated with the input lock (I). Wf and Uf are weight values associated with the forget lock (F), and Bf is a bias value associated with the forget lock (F). Wo and Uo are weight values associated with the output lock (O), and Bo is a bias value associated with the output lock (O). As shown above, equations (1), (2) and (5) compute the input lock (I), the forget lock (F) and the output lock (O), respectively. Equation (3) computes the candidate memory cell state (C'), and equation (4) computes the new memory cell state (C) taking as inputs the candidate memory cell state (C') and the current memory cell state (C), i.e., the memory cell state (C) of the current time step. Equation (6) computes the memory cell output (H). However, the present invention is not limited in this respect; embodiments of shot and long term memory cells that compute the input lock, the forget lock, the output lock, the candidate memory cell state, the memory cell state and the memory cell output in other ways are also contemplated by the present invention.
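For reference, the following is a minimal sketch, in Python with NumPy, of one time step of the computation of equations (1) to (6); the function and variable names are illustrative only and are not part of the embodiments described herein.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_cell_step(X, H_prev, C_prev, Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo, Bo):
        # All arguments are per-cell values (or arrays, one element per cell).
        I  = sigmoid(Wi * X + Ui * H_prev + Bi)     # equation (1): input lock
        F  = sigmoid(Wf * X + Uf * H_prev + Bf)     # equation (2): forget lock
        Cc = np.tanh(Wc * X + Uc * H_prev + Bc)     # equation (3): candidate state C'
        C  = I * Cc + F * C_prev                    # equation (4): new memory cell state
        O  = sigmoid(Wo * X + Uo * H_prev + Bo)     # equation (5): output lock
        H  = O * np.tanh(C)                         # equation (6): memory cell output
        return H, C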
For the purposes of the present invention, a shot and long term memory cell includes a memory cell input, a memory cell output, a memory cell state, a candidate memory cell state, an input lock, an output lock and a forget lock. For each time step, the input lock, the output lock, the forget lock and the candidate memory cell state are functions of the memory cell input of the current time step, the memory cell output of the previous time step, and the associated weights. The memory cell state of the time step is a function of the memory cell state of the previous time step, the candidate memory cell state, the input lock and the forget lock. In this sense, the memory cell state is fed back for use in computing the memory cell state of the next time step. The memory cell output of the time step is a function of the memory cell state computed for the time step and of the output lock. A shot and long term memory neural network is a neural network that has a layer of shot and long term memory cells.
Figure 47 is a block diagram showing an example of the data configuration in the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 when the neural network unit 121 performs the computations associated with the layer of 128 shot and long term memory cells 4600 of the shot and long term memory neural network of Figure 46. In the example of Figure 47, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, for example in the wide configuration; however, only the values produced by 128 of the neural processing units 126 (namely neural processing units 0 to 127) are used, because the shot and long term memory layer of this example has only 128 shot and long term memory cells 4600.
As shown in the figure, the weight random access memory 124 loads the weight values, bias values and intermediate values for the corresponding neural processing units 0 to 127 of the neural network unit 121; rows 0 to 127 of the weight random access memory 124 hold these values for neural processing units 0 to 127, respectively. Each of columns 0 to 14 loads 128 of the following values, corresponding to the foregoing equations (1) to (6), for supply to neural processing units 0 to 127: Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, C', TANH(C), C, Wo, Uo, Bo. Preferably, the weight and bias values - Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo, Bo (located in columns 0 to 8 and columns 12 to 14) - are written/filled into the weight random access memory 124 by the framework program executing on the processor 100 through MTNN instructions 1400, and are read/used by the nand architecture program executing on the neural network unit 121, such as the nand architecture program of Figure 48. Preferably, the intermediate values - C', TANH(C), C (located in columns 9 to 11) - are written into and read back from the weight random access memory 124 by the nand architecture program executing on the neural network unit 121, as described below.
As shown in the figure, the data random access memory 122 loads input (X), output (H), input lock (I), forget lock (F) and output lock (O) values for a sequence of time steps. More precisely, each group of five columns of this memory loads the X, H, I, F and O values for a given time step. Taking a data random access memory 122 having 64 columns as an example, as shown in the figure, this data random access memory 122 can load the memory cell values used for 12 different time steps. In the example of Figure 47, columns 0 to 4 load the memory cell values used for time step 0, columns 5 to 9 load the memory cell values used for time step 1, and so on, with columns 55 to 59 loading the memory cell values used for time step 11. The first column of the group of five columns loads the X values of the time step. The second column of the group loads the H values of the time step. The third column of the group loads the I values of the time step. The fourth column of the group loads the F values of the time step. The fifth column of the group loads the O values of the time step. As shown in the figure, each row of the data random access memory 122 loads values used by the corresponding neuron, that is, the corresponding neural processing unit 126. That is, row 0 loads the values associated with shot and long term memory cell 0, whose computations are performed by neural processing unit 0; row 1 loads the values associated with shot and long term memory cell 1, whose computations are performed by neural processing unit 1; and so on, with row 127 loading the values associated with shot and long term memory cell 127, whose computations are performed by neural processing unit 127, as described in detail below with respect to Figure 48.
Preferably, the X values (located in columns 0, 5, 10, and so on to column 55) are written/filled into the data random access memory 122 by the framework program executing on the processor 100 through MTNN instructions 1400, and are read/used by the nand architecture program executing on the neural network unit 121, such as the nand architecture program of Figure 48. Preferably, the I, F and O values (located in columns 2/3/4, 7/8/9, 12/13/14, and so on to columns 57/58/59) are written/filled into the data random access memory 122 by the nand architecture program executing on the neural network unit 121, as described below. Preferably, the H values (located in columns 1, 6, 11, and so on to column 56) are written/filled into the data random access memory 122 and read/used by the nand architecture program executing on the neural network unit 121, and are also read by the framework program executing on the processor 100 through MFNN instructions 1500.
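A minimal sketch, in Python, of the column indexing implied by this configuration; the helper name is illustrative only.

    def fig47_dram_columns(time_step):
        # Each time step occupies one group of five data RAM 122 columns,
        # in the order X, H, I, F, O (Figure 47).
        base = 5 * time_step
        return {'X': base, 'H': base + 1, 'I': base + 2,
                'F': base + 3, 'O': base + 4}

    # Example: time step 11 maps to columns 55 (X), 56 (H), 57 (I), 58 (F), 59 (O).
    assert fig47_dram_columns(11) == {'X': 55, 'H': 56, 'I': 57, 'F': 58, 'O': 59}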
The example of Figure 47 assumes that the framework program performs the following steps: (1) for 12 different time steps, fill the data random access memory 122 with the input X values (columns 0, 5, and so on to column 55); (2) start the nand architecture program of Figure 48; (3) detect whether the nand architecture program has finished; (4) read the output H values from the data random access memory 122 (columns 1, 6, and so on to column 56); and (5) repeat steps (1) to (4) as many times as needed to complete a task, such as the computations required to recognize the speech of a mobile phone user.
In another manner of execution, the framework program performs the following steps: (1) for a single time step, fill the data random access memory 122 with the input X values (e.g., column 0); (2) start the nand architecture program (a modified version of the Figure 48 nand architecture program that does not loop and accesses only a single group of five columns of the data random access memory 122); (3) detect whether the nand architecture program has finished; (4) read the output H values from the data random access memory 122 (e.g., column 1); and (5) repeat steps (1) to (4) as many times as needed to complete the task. Which of these two ways is preferable depends on the manner in which the input X values to the shot and long term memory layer are sampled. For example, if the task allows the input to be sampled over multiple time steps (e.g., on the order of 12 time steps) before performing the computations, the first way may be preferable, since it is likely to be more efficient in computing resources and/or higher in performance; whereas if the task only allows sampling at a single time step, the second way must be used.
A third embodiment is similar to the second way above; however, rather than using a single group of five columns of the data random access memory 122, the nand architecture program of this way uses multiple groups of five columns, that is, a different group of five columns for each time step, which is similar to the first way. In this third embodiment, the framework program preferably includes a step before step (2) in which it updates the nand architecture program before the nand architecture program starts, for example by updating the data random access memory 122 column in the instruction of address 0 to point to the next group of five columns.
Figure 48 is a table showing a program stored in the program memory 129 of the neural network unit 121; this program is executed by the neural network unit 121 and uses the data and weights according to the configuration of Figure 47 to perform the computations associated with the shot and long term memory cell layer. The example program of Figure 48 includes 24 nand architecture instructions located at addresses 0 to 23. The instruction of address 0 (INITIALIZE NPU, CLR ACC, LOOPCNT=12, DR IN ROW=-1, DR OUT ROW=2) clears the accumulator 202 and initializes the cycle counter 3804 to the value 12, to perform the loop group (the instructions of addresses 1 to 22) 12 times. This initialization instruction also initializes the data random access memory 122 column to be read to the value -1, which the first execution of the instruction of address 1 increments to zero. The initialization instruction also initializes the data random access memory 122 column to be written (e.g., the buffer 2606 of Figures 26 and 39) to column 2. Preferably, this initialization instruction also places the neural network unit 121 in the wide configuration, so that the neural network unit 121 is configured with 512 neural processing units 126. As described in the sections below, during execution of the instructions of addresses 0 to 23, 128 of these 512 neural processing units 126 correspond to and operate as the 128 shot and long term memory cells 4600.
In the first execution of the instructions of addresses 1 to 4, each of the 128 neural processing units 126 (i.e., neural processing units 0 to 127) computes the input lock (I) value of its corresponding shot and long term memory cell 4600 for the first time step (time step 0) and writes the I value to the corresponding text of column 2 of the data random access memory 122; in the second execution of the instructions of addresses 1 to 4, each of the 128 neural processing units 126 computes the I value of its corresponding shot and long term memory cell 4600 for the second time step (time step 1) and writes the I value to the corresponding text of column 7 of the data random access memory 122; and so on, until the twelfth execution of the instructions of addresses 1 to 4, in which each of the 128 neural processing units 126 computes the I value of its corresponding shot and long term memory cell 4600 for the twelfth time step (time step 11) and writes the I value to the corresponding text of column 57 of the data random access memory 122, as shown in Figure 47.
More precisely, the multiply-accumulate instruction of address 1 reads the next column after the current column of the data random access memory 122 (column 0 in the first execution, column 5 in the second execution, and so on, column 55 in the twelfth execution), which contains the memory cell input (X) values associated with the current time step; the instruction also reads column 0 of the weight random access memory 124, which contains the Wi values, and multiplies the read values to produce a first product that is accumulated into the accumulator 202, which was just cleared by the initialization instruction of address 0 or by the instruction of address 22. Next, the multiply-accumulate instruction of address 2 reads the next data random access memory 122 column (column 1 in the first execution, column 6 in the second execution, and so on, column 56 in the twelfth execution), which contains the memory cell output (H) values associated with the current time step; the instruction also reads column 1 of the weight random access memory 124, which contains the Ui values, and multiplies the values to produce a second product that is accumulated into the accumulator 202. The H values associated with the current time step, read from the data random access memory 122 by the instruction of address 2 (and by the instructions of addresses 6, 10 and 18), were produced in the previous time step and written into the data random access memory 122 by the output instruction of address 22; however, in the first execution, the instruction of address 2 reads column 1 of the data random access memory, which holds an initial value used as the H value. Preferably, the framework program writes the initial H values into column 1 of the data random access memory 122 (for example using MTNN instructions 1400) before starting the nand architecture program of Figure 48; however, the present invention is not limited in this respect, and other embodiments in which the nand architecture program includes an initialization instruction that writes the initial H values into column 1 of the data random access memory 122 are also within the scope of the present invention. In one embodiment, the initial H values are zero. Next, the add-weight-text-to-accumulator instruction of address 3 (ADD_W_ACC WR ROW 2) reads column 2 of the weight random access memory 124, which contains the Bi values, and adds them to the accumulator 202. Finally, the output instruction of address 4 (OUTPUT SIGMOID, DR OUT ROW+0, CLR ACC) performs a sigmoid run function on the accumulator 202 value, writes the result to the current output column of the data random access memory 122 (column 2 in the first execution, column 7 in the second execution, and so on, column 57 in the twelfth execution), and clears the accumulator 202.
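A minimal sketch, in Python, of what one neural processing unit 126 accumulates across the instructions of addresses 1 to 4; the function and variable names are illustrative only.

    import math

    def input_lock_value(X, H_prev, Wi, Ui, Bi):
        acc = 0.0                              # accumulator 202, cleared by address 0 or 22
        acc += Wi * X                          # address 1: X from data RAM 122, Wi from weight RAM 124 column 0
        acc += Ui * H_prev                     # address 2: H from data RAM 122, Ui from weight RAM 124 column 1
        acc += Bi                              # address 3: Bi from weight RAM 124 column 2
        I = 1.0 / (1.0 + math.exp(-acc))       # address 4: sigmoid run function
        return I                               # address 4 writes I to the current data RAM 122 output column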
In the first execution of the instructions of addresses 5 to 8, each of the 128 neural processing units 126 computes the forget lock (F) value of its corresponding shot and long term memory cell 4600 for the first time step (time step 0) and writes the F value to the corresponding text of column 3 of the data random access memory 122; in the second execution of the instructions of addresses 5 to 8, each of the 128 neural processing units 126 computes the F value of its corresponding shot and long term memory cell 4600 for the second time step (time step 1) and writes the F value to the corresponding text of column 8 of the data random access memory 122; and so on, until the twelfth execution of the instructions of addresses 5 to 8, in which each of the 128 neural processing units 126 computes the F value of its corresponding shot and long term memory cell 4600 for the twelfth time step (time step 11) and writes the F value to the corresponding text of column 58 of the data random access memory 122, as shown in Figure 47. The instructions of addresses 5 to 8 compute the F values in a manner similar to the instructions of addresses 1 to 4 described above, except that the instructions of addresses 5 to 7 read the Wf, Uf and Bf values from columns 3, 4 and 5, respectively, of the weight random access memory 124 to perform the multiply and/or add operations.
In the twelve executions of the instructions of addresses 9 to 12, each of the 128 neural processing units 126 computes the candidate memory cell state (C') value of the corresponding time step for its corresponding shot and long term memory cell 4600 and writes the C' value to the corresponding text of column 9 of the weight random access memory 124. The instructions of addresses 9 to 12 compute the C' values in a manner similar to the instructions of addresses 1 to 4, except that the instructions of addresses 9 to 11 read the Wc, Uc and Bc values from columns 6, 7 and 8, respectively, of the weight random access memory 124 to perform the multiply and/or add operations. In addition, the output instruction of address 12 performs a hyperbolic tangent run function rather than a sigmoid run function (as the output instruction of address 4 does).
More precisely, the multiply-accumulate instruction of address 9 reads the current column of the data random access memory 122 (column 0 in the first execution, column 5 in the second execution, and so on, column 55 in the twelfth execution), which contains the memory cell input (X) values associated with the current time step; the instruction also reads column 6 of the weight random access memory 124, which contains the Wc values, and multiplies the values to produce a first product that is accumulated into the accumulator 202, which was just cleared by the instruction of address 8. Next, the multiply-accumulate instruction of address 10 reads the next column of the data random access memory 122 (column 1 in the first execution, column 6 in the second execution, and so on, column 56 in the twelfth execution), which contains the memory cell output (H) values associated with the current time step; the instruction also reads column 7 of the weight random access memory 124, which contains the Uc values, and multiplies the values to produce a second product that is accumulated into the accumulator 202. Next, the add-weight-text-to-accumulator instruction of address 11 reads column 8 of the weight random access memory 124, which contains the Bc values, and adds them to the accumulator 202. Finally, the output instruction of address 12 (OUTPUT TANH, WR OUT ROW 9, CLR ACC) performs a hyperbolic tangent run function on the accumulator 202 value, writes the result to column 9 of the weight random access memory 124, and clears the accumulator 202.
In the twelve executions of the instructions of addresses 13 to 16, each of the 128 neural processing units 126 computes the new memory cell state (C) value of the corresponding time step for its corresponding shot and long term memory cell 4600 and writes the new C value to the corresponding text of column 11 of the weight random access memory 124; each neural processing unit 126 also computes tanh(C) and writes it to the corresponding text of column 10 of the weight random access memory 124. More precisely, the multiply-accumulate instruction of address 13 reads the next column after the current column of the data random access memory 122 (column 2 in the first execution, column 7 in the second execution, and so on, column 57 in the twelfth execution), which contains the input lock (I) values associated with the current time step; the instruction also reads column 9 of the weight random access memory 124, which contains the candidate memory cell state (C') values (just written by the instruction of address 12), and multiplies the values to produce a first product that is accumulated into the accumulator 202, which was just cleared by the instruction of address 12. Next, the multiply-accumulate instruction of address 14 reads the next column of the data random access memory 122 (column 3 in the first execution, column 8 in the second execution, and so on, column 58 in the twelfth execution), which contains the forget lock (F) values associated with the current time step; the instruction also reads column 11 of the weight random access memory 124, which contains the current memory cell state (C) values computed in the previous time step (written by the most recent execution of the instruction of address 15), and multiplies the values to produce a second product that is accumulated into the accumulator 202. Next, the output instruction of address 15 (OUTPUT PASSTHRU, WR OUT ROW 11) passes the accumulator 202 value through and writes it to column 11 of the weight random access memory 124. It should be understood that the C values read from column 11 of the weight random access memory 124 by the instruction of address 14 are the C values produced and written by the most recent execution of the instructions of addresses 13 to 15. The output instruction of address 15 does not clear the accumulator 202, so that its value can be used by the instruction of address 16. Finally, the output instruction of address 16 (OUTPUT TANH, WR OUT ROW 10, CLR ACC) performs a hyperbolic tangent run function on the accumulator 202 value and writes the result to column 10 of the weight random access memory 124 for use by the instruction of address 21 in computing the memory cell output (H) values. The instruction of address 16 clears the accumulator 202.
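A minimal sketch, in Python, of the per-cell arithmetic performed by the instructions of addresses 13 to 16; the names are illustrative only.

    import math

    def update_cell_state(I, F, C_prev, C_cand):
        acc = 0.0
        acc += I * C_cand          # address 13: I from data RAM 122, C' from weight RAM 124 column 9
        acc += F * C_prev          # address 14: F from data RAM 122, previous C from weight RAM 124 column 11
        C = acc                    # address 15: pass-through output, written to weight RAM 124 column 11
        tanh_C = math.tanh(acc)    # address 16: tanh output, written to weight RAM 124 column 10
        return C, tanh_C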
In the first execution of the instructions of addresses 17 to 20, each of the 128 neural processing units 126 computes the output lock (O) value of its corresponding shot and long term memory cell 4600 for the first time step (time step 0) and writes the O value to the corresponding text of column 4 of the data random access memory 122; in the second execution of the instructions of addresses 17 to 20, each of the 128 neural processing units 126 computes the O value of its corresponding shot and long term memory cell 4600 for the second time step (time step 1) and writes the O value to the corresponding text of column 9 of the data random access memory 122; and so on, until the twelfth execution of the instructions of addresses 17 to 20, in which each of the 128 neural processing units 126 computes the O value of its corresponding shot and long term memory cell 4600 for the twelfth time step (time step 11) and writes the O value to the corresponding text of column 59 of the data random access memory 122, as shown in Figure 47. The instructions of addresses 17 to 20 compute the O values in a manner similar to the instructions of addresses 1 to 4, except that the instructions of addresses 17 to 19 read the Wo, Uo and Bo values from columns 12, 13 and 14, respectively, of the weight random access memory 124 to perform the multiply and/or add operations.
In the first execution of the instructions of addresses 21 to 22, each of the 128 neural processing units 126 computes the memory cell output (H) value of its corresponding shot and long term memory cell 4600 for the first time step (time step 0) and writes the H value to the corresponding text of column 6 of the data random access memory 122; in the second execution of the instructions of addresses 21 to 22, each of the 128 neural processing units 126 computes the H value of its corresponding shot and long term memory cell 4600 for the second time step (time step 1) and writes the H value to the corresponding text of column 11 of the data random access memory 122; and so on, until the twelfth execution of the instructions of addresses 21 to 22, in which each of the 128 neural processing units 126 computes the H value of its corresponding shot and long term memory cell 4600 for the twelfth time step (time step 11) and writes the H value to the corresponding text of column 61 of the data random access memory 122, as shown in Figure 47.
More precisely, the multiply-accumulate instruction of address 21 reads the third column after the current column of the data random access memory 122 (column 4 in the first execution, column 9 in the second execution, and so on, column 59 in the twelfth execution), which contains the output lock (O) values associated with the current time step; the instruction also reads column 10 of the weight random access memory 124, which contains the tanh(C) values (written by the instruction of address 16), and multiplies the values to produce a product that is accumulated into the accumulator 202, which was just cleared by the instruction of address 20. Then, the output instruction of address 22 passes the accumulator 202 value through, writes it to the data random access memory 122 column two past the current output column (column 6 in the first execution, column 11 in the second execution, and so on, column 61 in the twelfth execution), and clears the accumulator 202. It should be understood that the H values written to a data random access memory 122 column by the instruction of address 22 (column 6 in the first execution, column 11 in the second execution, and so on, column 61 in the twelfth execution) are the H values consumed/read by the subsequent executions of the instructions of addresses 2, 6, 10 and 18. However, the H values written to column 61 in the twelfth execution are not consumed/read by executions of the instructions of addresses 2, 6, 10 and 18; for a preferred embodiment, these values are instead consumed/read by the framework program.
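A minimal sketch, in Python, of the per-cell arithmetic performed by the instructions of addresses 21 to 22; the names are illustrative only.

    def memory_cell_output(O, tanh_C):
        acc = 0.0
        acc += O * tanh_C   # address 21: O from data RAM 122, tanh(C) from weight RAM 124 column 10
        H = acc             # address 22: pass-through output, written to the H column of the data RAM 122
        return H            # read back as the previous H by addresses 2, 6, 10 and 18 in the next time step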
The loop instruction of address 23 (LOOP 1) decrements the cycle counter 3804 and, if the new cycle counter 3804 value is greater than zero, returns to the instruction of address 1.
Figure 49 is a block diagram showing an embodiment of the neural network unit 121 in which output buffer masking and feedback capability are provided within groups of neural processing units. Figure 49 shows a single neural processing unit group 4901 made up of four neural processing units 126. Although Figure 49 shows only a single neural processing unit group 4901, it should be understood that every neural processing unit 126 of the neural network unit 121 is contained in a neural processing unit group 4901, so that there are N/J neural processing unit groups 4901 in total, where N is the number of neural processing units 126 (for example, 512 in the wide configuration and 1024 in the narrow configuration) and J is the number of neural processing units 126 in a single group 4901 (four, for example, in the embodiment of Figure 49). The four neural processing units 126 in the neural processing unit group 4901 of Figure 49 are referred to as neural processing unit 0, neural processing unit 1, neural processing unit 2 and neural processing unit 3.
Each neural processing unit in the embodiment of Figure 49 is similar to the neural processing unit 126 of Figure 7 above, and elements with the same labels in the figures are also similar. However, the multitask buffer 208 is adjusted to include four additional inputs 4905, the multitask buffer 705 is adjusted to include four additional inputs 4907, the select input 213 is adjusted so that it selects among the original inputs 211 and 207 and the additional inputs 4905 for provision to the output 209, and the select input 713 is adjusted so that it selects among the original inputs 711 and 206 and the additional inputs 4907 for provision to the output 203.
As shown in the figure, the column buffer 1104 of Figure 11 is the output buffer 1104 of Figure 49. More precisely, the texts 0, 1, 2 and 3 of the output buffer 1104 shown in the figure receive the corresponding outputs of the four run function units 212 associated with neural processing units 0, 1, 2 and 3. The portion of the output buffer 1104 corresponding to a neural processing unit group 4901 comprises the texts of that group and is referred to as an output buffer text group; in the embodiment of Figure 49, each such group has four texts. These four texts of the output buffer 1104 are fed back to the multitask buffers 208 and 705, being received as the four additional inputs 4905 by the multitask buffer 208 and as the four additional inputs 4907 by the multitask buffer 705. The feedback of an output buffer text group to its corresponding neural processing unit group 4901 enables an arithmetic instruction of the nand architecture program to select one or two of the texts of the output buffer 1104 associated with the neural processing unit group 4901 (i.e., of the output buffer text group) as its input; examples are found in the nand architecture program of Figure 51 below, for example the instructions of addresses 4, 8, 11, 12 and 15 in that figure. That is, the output buffer 1104 text specified in the nand architecture instruction determines the value generated on the select input 213/713. This capability effectively allows the output buffer 1104 to serve as a scratch pad memory in a hierarchy, allowing the nand architecture program to reduce the number of writes to, and subsequent reads from, the data random access memory 122 and/or the weight random access memory 124, for example by reducing the number of intermediate values produced and used along the way. Preferably, the output buffer 1104, or column buffer 1104, comprises a one-dimensional register array for storing 1024 narrow texts or 512 wide texts. Preferably, a read of the output buffer 1104 can be performed within a single clock cycle, and a write to the output buffer 1104 can likewise be performed within a single clock cycle. Unlike the data random access memory 122 and the weight random access memory 124, which can be accessed by both the framework program and the nand architecture program, the output buffer 1104 cannot be accessed by the framework program and can only be accessed by the nand architecture program.
The output buffer 1104 is adjusted to receive a mask input 4903. Preferably, the mask input 4903 includes four bits corresponding to the four texts of the output buffer 1104 that are associated with the four neural processing units 126 of the neural processing unit group 4901. Preferably, if the mask input 4903 bit corresponding to a text of the output buffer 1104 is true, that text of the output buffer 1104 retains its current value; otherwise, that text of the output buffer 1104 is updated with the output of the run function unit 212. That is, if the mask input 4903 bit corresponding to a text of the output buffer 1104 is false, the output of the run function unit 212 is written into that text of the output buffer 1104. In this way, an output instruction of the nand architecture program can selectively write the outputs of the run function units 212 into certain texts of the output buffer 1104 while leaving the current values of the other texts of the output buffer 1104 unchanged; examples are found in the nand architecture program of Figure 51 below, for example the instructions of addresses 6, 10, 13 and 14 in that figure. That is, the texts of the output buffer 1104 specified in the nand architecture program determine the values generated on the mask input 4903.
To simplify the illustration, Figure 49 does not show the inputs 1811 of the multitask buffers 208/705 (shown in Figures 18, 19 and 23). However, embodiments that simultaneously support dynamically configurable neural processing units 126 and the feedback/masking of the output buffer 1104 are also within the scope of the present invention. Preferably, in such embodiments the output buffer text groups are correspondingly dynamically configurable.
It should be understood that although the number of neural processing units 126 in a neural processing unit group 4901 of this embodiment is four, the present invention is not limited in this respect, and embodiments with more or fewer neural processing units 126 per group are within the scope of the present invention. Furthermore, for an embodiment having shared run function units 1112, as shown in Figure 52, there is a synergy between the number of neural processing units 126 in a neural processing unit group 4901 and the number of neural processing units 126 in a run function unit 212 group. The masking and feedback capability of the output buffer 1104 within a neural processing unit group is particularly helpful for improving the computational efficiency associated with shot and long term memory cells 4600, as described in detail below with respect to Figures 50 and 51.
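A minimal sketch, in Python, of the masking behavior just described; the function name and example values are illustrative only.

    def update_output_buffer(outbuf_group, afu_outputs, mask):
        # outbuf_group: the current four texts of one output buffer text group
        # afu_outputs:  the four run function unit 212 outputs of the group
        # mask:         four bits; a true bit means "keep the current text value"
        return [old if keep else new
                for old, new, keep in zip(outbuf_group, afu_outputs, mask)]

    # Example: the output instruction of address 6 of Figure 51 masks texts 0 to 2
    # (MASK[0:2]), so only OUTBUF[3] is updated with the run function result.
    assert update_output_buffer([1.0, 2.0, 3.0, 4.0],
                                [9.0, 9.0, 9.0, 9.0],
                                [True, True, True, False]) == [1.0, 2.0, 3.0, 9.0]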
Figure 50 is a block diagram showing an example of the data configuration in the data random access memory 122, the weight random access memory 124 and the output buffer 1104 of the neural network unit 121 of Figure 49 when the neural network unit 121 performs the computations associated with the layer of 128 shot and long term memory cells 4600 of Figure 46. In the example of Figure 50, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, for example in the wide configuration. As in the example of Figures 47 and 48, the shot and long term memory layer in the example of Figures 50 and 51 has only 128 shot and long term memory cells 4600; however, in the example of Figure 50, the values produced by all 512 neural processing units 126 are used. When the nand architecture program of Figure 51 is executed, each neural processing unit group 4901 collectively operates as one shot and long term memory cell 4600.
As shown in the figure, the data random access memory 122 loads memory cell input (X) and output (H) values for a sequence of time steps. More precisely, a pair of two columns loads the X values and H values, respectively, for a given time step. Taking a data random access memory 122 having 64 columns as an example, as shown in the figure, this data random access memory 122 can load the memory cell values used for 31 different time steps. In the example of Figure 50, columns 2 and 3 load the values used for time step 0, columns 4 and 5 load the values used for time step 1, and so on, with columns 62 and 63 loading the values used for time step 30. The first column of the pair loads the X values of the time step, and the second column loads the H values of the time step. As shown in the figure, each group of four rows of the data random access memory 122 corresponding to a neural processing unit group 4901 loads the values used by its corresponding shot and long term memory cell 4600. That is, rows 0 to 3 load the values associated with shot and long term memory cell 0, whose computations are performed by neural processing units 0-3, i.e., neural processing unit group 0; rows 4 to 7 load the values associated with shot and long term memory cell 1, whose computations are performed by neural processing units 4-7, i.e., neural processing unit group 1; and so on, with rows 508 to 511 loading the values associated with shot and long term memory cell 127, whose computations are performed by neural processing units 508-511, i.e., neural processing unit group 127, as described in detail below with respect to Figure 51. As shown in the figure, column 1 is not used, and column 0 loads the initial memory cell output (H) values, which for a preferred embodiment are filled with zero by the framework program; however, the present invention is not limited in this respect, and filling the initial memory cell output (H) values of column 0 with nand architecture program instructions is also within the scope of the present invention.
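A minimal sketch, in Python, of the indexing implied by this configuration; the helper names are illustrative only.

    def fig50_dram_columns(time_step):
        # Each time step occupies one pair of data RAM 122 columns: X then H.
        # Column 0 holds the initial H values; column 1 is unused.
        return {'X': 2 + 2 * time_step, 'H': 3 + 2 * time_step}

    def fig50_group_rows(cell):
        # Shot and long term memory cell `cell` (0..127) is computed by neural
        # processing unit group `cell`, i.e. rows 4*cell through 4*cell+3.
        return list(range(4 * cell, 4 * cell + 4))

    assert fig50_dram_columns(30) == {'X': 62, 'H': 63}
    assert fig50_group_rows(127) == [508, 509, 510, 511]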
Preferably, the X values (located in columns 2, 4, 6, and so on to column 62) are written/filled into the data random access memory 122 by the framework program executing on the processor 100 through MTNN instructions 1400, and are read/used by the nand architecture program executing on the neural network unit 121, such as the nand architecture program of Figure 51. Preferably, the H values (located in columns 3, 5, 7, and so on to column 63) are written/filled into the data random access memory 122 and read/used by the nand architecture program executing on the neural network unit 121, as described below. Preferably, the H values are also read by the framework program executing on the processor 100 through MFNN instructions 1500. It should be noted that the nand architecture program of Figure 51 assumes that, in each group of four rows corresponding to a neural processing unit group 4901 (e.g., rows 0-3, rows 4-7, rows 8-11, and so on to rows 508-511), the four X values of a given column are filled with the same value (e.g., by the framework program). Similarly, the nand architecture program of Figure 51 computes and writes the same value to the four H values of a given column in each group of four rows corresponding to a neural processing unit group 4901.
As shown in the figure, the weight random access memory 124 loads the weights, biases and memory cell state (C) values needed by the neural processing units of the neural network unit 121. In each group of four rows corresponding to a neural processing unit group 4901 (e.g., rows 0-3, rows 4-7, rows 8-11, and so on to rows 508-511): (1) the row whose row number divided by 4 leaves a remainder of 3 loads the values of Wc, Uc, Bc and C in its columns 0, 1, 2 and 6, respectively; (2) the row whose row number divided by 4 leaves a remainder of 2 loads the values of Wo, Uo and Bo in its columns 3, 4 and 5, respectively; (3) the row whose row number divided by 4 leaves a remainder of 1 loads the values of Wf, Uf and Bf in its columns 3, 4 and 5, respectively; and (4) the row whose row number divided by 4 leaves a remainder of 0 loads the values of Wi, Ui and Bi in its columns 3, 4 and 5, respectively. Preferably, the weights and bias values - Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo, Bo (in columns 0 to 5) - are written/filled into the weight random access memory 124 by the framework program executing on the processor 100 through MTNN instructions 1400, and are read/used by the nand architecture program executing on the neural network unit 121, such as the nand architecture program of Figure 51. Preferably, the intermediate C values are written into and read back from the weight random access memory 124 by the nand architecture program executing on the neural network unit 121, as described below.
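A minimal sketch, in Python, of which weight random access memory 124 words the nand architecture program of Figure 51 expects in each row of a group, based on the row number modulo 4; the helper name is illustrative only.

    def fig50_weight_row_contents(row):
        # Returns {column: value name} for one weight RAM 124 row of a group.
        r = row % 4
        if r == 3:
            return {0: 'Wc', 1: 'Uc', 2: 'Bc', 6: 'C'}   # C is written back by the program
        if r == 2:
            return {3: 'Wo', 4: 'Uo', 5: 'Bo'}
        if r == 1:
            return {3: 'Wf', 4: 'Uf', 5: 'Bf'}
        return {3: 'Wi', 4: 'Ui', 5: 'Bi'}               # remainder 0

    assert fig50_weight_row_contents(7) == {0: 'Wc', 1: 'Uc', 2: 'Bc', 6: 'C'}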
The example of Figure 50 assumes that the framework program performs the following steps: (1) for 31 different time steps, fill the data random access memory 122 with the input X values (columns 2, 4, and so on to column 62); (2) start the nand architecture program of Figure 51; (3) detect whether the nand architecture program has finished; (4) read the output H values from the data random access memory 122 (columns 3, 5, and so on to column 63); and (5) repeat steps (1) to (4) as many times as needed to complete a task, such as the computations required to recognize the speech of a mobile phone user.
In another manner of execution, the framework program performs the following steps: (1) for a single time step, fill the data random access memory 122 with the input X values (e.g., column 2); (2) start the nand architecture program (a modified version of the Figure 51 nand architecture program that does not loop and accesses only a single pair of columns of the data random access memory 122); (3) detect whether the nand architecture program has finished; (4) read the output H values from the data random access memory 122 (e.g., column 3); and (5) repeat steps (1) to (4) as many times as needed to complete the task. Which of these two ways is preferable depends on the manner in which the input X values to the shot and long term memory layer are sampled. For example, if the task allows the input to be sampled over multiple time steps (e.g., on the order of 31 time steps) before performing the computations, the first way may be preferable, since it is likely to be more efficient in computing resources and/or higher in performance; whereas if the task only allows sampling at a single time step, the second way must be used.
A third embodiment is similar to the second way above; however, rather than using a single pair of data random access memory 122 columns, the nand architecture program of this way uses multiple pairs of memory columns, that is, a different pair of columns for each time step, which is similar to the first way. Preferably, the framework program of this third embodiment includes a step before step (2) in which it updates the nand architecture program before the nand architecture program starts, for example by updating the data random access memory 122 column in the instruction of address 1 to point to the next pair of columns.
As shown in the figure, for the neural processing units 0 to 511 of the neural network unit 121, the output buffer 1104 loads the intermediate values of the memory cell output (H), the candidate memory cell state (C'), the input lock (I), the forget lock (F), the output lock (O), the memory cell state (C) and tanh(C) after execution of the instructions of different addresses of the nand architecture program of Figure 51. Within each output buffer text group (e.g., the group of four texts of the output buffer 1104 corresponding to a neural processing unit group 4901, such as texts 0-3, 4-7, 8-11, and so on to 508-511), the text whose text number divided by 4 leaves a remainder of 3 is denoted OUTBUF[3], the text whose text number divided by 4 leaves a remainder of 2 is denoted OUTBUF[2], the text whose text number divided by 4 leaves a remainder of 1 is denoted OUTBUF[1], and the text whose text number divided by 4 leaves a remainder of 0 is denoted OUTBUF[0].
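A minimal sketch of this naming, in Python; the helper name is illustrative only.

    def outbuf_name(text_number):
        # Text 0..511 of the output buffer 1104 belongs to neural processing unit
        # group text_number // 4 and is denoted OUTBUF[text_number % 4] within it.
        return (text_number // 4, 'OUTBUF[%d]' % (text_number % 4))

    assert outbuf_name(511) == (127, 'OUTBUF[3]')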
As shown in the figure, after execution of the instruction of address 2 of the nand architecture program of Figure 51, for each neural processing unit group 4901, all four texts of the output buffer 1104 are written with the initial memory cell output (H) value of the corresponding shot and long term memory cell 4600. After execution of the instruction of address 6, for each neural processing unit group 4901, the OUTBUF[3] text of the output buffer 1104 is written with the candidate memory cell state (C') value of the corresponding shot and long term memory cell 4600, while the other three texts of the output buffer 1104 retain their previous values. After execution of the instruction of address 10, for each neural processing unit group 4901, the OUTBUF[0] text of the output buffer 1104 is written with the input lock (I) value of the corresponding shot and long term memory cell 4600, the OUTBUF[1] text is written with the forget lock (F) value of the corresponding shot and long term memory cell 4600, the OUTBUF[2] text is written with the output lock (O) value of the corresponding shot and long term memory cell 4600, and the OUTBUF[3] text retains its previous value. After execution of the instruction of address 13, for each neural processing unit group 4901, the OUTBUF[3] text of the output buffer 1104 is written with the new memory cell state (C) value of the corresponding shot and long term memory cell 4600 (it is this C value, held in slot 3 of the output buffer 1104, that is written to column 6 of the weight random access memory 124, as described below with respect to Figure 51), while the other three texts of the output buffer 1104 retain their previous values. After execution of the instruction of address 14, for each neural processing unit group 4901, the OUTBUF[3] text of the output buffer 1104 is written with the tanh(C) value of the corresponding shot and long term memory cell 4600, while the other three texts of the output buffer 1104 retain their previous values. After execution of the instruction of address 16, for each neural processing unit group 4901, all four texts of the output buffer 1104 are written with the new memory cell output (H) value of the corresponding shot and long term memory cell 4600. The foregoing process of addresses 6 to 16 (that is, excluding the execution of address 2, since address 2 is not part of the program loop) repeats thirty more times as the instruction of address 17 loops back to address 3.
Figure 51 is a table showing a program stored in the program memory 129 of the neural network unit 121; this program is executed by the neural network unit 121 of Figure 49 and uses the data and weights according to the configuration of Figure 50 to perform the computations associated with the shot and long term memory cell layer. The example program of Figure 51 includes 18 nand architecture instructions located at addresses 0 to 17. The instruction of address 0 is an initialization instruction that clears the accumulator 202 and initializes the cycle counter 3804 to the value 31, to perform the loop group (the instructions of addresses 1 to 17) 31 times. This initialization instruction also initializes the data random access memory 122 column to be written (e.g., the buffer 2606 of Figures 26/39) to the value 1, which increases to 3 after the first execution of the instruction of address 16. Preferably, this initialization instruction also places the neural network unit 121 in the wide configuration, so that the neural network unit 121 is configured with 512 neural processing units 126. As described in the sections below, during execution of the instructions of addresses 0 to 17, the 128 neural processing unit groups 4901 made up of these 512 neural processing units 126 operate as the 128 corresponding shot and long term memory cells 4600.
The instructions of addresses 1 and 2 are not part of the program loop and execute only once. These instructions generate the initial memory cell output (H) value (e.g., zero) and write it to all texts of the output buffer 1104. The instruction of address 1 reads the initial H values from column 0 of the data random access memory 122 and places them in the accumulator 202, which was cleared by the instruction of address 0. The instruction of address 2 (OUTPUT PASSTHRU, NOP, CLR ACC) passes the accumulator 202 value through to the output buffer 1104, as shown in Figure 50. The "NOP" designation in the output instruction of address 2 (and in other output instructions of Figure 51) indicates that the output value is written only to the output buffer 1104 and not to memory, that is, not to the data random access memory 122 or the weight random access memory 124. The instruction of address 2 also clears the accumulator 202.
The instructions of addresses 3 to 17 are located in the program loop, which executes a number of times equal to the cycle count (e.g., 31).
Each execution of the instructions of addresses 3 to 6 computes the candidate memory cell state, tanh(C'), of the current time step and writes it to the text OUTBUF[3], which will be used by the instruction of address 11. More precisely, the multiply-accumulate instruction of address 3 reads the memory cell input (X) value associated with the time step from the current read column of the data random access memory 122 (e.g., columns 2, 4, 6, and so on to column 62), reads the Wc value from column 0 of the weight random access memory 124, and multiplies the values to produce a product that is accumulated into the accumulator 202, which was cleared by the instruction of address 2.
The multiply-accumulate instruction of address 4 (MULT-ACCUM OUTBUF[0], WR ROW 1) reads the H value from the text OUTBUF[0] (in all four neural processing units 126 of the neural processing unit group 4901), reads the Uc value from column 1 of the weight random access memory 124, and multiplies the values to produce a second product that is accumulated into the accumulator 202.
The add-weight-text-to-accumulator instruction of address 5 (ADD_W_ACC WR ROW 2) reads the Bc value from column 2 of the weight random access memory 124 and adds it to the accumulator 202.
The output order (OUTPUT TANH, NOP, MASK [0:2], CLR ACC) of address 6 can hold 202 numerical value of accumulator
Row tanh run function, and text OUTBUF [3] only are written into (that is, only neural processing unit group in implementing result
This result can be written in the neural processing unit 126 that number removes that 4 remainder is 3 in group 4901), also, accumulator 202 can be clear
It removes.That is, the output order of address 6 can cover text OUTBUF [0], OUTBUF [1] and OUTBUF [2] are (such as instruction art
Language MASK [0:2] is represented) and its current value is maintained, as shown in figure 50.In addition, the output order of address 6 can't be written
Memory (as represented by instructions nomenclature NOP).
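In conventional LSTM notation, the value that the instructions at addresses 3 to 6 accumulate and activate for each cell is the candidate cell state. A minimal sketch (illustrative Python with scalar per-cell values; not the patent's hardware) is:

```python
import math

def candidate_state(x, h_prev, Wc, Uc, Bc):
    """C' = tanh(Wc*x + Uc*h_prev + Bc), the value written to OUTBUF[3]."""
    acc = Wc * x            # address 3: multiply-accumulate X with Wc
    acc += Uc * h_prev      # address 4: multiply-accumulate H with Uc
    acc += Bc               # address 5: add bias Bc
    return math.tanh(acc)   # address 6: tanh activation, masked write to OUTBUF[3]
```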
Each execution of the instructions at addresses 7 to 10 computes the input gate (I), forget gate (F), and output gate (O) values for the current time step and writes them to words OUTBUF[0], OUTBUF[1], and OUTBUF[2], respectively, where they will be used by the instructions at addresses 11, 12, and 15. More precisely, the multiply-accumulate instruction at address 7 reads the memory cell input (X) value associated with the time step from the current read row of the data random access memory 122 (rows 2, 4, 6, and so on up to row 62), reads the Wi, Wf, and Wo values from row 3 of the weight random access memory 124, and multiplies them to generate a product that is added to the accumulator 202, which was cleared by the instruction at address 6. More precisely, within a neural processing unit group 4901, the neural processing unit 126 whose index mod 4 is 0 computes the product of X and Wi, the neural processing unit 126 whose index mod 4 is 1 computes the product of X and Wf, and the neural processing unit 126 whose index mod 4 is 2 computes the product of X and Wo.
The multiply-accumulate instruction at address 8 reads the H value from word OUTBUF[0] (in all four neural processing units 126 of the neural processing unit group 4901), reads the Ui, Uf, and Uo values from row 4 of the weight random access memory 124, and multiplies them to generate a second product that is added to the accumulator 202. More precisely, within a neural processing unit group 4901, the neural processing unit 126 whose index mod 4 is 0 computes the product of H and Ui, the neural processing unit 126 whose index mod 4 is 1 computes the product of H and Uf, and the neural processing unit 126 whose index mod 4 is 2 computes the product of H and Uo.
The add-weight-word-to-accumulator instruction at address 9 (ADD_W_ACC WR ROW 5) reads the Bi, Bf, and Bo values from row 5 of the weight random access memory 124 and adds them into the accumulator 202. More precisely, within a neural processing unit group 4901, the neural processing unit 126 whose index mod 4 is 0 adds the Bi value, the neural processing unit 126 whose index mod 4 is 1 adds the Bf value, and the neural processing unit 126 whose index mod 4 is 2 adds the Bo value.
The output instruction at address 10 (OUTPUT SIGMOID, NOP, MASK[3], CLR ACC) performs a sigmoid activation function on the accumulator 202 values and writes the computed I, F, and O values to words OUTBUF[0], OUTBUF[1], and OUTBUF[2], respectively; this instruction also clears the accumulator 202 and does not write to memory. That is, the output instruction at address 10 masks word OUTBUF[3] (as the instruction nomenclature MASK[3] indicates), leaving its current value (namely C') unchanged, as shown in Figure 50.
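Taken together, addresses 7 through 10 implement the standard LSTM gate equations, with the three gate computations spread across three NPUs of each group. A hedged scalar sketch (illustrative Python, not the patent's hardware) of what ends up in OUTBUF[0..2]:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_gates(x, h_prev, Wi, Ui, Bi, Wf, Uf, Bf, Wo, Uo, Bo):
    """Gate values written to OUTBUF[0], OUTBUF[1], OUTBUF[2] for one cell."""
    i = sigmoid(Wi * x + Ui * h_prev + Bi)   # input gate,  NPU with index mod 4 == 0
    f = sigmoid(Wf * x + Uf * h_prev + Bf)   # forget gate, NPU with index mod 4 == 1
    o = sigmoid(Wo * x + Uo * h_prev + Bo)   # output gate, NPU with index mod 4 == 2
    return i, f, o
```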
Each execution of the instructions at addresses 11 to 13 computes the new memory cell state (C) value produced by the current time step and writes it to row 6 of the weight random access memory 124 for use in the next time step (that is, by the instruction at address 12 on the next iteration of the loop); more precisely, the value is written to the word of row 6 that corresponds, within the four words of each neural processing unit group 4901, to the word whose index mod 4 is 3. In addition, each execution of the instruction at address 14 writes the tanh(C) value to OUTBUF[3] for use by the instruction at address 15.
More precisely, the multiply-accumulate instruction at address 11 (MULT-ACCUM OUTBUF[0], OUTBUF[3]) reads the input gate (I) value from word OUTBUF[0] and the candidate cell state (C') value from word OUTBUF[3], and multiplies them to produce a first product that is added to the accumulator 202, which was cleared by the instruction at address 10. More precisely, each of the four neural processing units 126 of a group 4901 computes the first product of the I value and the C' value.
The multiply-accumulate instruction at address 12 (MULT-ACCUM OUTBUF[1], WR ROW 6) instructs the neural processing units 126 to read the forget gate (F) value from word OUTBUF[1], to read their respective words from row 6 of the weight random access memory 124, and to multiply them to produce a second product that is added to the first product generated by the instruction at address 11 in the accumulator 202. More precisely, for the neural processing unit 126 of the group 4901 whose index mod 4 is 3, the word read from row 6 is the current cell state (C) value computed in the previous time step, and the sum of the first and second products is the new cell state (C). For the other three neural processing units 126 of the group 4901, however, the word read from row 6 is a don't-care value, since the accumulated values they produce will not be used, i.e., they will not be placed into the output buffer 1104 by the instructions at addresses 13 and 14 and will be cleared by the instruction at address 14. That is, only the new cell state (C) value produced by the neural processing unit 126 of the group 4901 whose index mod 4 is 3 is used, namely by the instructions at addresses 13 and 14. For the second through thirty-first executions of the instruction at address 12, the C value read from row 6 of the weight random access memory 124 is the value written by the instruction at address 13 on the previous iteration of the loop; for the first execution of the instruction at address 12, however, the C value in row 6 is an initial value written either by the architectural program before starting the non-architectural program of Figure 51 or by a modified version of the non-architectural program.
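The cell state update performed by addresses 11 and 12 (and written back by address 13) is the familiar LSTM recurrence. A brief sketch (illustrative Python, scalar per cell, not the patent's hardware):

```python
def new_cell_state(i_gate, f_gate, c_candidate, c_prev):
    """C = I*C' + F*C_prev, as accumulated by addresses 11 and 12."""
    acc = i_gate * c_candidate   # address 11: first product
    acc += f_gate * c_prev       # address 12: second product (row 6 holds C_prev)
    return acc                   # address 13 writes this back to row 6, word index 3
```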
The output instruction at address 13 (OUTPUT PASSTHRU, WR ROW 6, MASK[0:2]) passes the accumulator 202 value, i.e., the computed C value, through only to word OUTBUF[3] (that is, only the neural processing unit 126 of the group 4901 whose index mod 4 is 3 writes its computed C value to the output buffer 1104), and row 6 of the weight random access memory 124 is written with the updated contents of the output buffer 1104, as shown in Figure 50. That is, the output instruction at address 13 masks words OUTBUF[0], OUTBUF[1], and OUTBUF[2], leaving their current values (the I, F, and O values) unchanged. As noted above, only the C value in the word of row 6 that corresponds, within the four words of each group 4901, to index mod 4 equal to 3 is used, namely by the instruction at address 12; the non-architectural program therefore does not care about the values in words 0-2, 4-6, and so on through 508-510 of row 6 of the weight random access memory 124 (i.e., the I, F, and O values), as shown in Figure 50.
The output instruction at address 14 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) performs a tanh activation function on the accumulator 202 value and writes the computed tanh(C) value to word OUTBUF[3]; this instruction also clears the accumulator 202 and does not write to memory. Like the output instruction at address 13, the output instruction at address 14 masks words OUTBUF[0], OUTBUF[1], and OUTBUF[2], leaving their original values unchanged, as shown in Figure 50.
Each execution of the instructions at addresses 15 and 16 computes the memory cell output (H) value produced by the current time step and writes it to the next output row of the data random access memory 122 (advancing by two rows each time step), from which it will be read by the architectural program and used for the next time step (that is, by the instructions at addresses 3 and 7 on the next iteration of the loop). More precisely, the multiply-accumulate instruction at address 15 reads the output gate (O) value from word OUTBUF[2] and the tanh(C) value from word OUTBUF[3], and multiplies them to produce a product that is added to the accumulator 202, which was cleared by the instruction at address 14. More precisely, each of the four neural processing units 126 of a group 4901 computes the product of the O value and tanh(C).
The output instruction at address 16 passes the accumulator 202 value through and writes the computed H values to row 3 on the first execution, to row 5 on the second execution, and so on, up to row 63 on the thirty-first execution, as shown in Figure 50; these values are subsequently used by the instructions at addresses 4 and 8. In addition, as shown in Figure 50, the computed H values are placed into the output buffer 1104 for subsequent use by the instructions at addresses 4 and 8. The output instruction at address 16 also clears the accumulator 202. In one embodiment, the long short-term memory cell 4600 is designed such that the output instruction at address 16 (and/or the output instruction at address 22 of Figure 48) has an activation function, e.g., a sigmoid or hyperbolic tangent function, rather than passing the accumulator 202 value through.
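Combining this with the earlier steps, one iteration of the loop body performs a complete LSTM cell update for each of the 128 cells. A compact end-to-end sketch (illustrative Python, scalar per cell, not the patent's hardware) of the values the loop produces per time step:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_step(x, h_prev, c_prev, W, U, B):
    """One time step for one cell; W, U, B are dicts keyed by 'i', 'f', 'o', 'c'."""
    c_cand = math.tanh(W['c'] * x + U['c'] * h_prev + B['c'])   # addresses 3-6
    i = sigmoid(W['i'] * x + U['i'] * h_prev + B['i'])          # addresses 7-10
    f = sigmoid(W['f'] * x + U['f'] * h_prev + B['f'])
    o = sigmoid(W['o'] * x + U['o'] * h_prev + B['o'])
    c = i * c_cand + f * c_prev                                 # addresses 11-13
    h = o * math.tanh(c)                                        # addresses 14-16
    return h, c
```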
The loop instruction at address 17 decrements the loop counter 3804 and, if the new loop counter 3804 value is greater than zero, returns to the instruction at address 3.
It may thus be observed that, because of the feedback and masking capability of the output buffer 1104 in the embodiment of the neural network unit 121 of Figure 49, the number of instructions in the loop body of the non-architectural program of Figure 51 is reduced by approximately 34% relative to the non-architectural instructions of Figure 48. Furthermore, because of that feedback and masking capability, the memory layout in the data random access memory 122 used by the non-architectural program of Figure 51 accommodates approximately three times as many time steps as that of Figure 48. These improvements may be useful for certain architectural program applications that use the neural network unit 121 to perform long short-term memory cell layer computations, particularly applications in which the number of long short-term memory cells 4600 in the layer is less than or equal to 128.
The embodiments of Figures 47 to 51 assume that the weight and bias values remain the same in every time step. However, the present invention is not limited thereto; embodiments in which the weight and bias values change with the time steps also fall within its scope. In such embodiments, rather than filling the weight random access memory 124 with a single set of weight and bias values as shown in Figures 47 to 50, a different set of weight and bias values is filled in for each time step, and the weight random access memory 124 addresses of the non-architectural programs of Figures 48 to 51 are adjusted accordingly.
In the embodiments of Figures 47 to 51 described above, the weights, biases, and intermediate values (e.g., the C and C' values) are essentially stored in the weight random access memory 124, while the input and output values (e.g., the X and H values) are stored in the data random access memory 122. This is advantageous for embodiments in which the data random access memory 122 is dual-ported and the weight random access memory 124 is single-ported, because there is more traffic from the non-architectural and architectural programs to the data random access memory 122. However, because the weight random access memory 124 is larger, in another embodiment of the invention the memories to which the non-architectural and architectural programs write their values are swapped (i.e., the data random access memory 122 and the weight random access memory 124 are exchanged). That is, the W, U, B, C', tanh(C), and C values are stored in the data random access memory 122 and the X, H, I, F, and O values are stored in the weight random access memory 124 (an adjusted version of the embodiment of Figure 47); or the W, U, B, and C values are stored in the data random access memory 122 and the X and H values are stored in the weight random access memory 124 (an adjusted version of the embodiment of Figure 50). Because the weight random access memory 124 is larger, these embodiments can process more time steps in a single batch. For applications whose architectural programs perform computations using the neural network unit 121, this is advantageous for those applications that benefit from more time steps and for which the single-ported memory (e.g., the weight random access memory 124) provides sufficient bandwidth.
Figure 52 is a block diagram showing an embodiment of the neural network unit 121 in which the output buffering, masking, and feedback capability is provided within each neural processing unit group and the activation function unit 1112 is shared. The neural network unit 121 of Figure 52 is similar to the neural network unit 121 of Figure 47, and elements with the same reference numbers are likewise similar. However, the four activation function units 212 of Figure 49 are replaced in this embodiment by a single shared activation function unit 1112, which receives the four outputs 217 from the four accumulators 202 and generates four outputs to words OUTBUF[0], OUTBUF[1], OUTBUF[2], and OUTBUF[3]. The neural network unit 121 of Figure 52 otherwise operates in a manner similar to the embodiments described above with respect to Figures 49 to 51, and it operates the shared activation function unit 1112 in a manner similar to the embodiments described above with respect to Figures 11 to 13.
Figure 53 is a block diagram showing another embodiment of the layout of data within the data random access memory 122, the weight random access memory 124, and the output buffer 1104 of the neural network unit 121 of Figure 49 as it performs computations associated with a layer of 128 long short-term memory cells 4600 of Figure 46. The example of Figure 53 is similar to the example of Figure 50. However, in Figure 53 the Wi, Wf, and Wo values are located in row 0 (rather than row 3 as in Figure 50); the Ui, Uf, and Uo values are located in row 1 (rather than row 4); the Bi, Bf, and Bo values are located in row 2 (rather than row 5); and the C values are located in row 3 (rather than row 6). The contents of the output buffer 1104 of Figure 53 are also similar to those of Figure 50; however, because of the differences between the non-architectural programs of Figure 54 and Figure 51, the third set of output buffer 1104 contents shown (the I, F, O, and C' values) appears after execution of the instruction at address 7 (rather than address 10 as in Figure 50); the fourth set (the I, F, O, and C values) appears after execution of the instruction at address 10 (rather than address 13); the fifth set (the I, F, O, and tanh(C) values) appears after execution of the instruction at address 11 (rather than address 14); and the sixth set (the H values) appears after execution of the instruction at address 13 (rather than address 16), as described below.
Figure 54 is a table showing a program stored in the program memory 129 of the neural network unit 121. This program is executed by the neural network unit 121 of Figure 49 and uses data and weights according to the configuration of Figure 53 to perform computations associated with a long short-term memory cell layer. The example program of Figure 54 is similar to the program of Figure 51. More precisely, the instructions at addresses 0 to 5 are identical in Figures 54 and 51; the instructions at addresses 7 and 8 of Figure 54 are identical to the instructions at addresses 10 and 11 of Figure 51; and the instructions at addresses 10 to 14 of Figure 54 are identical to the instructions at addresses 13 to 17 of Figure 51.
However, the instruction at address 6 of Figure 54 does not clear the accumulator 202 (whereas the instruction at address 6 of Figure 51 does). Furthermore, the instructions at addresses 7 to 9 of Figure 51 are not present in the non-architectural program of Figure 54. Finally, the instruction at address 9 of Figure 54 is identical to the instruction at address 12 of Figure 51, except that the instruction at address 9 of Figure 54 reads row 3 of the weight random access memory 124, whereas the instruction at address 12 of Figure 51 reads row 6 of the weight random access memory 124.
Because of the differences between the non-architectural programs of Figure 54 and Figure 51, the layout of Figure 53 uses three fewer rows of the weight random access memory 124, and the program loop contains three fewer instructions. The loop body of the non-architectural program of Figure 54 is essentially only half the size of the loop body of the non-architectural program of Figure 48, and essentially only 80% of the size of the loop body of the non-architectural program of Figure 51.
Figure 55 is a block diagram showing portions of a neural processing unit 126 according to another embodiment of the present invention. More precisely, for a single one of the neural processing units 126 of Figure 49, the figure shows the multitask buffer 208 with its associated inputs 207, 211, and 4905, and the multitask buffer 705 with its associated inputs 206, 711, and 4907. In addition to the inputs of Figure 49, the multitask buffer 208 and the multitask buffer 705 of the neural processing unit 126 each receive an index-within-group (index_within_group) input 5599. The index-within-group input 5599 indicates the index of the particular neural processing unit 126 within its neural processing unit group 4901. Thus, for example, in an embodiment in which each neural processing unit group 4901 has four neural processing units 126, within each neural processing unit group 4901 one of the neural processing units 126 receives a value of zero on its index-within-group input 5599, one receives a value of one, one receives a value of two, and one receives a value of three. In other words, the value of the index-within-group input 5599 received by a neural processing unit 126 is its number within the neural network unit 121 modulo J, where J is the number of neural processing units 126 in a neural processing unit group 4901. Thus, for example, neural processing unit 73 receives a value of one on its index-within-group input 5599, neural processing unit 353 receives a value of three, and neural processing unit 6 receives a value of two.
Furthermore, when the control input 213 specifies a predetermined value, denoted here "SELF", the multitask buffer 208 selects the output buffer 1104 output 4905 corresponding to the value of the index-within-group input 5599. Therefore, when a non-architectural instruction specifies with a SELF value that its data be received from the output buffer 1104 (denoted OUTBUF[SELF] in the instructions at addresses 2 and 7 of Figure 57), the multitask buffer 208 of each neural processing unit 126 receives its corresponding word from the output buffer 1104. Thus, for example, when the neural network unit 121 executes the non-architectural instructions at addresses 2 and 7 of Figure 57, the multitask buffer 208 of neural processing unit 73 selects the second (index 1) of its four inputs 4905 to receive word 73 from the output buffer 1104, the multitask buffer 208 of neural processing unit 353 selects the fourth (index 3) of its four inputs 4905 to receive word 353 from the output buffer 1104, and the multitask buffer 208 of neural processing unit 6 selects the third (index 2) of its four inputs 4905 to receive word 6 from the output buffer 1104. Although it is not used in the non-architectural program of Figure 57, a non-architectural instruction may also specify with a SELF value (OUTBUF[SELF]) that its data be received from the output buffer 1104 by causing the control input 713 to specify the predetermined value, so that the multitask buffer 705 of each neural processing unit 126 receives its corresponding word from the output buffer 1104.
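The selection just described is simple modular arithmetic; the following sketch (illustrative Python; the helper names are ours, not the patent's) shows how the index-within-group value determines which output buffer word an OUTBUF[SELF] access returns.

```python
# Hedged model of the OUTBUF[SELF] selection, assuming J = 4 NPUs per group.
J = 4

def index_within_group(npu):
    """Value presented on input 5599: the NPU's number modulo J."""
    return npu % J

def outbuf_self_word(npu, out_buffer):
    """Word an NPU's multitask buffer receives when the instruction names OUTBUF[SELF]."""
    group_base = (npu // J) * J                                # first word of the NPU's group
    return out_buffer[group_base + index_within_group(npu)]    # == out_buffer[npu]

# Example: NPU 73 selects input index 1 within its group and receives word 73.
buf = list(range(512))
assert index_within_group(73) == 1 and outbuf_self_word(73, buf) == 73
```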
Figure 56 is a block diagram showing an example of the layout of data within the data random access memory 122 and the weight random access memory 124 of the neural network unit 121 as it performs computations associated with the Jordan recurrent neural network of Figure 43 while using the embodiment of Figure 55. The weight layout within the weight random access memory 124 is identical to that of the example of Figure 44. The layout of values within the data random access memory 122 is similar to that of the example of Figure 44, except that in this example each time step has a corresponding pair of two memory rows holding the input layer node D values and the output layer node Y values, rather than a group of four rows as in the example of Figure 44. That is, in this example the hidden layer Z values and content layer C values are not written to the data random access memory 122; instead, the output buffer 1104 serves as a scratchpad for the hidden layer Z values and content layer C values, as described in detail with respect to the non-architectural program of Figure 57. The OUTBUF[SELF] feedback feature of the output buffer 1104 described above makes the non-architectural program run faster (by replacing two writes to and two reads from the data random access memory 122 per time step with two writes to and two reads from the output buffer 1104) and reduces the data random access memory 122 space used by each time step, so that the data random access memory 122 in this embodiment holds approximately twice as many time steps as in the embodiment of Figures 44 and 45, namely 32 time steps, as shown.
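A small helper (illustrative Python, reflecting the row arithmetic described above and in the program of Figure 57) makes the two-rows-per-time-step addressing explicit.

```python
# Hedged sketch of the Figure 56 layout: with two data RAM rows per time step,
# input D values for step t sit in row 2*t and output Y values in row 2*t + 1.
def dram_rows_for_step(t):
    return 2 * t, 2 * t + 1        # (row read by addresses 4-5, row written by address 9)

assert dram_rows_for_step(0) == (0, 1)
assert dram_rows_for_step(31) == (62, 63)   # 32 time steps fit in 64 rows
```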
Figure 57 is a table showing a program stored in the program memory 129 of the neural network unit 121. This program is executed by the neural network unit 121 and uses data and weights according to the configuration of Figure 56 to implement a Jordan recurrent neural network. The non-architectural program of Figure 57 is similar to that of Figure 45; the differences are described below.
The example program of Figure 57 includes 12 non-architectural instructions at addresses 0 to 11. The initialize instruction at address 0 clears the accumulator 202 and initializes the loop counter 3804 to a value of 32 so that the loop body (the instructions of addresses 2 to 11) is performed 32 times. The output instruction at address 1 places the zero values of the accumulator 202 (cleared by the instruction at address 0) into the output buffer 1104. It may be observed that the 512 neural processing units 126 correspond to, and operate as, the 512 hidden layer nodes Z during execution of the instructions at addresses 2 to 6, and correspond to, and operate as, the 512 output layer nodes Y during execution of the instructions at addresses 7 to 10. That is, the 32 executions of the instructions at addresses 2 to 6 compute the hidden layer node Z values for the 32 corresponding time steps and place them into the output buffer 1104 for use by the corresponding 32 executions of the instructions at addresses 7 to 9, which compute the output layer node Y values for those 32 time steps and write them to the data random access memory 122, and for use by the corresponding 32 executions of the instruction at address 10, which places the content layer node C values for those 32 time steps into the output buffer 1104. (The content layer node C values placed into the output buffer 1104 for the thirty-second time step are not used.)
On the first execution of the instructions at addresses 2 and 3 (ADD_D_ACC OUTBUF[SELF] and ADD_D_ACC ROTATE, COUNT=511), each of the 512 neural processing units 126 adds the 512 content node C values of the output buffer 1104 into its accumulator 202; those content node C values were generated and written by the execution of the instructions at addresses 0 and 1. On the second execution of the instructions at addresses 2 and 3, each of the 512 neural processing units 126 again adds the 512 content node C values of the output buffer 1104 into its accumulator 202; those content node C values were generated and written by the execution of the instructions at addresses 7, 8, and 10. More precisely, the instruction at address 2 instructs the multitask buffer 208 of each neural processing unit 126 to select its corresponding output buffer 1104 word, as described above, and to add it into the accumulator 202; the instruction at address 3 instructs the neural processing units 126 to rotate the content node C values within the 512-word rotator formed by the collective operation of the connected multitask buffers 208 of the 512 neural processing units, which allows each neural processing unit 126 to add all 512 content node C values into its accumulator 202. The instruction at address 3 does not clear the accumulator 202, so the instructions at addresses 4 and 5 can add the input layer D values (each multiplied by its respective weight) to the content layer node C values accumulated by the instructions at addresses 2 and 3.
On each execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW+2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 input node D values in the data random access memory 122 row associated with the current time step (row 0 for time step 0, row 2 for time step 1, and so on, up to row 62 for time step 31) by the weights in the word corresponding to that neural processing unit 126 in rows 0 to 511 of the weight random access memory 124, to generate 512 products that, together with the accumulation of the 512 content node C values performed by the instructions at addresses 2 and 3, are accumulated into the accumulator 202 of the corresponding neural processing unit 126 to compute the hidden node Z layer values.
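In conventional terms, addresses 2 through 5 accumulate, for each hidden node, the sum of all content node values plus a weighted sum of the input nodes. A hedged sketch (illustrative Python with NumPy, not the patent's hardware; the weight orientation is an assumption made for clarity):

```python
import numpy as np

def hidden_layer_accumulate(d_in, c_content, W_in):
    """Accumulator values for the 512 hidden nodes Z before the address-6 pass-through.

    d_in      : (512,) input layer D values for this time step (addresses 4-5)
    c_content : (512,) content node C values from the output buffer (addresses 2-3)
    W_in      : (512, 512) weights; W_in[z, d] connects input node d to hidden node z
    """
    return c_content.sum() + W_in @ d_in   # each node adds all 512 C values, then D times W
```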
On each execution of the instruction at address 6 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to their corresponding words of the output buffer 1104, and the accumulators 202 are cleared.
During execution of the instructions at addresses 7 and 8 (MULT-ACCUM OUTBUF[SELF], WR ROW 512 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), each of the 512 neural processing units 126 performs 512 multiply operations, multiplying the 512 hidden node Z values in the output buffer 1104 (generated and written by the corresponding execution of the instructions at addresses 2 to 6) by the weights in the word corresponding to that neural processing unit 126 in rows 512 to 1023 of the weight random access memory 124, to generate 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126.
On each execution of the instruction at address 9 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW+2), an activation function (e.g., hyperbolic tangent, sigmoid, or rectify) is performed on the 512 accumulated values to compute the output node Y values, which are written to the data random access memory 122 row corresponding to the current time step (row 1 for time step 0, row 3 for time step 1, and so on, up to row 63 for time step 31). The instruction at address 9 does not clear the accumulator 202.
On each execution of the instruction at address 10 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 values accumulated by the instructions at addresses 7 and 8 are placed into the output buffer 1104 for use by the next execution of the instructions at addresses 2 and 3, and the accumulators 202 are cleared.
The loop instruction at address 11 decrements the loop counter 3804 and, if the new loop counter 3804 value is still greater than zero, returns to the instruction at address 2.
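Putting addresses 2 through 11 together, each loop iteration performs one Jordan-network time step, with the content nodes taking the pre-activation accumulator values as the text explains next. A hedged sketch of one iteration (illustrative Python with NumPy; the function and variable names are ours):

```python
import numpy as np

def jordan_step(d_in, c_prev, W_in, W_out, activation=np.tanh):
    """One loop iteration of the Figure 57 program, per the description above.

    Returns (y_out, c_next): the output layer values written to the data RAM by
    address 9, and the pre-activation accumulator values that address 10 places
    into the output buffer as the next content node C values.
    """
    z_hidden = c_prev.sum() + W_in @ d_in   # addresses 2-6 (pass-through, no activation)
    acc = W_out @ z_hidden                  # addresses 7-8
    y_out = activation(acc)                 # address 9, written to the data RAM
    c_next = acc                            # address 10: content nodes get the
                                            # pre-activation accumulator values
    return y_out, c_next
```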
As described in the sections corresponding to Figure 44, in this example that uses the non-architectural program of Figure 57 to perform the Jordan recurrent neural network, although an activation function is applied to the accumulator 202 values to generate the output layer node Y values, the example assumes that the accumulator 202 values are passed to the content layer nodes C before the activation function is applied, rather than the actual output layer node Y values. For a Jordan recurrent neural network in which the activation function is applied to the accumulator 202 values to generate the content layer node C values, the instruction at address 10 would be removed from the non-architectural program of Figure 57. In the embodiments described herein, the Elman or Jordan recurrent neural network has a single hidden node layer (e.g., Figures 40 and 42); however, it should be understood that embodiments of these processors 100 and neural network units 121 can efficiently perform, in a manner similar to that described herein, the computations associated with recurrent neural networks having multiple hidden layers.
As described above in the sections corresponding to Figure 2, each neural processing unit 126 operates as a neuron in an artificial neural network, and all of the neural processing units 126 of the neural network unit 121 effectively compute, in a massively parallel fashion, the neuron output values of one layer of the network. This parallel processing by the neural network unit, in particular the use of the rotator formed collectively by the neural processing unit multitask buffers, is not what the conventional way of computing the outputs of a neuron layer intuitively suggests. More specifically, the conventional approach typically involves performing the computations associated with a single neuron, or a very small subset of neurons (e.g., performing the multiplies and adds with parallel arithmetic units), and then proceeding to the computations associated with the next neuron of the same layer, and so on in serial fashion, until the computations have been performed for all of the neurons of the layer. In contrast, in each clock cycle all of the neural processing units 126 (neurons) of the neural network unit 121 of the present invention perform in parallel a small set of the computations (e.g., a single multiply and accumulate) needed to generate all of the neuron outputs. After approximately M clock cycles, where M is the number of nodes connected to the current layer, the neural network unit 121 has computed the outputs of all of the neurons. For many artificial neural network configurations, because of the large number of neural processing units 126, the neural network unit 121 can compute the neuron output values of all of the neurons of the entire layer at the end of the M clock cycles. As described herein, this computation is efficient for all kinds of artificial neural networks, including but not limited to feedforward and recurrent neural networks such as Elman, Jordan, and long short-term memory networks. Finally, although in the embodiments herein the neural network unit 121 is configured as 512 neural processing units 126 (e.g., in a wide word configuration) to perform the recurrent neural network computations, the present invention is not limited thereto; embodiments in which the neural network unit 121 is configured as 1024 neural processing units 126 (e.g., in a narrow word configuration) to perform the recurrent neural network computations, as well as neural network units 121 with other numbers of neural processing units 126 than 512 and 1024, as mentioned above, also fall within the scope of the present invention.
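The rotate-and-accumulate schedule described in this paragraph can be captured in a few lines. A hedged sketch (illustrative Python with NumPy, assuming a fully connected layer whose input count equals the NPU count and one consistent choice of weight ordering; not the patent's hardware):

```python
import numpy as np

def rotator_matvec(W, x):
    """Compute y[j] = sum_k W[j, k] * x[k] with the rotator schedule.

    Each of the n NPUs does one multiply-accumulate per "clock"; the word held in
    the mux-reg rotator shifts by one NPU each clock, so after n clocks every
    neuron's sum is complete.
    """
    n = len(x)
    acc = np.zeros(n)
    held = x.copy()                       # word initially loaded into each NPU's mux-reg
    for step in range(n):                 # one clock per connected node
        for j in range(n):                # the hardware does these n MACs in parallel
            acc[j] += W[j, (j + step) % n] * held[j]
        held = np.roll(held, -1)          # collective rotate: NPU j now holds its neighbor's word
    return acc

# Quick check against a direct matrix-vector product.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)); x = rng.standard_normal(8)
assert np.allclose(rotator_matvec(W, x), W @ x)
```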
The foregoing description presents only preferred embodiments of the present invention and is not intended to limit the scope of the invention; all simple equivalent changes and modifications made in accordance with the claims and the description of the invention remain within the scope covered by this patent. For example, software can perform the functions, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described herein. This can be accomplished through general programming languages (e.g., C, C++), hardware description languages (HDL) including Verilog HDL, VHDL, and so forth, or other available programs. Such software can be disposed in any known computer-usable medium, such as magnetic tape, semiconductor, magnetic disk, optical disc (e.g., CD-ROM, DVD-ROM), a network connection, wireless link, or other communication medium. Embodiments of the apparatus and methods described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (e.g., embodied in a hardware description language), and transformed into hardware through the fabrication of integrated circuits. In addition, the apparatus and methods described herein may be embodied as a combination of hardware and software. Therefore, none of the embodiments described herein is intended to limit the scope of the present invention. Moreover, the present invention may be applied to a microprocessor device of a general-purpose computer. Finally, those of ordinary skill in the art can, based on the concepts and embodiments disclosed herein, design and adjust different structures to achieve the same purposes without departing from the scope of the present invention.
Claims (25)
1. A neural network device, characterized by comprising:
an output buffer, to hold N texts, the N texts being distributed into N/J mutually exclusive output buffering text groups, each output buffering text group having J texts of the N texts, J being greater than 2 and N being at least twice J;
an array of N processing units, the N processing units being distributed into N/J mutually exclusive processing unit groups, each processing unit group having J processing units of the N processing units and corresponding to one of the N/J output buffering text groups, each processing unit comprising:
first and second multitask buffers, each multitask buffer comprising:
at least J+1 inputs, a first input of the J+1 inputs receiving an operand from a memory, and the other J inputs of the J+1 inputs receiving the J texts of the corresponding output buffering text group;
an output; and
a control input, to control which of the J+1 inputs is selected to be provided to the output;
an accumulator, having an output provided to a corresponding one of the N output buffering texts; and
an arithmetic unit, having first, second, and third inputs, the first and second inputs receiving the outputs of the first and second multitask buffers respectively, the third input receiving the output of the accumulator, the arithmetic unit performing an operation on the first, second, and third inputs to generate a result that is added to the accumulator;
wherein the output buffer includes a shielding input, to control which of the N texts maintain their original values and which are updated with the outputs of their corresponding accumulators;
wherein each processing unit group of the N/J processing unit groups, with its J processing units, operates as a shot and long term memory cell of a time recurrent neural network, a first processing unit of the J processing units computing an input lock of the shot and long term memory cell, a second processing unit of the J processing units computing a forgetting lock of the shot and long term memory cell, and a third processing unit of the J processing units computing an output lock of the shot and long term memory cell.
2. neural network device according to claim 1, which is characterized in that shielding input is specified to utilize this J processing
The first, the second input lock calculated separately out with third processing unit in unit, the forgetting lock and the output lock update
The first, the second and third text in the J text of the corresponding output buffering text group.
3. neural network device according to claim 2, which is characterized in that in the J processing unit this first, the second
The input lock is calculated simultaneously with third processing unit, the forgetting lock and the output lock.
4. neural network device according to claim 2, which is characterized in that the fourth process in the J processing unit
Unit calculates a candidate state of the shot and long term memory cell.
5. neural network device according to claim 4, which is characterized in that shielding input is specified to be remembered using the shot and long term
The candidate state for recalling born of the same parents updates corresponding one the 4th text exported in the J text for buffering text group, but maintains
This it is corresponding output buffering text group the J text in this first, the second and third text current value.
6. neural network device according to claim 4, which is characterized in that one of the J processing unit utilizes
The input lock, the forgetting lock, the candidate state of the shot and long term memory cell are calculated with a current state of the shot and long term memory cell
A new state and its run function for the shot and long term memory cell.
7. neural network device according to claim 6, which is characterized in that further include:
One memory, one of the J processing unit read the current state of the shot and long term memory cell from the memory,
And the memory is written in the new state of the shot and long term memory cell by the output buffer.
8. neural network device according to claim 6, which is characterized in that one of the J processing unit utilizes
The run function of the new state of the output lock and the shot and long term memory cell calculates a new output of the shot and long term memory cell.
9. neural network device according to claim 8, which is characterized in that further include:
One memory, the J processing unit read a current output of the shot and long term memory cell, and the output from the memory
The memory is written in the new output of the shot and long term memory cell by buffer.
10. neural network device according to claim 1, which is characterized in that in the J processing unit this first, the
Two are currently exported with corresponding weight using the one of the shot and long term memory cell with third processing unit and are remembered using the shot and long term
The new input and corresponding weight for recalling born of the same parents, calculate separately the input lock, the forgetting lock, with the output lock.
11. neural network device according to claim 10, which is characterized in that in the J processing unit this first, the
Two read the current output from the output buffer with third processing unit.
12. neural network device according to claim 10, which is characterized in that further include:
One memory, the first, the second with third processing unit to read this from the memory new defeated for this in the J processing unit
Enter.
13. neural network device according to claim 10, which is characterized in that further include:
One memory, this in the J processing unit the first, the second read the weight from the memory with third processing unit.
14. A method for operating a neural network device, characterized in that the device has an output buffer and an array of N processing units, the output buffer to hold N texts, the N texts being distributed into N/J mutually exclusive output buffering text groups, each output buffering text group having J texts of the N texts, J being greater than 2 and N being at least twice J, the N processing units being distributed into N/J mutually exclusive processing unit groups, each processing unit group having J processing units of the N processing units and corresponding to one of the N/J output buffering text groups, the output buffer including a shielding input to control which of the N texts maintain their original values and which are updated with the outputs of their corresponding accumulators, each processing unit having first and second multitask buffers, an accumulator, and an arithmetic unit, each multitask buffer having an output, the accumulator having an output provided to a corresponding one of the N output buffering texts, the arithmetic unit having first, second, and third inputs, the first and second inputs receiving the outputs of the first and second multitask buffers respectively, the third input receiving the output of the accumulator, the arithmetic unit performing an operation on the first, second, and third inputs to generate a result that is added to the accumulator, each of the first and second multitask buffers including at least J+1 inputs and a control input, a first input of the J+1 inputs receiving an operand from a memory, the other J inputs of the J+1 inputs receiving the J texts of the corresponding output buffering text group, and the control input controlling which of the J+1 inputs is selected to be provided to the output, the method comprising:
operating each processing unit group of the N/J processing unit groups, with its J processing units, as a shot and long term memory cell of a time recurrent neural network, to perform the following steps:
computing an input lock of the shot and long term memory cell using a first processing unit of the J processing units;
computing a forgetting lock of the shot and long term memory cell using a second processing unit of the J processing units; and
computing an output lock of the shot and long term memory cell using a third processing unit of the J processing units.
15. according to the method for claim 14, which is characterized in that further include:
It is inputted using the shielding, this specified using in the J processing unit the first, the second is calculated separately with third processing unit
The input lock out, the forgetting lock and the output lock update the in the J text of the corresponding output buffering text group
One, second with third text.
16. according to the method for claim 15, which is characterized in that further include:
Using in the J processing unit this first, the second with third processing unit calculate the input lock simultaneously, the forgetting lock with
The output lock.
17. according to the method for claim 15, which is characterized in that further include:
A candidate state of the shot and long term memory cell is calculated using the fourth processing unit in the J processing unit.
18. according to the method for claim 17, which is characterized in that further include:
Specified candidate state using the shot and long term memory cell, which is inputted, using the shielding updates the corresponding output buffering text
One the 4th text in the J text of group, but the corresponding output is maintained to buffer being somebody's turn to do in the J text of text group
The first, the second with the current value of third text.
19. according to the method for claim 17, which is characterized in that further include:
Using one of the J processing unit, the input lock, the forgetting lock, the candidate of the shot and long term memory cell are utilized
State calculates the new state and its run function of the shot and long term memory cell with a current state of the shot and long term memory cell.
20. according to the method for claim 19, which is characterized in that further include:
The current state of the shot and long term memory cell is read from a memory using one of the J processing unit;And
The memory is written into the new state of the shot and long term memory cell using the output buffer.
21. according to the method for claim 19, which is characterized in that further include:
Using one of the J processing unit, opened using this of the output lock and the new state of the shot and long term memory cell
Dynamic function calculates a new output of the shot and long term memory cell.
22. according to the method for claim 21, which is characterized in that further include:
A current output of the shot and long term memory cell is read from a memory using the J processing unit;And
The memory is written into the new output of the shot and long term memory cell using the output buffer.
23. according to the method for claim 14, which is characterized in that further include:
Using in the J processing unit this first, the second with third processing unit, it is current using the one of the shot and long term memory cell
Output calculates separately the input with corresponding weight and using a newly input of the shot and long term memory cell and corresponding weight
Lock, the forgetting lock, with the output lock.
24. according to the method for claim 23, which is characterized in that further include:
Using this in the J processing unit, the first, the second with third processing unit to read this from the output buffer current defeated
Out.
25. A non-transitory computer-usable medium, for use by a computing device and encoded with a computer program, characterized by comprising:
computer-usable program code embodied in the medium, to describe a neural network device, the computer-usable program code comprising:
first program code, to describe an output buffer, the output buffer to hold N texts, the N texts being distributed into N/J mutually exclusive output buffering text groups, each output buffering text group having J texts of the N texts, J being greater than 2 and N being at least twice J;
second program code, to describe an array of N processing units, the N processing units being distributed into N/J mutually exclusive processing unit groups, each processing unit group having J processing units of the N processing units and corresponding to one of the N/J output buffering text groups, each processing unit comprising:
first and second multitask buffers, each multitask buffer comprising:
at least J+1 inputs, a first input of the J+1 inputs receiving an operand from a memory, and the other J inputs of the J+1 inputs receiving the J texts of the corresponding output buffering text group;
an output; and
a control input, to control which of the J+1 inputs is selected to be provided to the output;
an accumulator, having an output provided to a corresponding one of the N output buffering texts; and
an arithmetic unit, having first, second, and third inputs, the first and second inputs receiving the outputs of the first and second multitask buffers respectively, the third input receiving the output of the accumulator, the arithmetic unit performing an operation on the first, second, and third inputs to generate a result that is added to the accumulator;
wherein the output buffer includes a shielding input, to control which of the N texts maintain their original values and which are updated with the outputs of their corresponding accumulators;
wherein each processing unit group of the N/J processing unit groups, with its J processing units, operates as a shot and long term memory cell of a time recurrent neural network, a first processing unit of the J processing units computing an input lock of the shot and long term memory cell, a second processing unit of the J processing units computing a forgetting lock of the shot and long term memory cell, and a third processing unit of the J processing units computing an output lock of the shot and long term memory cell.
Applications Claiming Priority (48)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201562239254P | 2015-10-08 | 2015-10-08 | |
US62/239,254 | 2015-10-08 | ||
US201562262104P | 2015-12-02 | 2015-12-02 | |
US62/262,104 | 2015-12-02 | ||
US201662299191P | 2016-02-24 | 2016-02-24 | |
US62/299,191 | 2016-02-24 | ||
US15/090,705 US10353861B2 (en) | 2015-10-08 | 2016-04-05 | Mechanism for communication between architectural program running on processor and non-architectural program running on execution unit of the processor regarding shared resource |
US15/090,794 US10353862B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit that performs stochastic rounding |
US15/090,794 | 2016-04-05 | ||
US15/090,672 | 2016-04-05 | ||
US15/090,798 US10585848B2 (en) | 2015-10-08 | 2016-04-05 | Processor with hybrid coprocessor/execution unit neural network unit |
US15/090,807 US10380481B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit that performs concurrent LSTM cell calculations |
US15/090,708 US10346350B2 (en) | 2015-10-08 | 2016-04-05 | Direct execution by an execution unit of a micro-operation loaded into an architectural register file by an architectural instruction of a processor |
US15/090,814 US10552370B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with output buffer feedback for performing recurrent neural network computations |
US15/090,665 US10474627B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with neural memory and array of neural processing units that collectively shift row of data received from neural memory |
US15/090,696 US10380064B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit employing user-supplied reciprocal for normalizing an accumulated value |
US15/090,712 | 2016-04-05 | ||
US15/090,814 | 2016-04-05 | ||
US15/090,722 US10671564B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit that performs convolutions using collective shift register among array of neural processing units |
US15/090,678 US10509765B2 (en) | 2015-10-08 | 2016-04-05 | Neural processing unit that selectively writes back to neural memory either activation function output or accumulator value |
US15/090,708 | 2016-04-05 | ||
US15/090,696 | 2016-04-05 | ||
US15/090,705 | 2016-04-05 | ||
US15/090,801 | 2016-04-05 | ||
US15/090,796 US10228911B2 (en) | 2015-10-08 | 2016-04-05 | Apparatus employing user-specified binary point fixed point arithmetic |
US15/090,829 | 2016-04-05 | ||
US15/090,666 US10275393B2 (en) | 2015-10-08 | 2016-04-05 | Tri-configuration neural network unit |
US15/090,691 | 2016-04-05 | ||
US15/090,678 | 2016-04-05 | ||
US15/090,798 | 2016-04-05 | ||
US15/090,665 | 2016-04-05 | ||
US15/090,796 | 2016-04-05 | ||
US15/090,823 | 2016-04-05 | ||
US15/090,712 US10366050B2 (en) | 2015-10-08 | 2016-04-05 | Multi-operation neural network unit |
US15/090,672 US10353860B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with neural processing units dynamically configurable to process multiple data sizes |
US15/090,829 US10346351B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with output buffer feedback and masking capability with processing unit groups that operate as recurrent neural network LSTM cells |
US15/090,701 US10474628B2 (en) | 2015-10-08 | 2016-04-05 | Processor with variable rate execution unit |
US15/090,701 | 2016-04-05 | ||
US15/090,691 US10387366B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with shared activation function units |
US15/090,727 | 2016-04-05 | ||
US15/090,666 | 2016-04-05 | ||
US15/090,669 | 2016-04-05 | ||
US15/090,722 | 2016-04-05 | ||
US15/090,823 US10409767B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with neural memory and array of neural processing units and sequencer that collectively shift row of data received from neural memory |
US15/090,801 US10282348B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with output buffer feedback and masking capability |
US15/090,727 US10776690B2 (en) | 2015-10-08 | 2016-04-05 | Neural network unit with plurality of selectable output functions |
US15/090,807 | 2016-04-05 | ||
US15/090,669 US10275394B2 (en) | 2015-10-08 | 2016-04-05 | Processor with architectural neural network execution unit |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106599992A CN106599992A (en) | 2017-04-26 |
CN106599992B true CN106599992B (en) | 2019-04-09 |
Family
ID=58556056
Family Applications (6)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610866127.0A Active CN106599990B (en) | 2015-10-08 | 2016-09-29 | The neural pe array that neural network unit and collective with neural memory will be shifted from the data of neural memory column |
CN201610866130.2A Active CN106650923B (en) | 2015-10-08 | 2016-09-29 | Neural network unit with neural memory and neural processing unit and sequencer |
CN201610866030.XA Active CN106599989B (en) | 2015-10-08 | 2016-09-29 | Neural network unit and neural pe array |
CN201610866026.3A Active CN106598545B (en) | 2015-10-08 | 2016-09-29 | Processor and method for communicating shared resources and non-transitory computer usable medium |
CN201610866452.7A Active CN106599992B (en) | 2015-10-08 | 2016-09-29 | The neural network unit operated using processing unit group as time recurrent neural network shot and long term memory cell |
CN201610866129.XA Active CN106599991B (en) | 2015-10-08 | 2016-09-29 | Neural network unit with neural memory and an array of neural processing units that collectively shift a row of data received from the neural memory |
Family Applications Before (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610866127.0A Active CN106599990B (en) | 2015-10-08 | 2016-09-29 | Neural network unit with neural memory and an array of neural processing units that collectively shift a row of data received from the neural memory |
CN201610866130.2A Active CN106650923B (en) | 2015-10-08 | 2016-09-29 | Neural network unit with neural memory, neural processing units, and sequencer |
CN201610866030.XA Active CN106599989B (en) | 2015-10-08 | 2016-09-29 | Neural network unit and neural processing unit array |
CN201610866026.3A Active CN106598545B (en) | 2015-10-08 | 2016-09-29 | Processor and method for communicating shared resources and non-transitory computer usable medium |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610866129.XA Active CN106599991B (en) | 2015-10-08 | 2016-09-29 | Neural network unit with neural memory and an array of neural processing units that collectively shift a row of data received from the neural memory |
Country Status (2)
Country | Link |
---|---|
CN (6) | CN106599990B (en) |
TW (7) | TWI591539B (en) |
Families Citing this family (49)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11615285B2 (en) | 2017-01-06 | 2023-03-28 | Ecole Polytechnique Federale De Lausanne (Epfl) | Generating and identifying functional subnetworks within structural networks |
US10481870B2 (en) | 2017-05-12 | 2019-11-19 | Google Llc | Circuit to perform dual input value absolute value and sum operation |
US10019668B1 (en) * | 2017-05-19 | 2018-07-10 | Google Llc | Scheduling neural network processing |
CN107315710B (en) | 2017-06-27 | 2020-09-11 | 上海兆芯集成电路有限公司 | Method and device for calculating full-precision numerical value and partial-precision numerical value |
CN107291420B (en) | 2017-06-27 | 2020-06-05 | 上海兆芯集成电路有限公司 | Device for integrating arithmetic and logic processing |
TWI680409B (en) * | 2017-07-08 | 2019-12-21 | 英屬開曼群島商意騰科技股份有限公司 | Method for matrix by vector multiplication for use in artificial neural network |
WO2019032870A1 (en) | 2017-08-09 | 2019-02-14 | Google Llc | Accelerating neural networks in hardware using interconnected crossbars |
US10079067B1 (en) * | 2017-09-07 | 2018-09-18 | Winbond Electronics Corp. | Data read method and a non-volatile memory apparatus using the same |
CN109472344A (en) * | 2017-09-08 | 2019-03-15 | 光宝科技股份有限公司 | The design method of neural network system |
US11507806B2 (en) | 2017-09-08 | 2022-11-22 | Rohit Seth | Parallel neural processor for Artificial Intelligence |
CN109697507B (en) * | 2017-10-24 | 2020-12-25 | 安徽寒武纪信息科技有限公司 | Processing method and device |
CN109960673B (en) * | 2017-12-14 | 2020-02-18 | 中科寒武纪科技股份有限公司 | Integrated circuit chip device and related product |
CN108288091B (en) * | 2018-01-19 | 2020-09-11 | 上海兆芯集成电路有限公司 | Microprocessor for booth multiplication |
US20190251429A1 (en) * | 2018-02-12 | 2019-08-15 | Kneron, Inc. | Convolution operation device and method of scaling convolution input for convolution neural network |
CN110197270B (en) * | 2018-02-27 | 2020-10-30 | 上海寒武纪信息科技有限公司 | Integrated circuit chip device and related product |
CN110197268B (en) * | 2018-02-27 | 2020-08-04 | 上海寒武纪信息科技有限公司 | Integrated circuit chip device and related product |
CN110197271B (en) * | 2018-02-27 | 2020-10-27 | 上海寒武纪信息科技有限公司 | Integrated circuit chip device and related product |
TWI664585B (en) * | 2018-03-30 | 2019-07-01 | 國立臺灣大學 | Method of Neural Network Training Using Floating-Point Signed Digit Representation |
US10522226B2 (en) * | 2018-05-01 | 2019-12-31 | Silicon Storage Technology, Inc. | Method and apparatus for high voltage generation for analog neural memory in deep learning artificial neural network |
TWI650769B (en) * | 2018-05-22 | 2019-02-11 | 華邦電子股份有限公司 | Memory device and programming method for memory cell array |
US11893471B2 (en) | 2018-06-11 | 2024-02-06 | Inait Sa | Encoding and decoding information and artificial neural networks |
US11972343B2 (en) | 2018-06-11 | 2024-04-30 | Inait Sa | Encoding and decoding information |
US11663478B2 (en) | 2018-06-11 | 2023-05-30 | Inait Sa | Characterizing activity in a recurrent artificial neural network |
JP2020004247A (en) * | 2018-06-29 | 2020-01-09 | ソニー株式会社 | Information processing apparatus, information processing method, and program |
US10418109B1 (en) | 2018-07-26 | 2019-09-17 | Winbond Electronics Corp. | Memory device and programming method of memory cell array |
CN108984426B (en) * | 2018-08-03 | 2021-01-26 | 北京字节跳动网络技术有限公司 | Method and apparatus for processing data |
US11449756B2 (en) * | 2018-09-24 | 2022-09-20 | Samsung Electronics Co., Ltd. | Method to balance sparsity for efficient inference of deep neural networks |
CN111078286B (en) * | 2018-10-19 | 2023-09-01 | 上海寒武纪信息科技有限公司 | Data communication method, computing system and storage medium |
CN109376853B (en) * | 2018-10-26 | 2021-09-24 | 电子科技大学 | Echo state neural network output axon circuit |
KR20200061164A (en) * | 2018-11-23 | 2020-06-02 | 삼성전자주식회사 | Neural network device for neural network operation, operating method of neural network device and application processor comprising neural network device |
US10867399B2 (en) | 2018-12-02 | 2020-12-15 | Himax Technologies Limited | Image processing circuit for convolutional neural network |
TWI694413B (en) * | 2018-12-12 | 2020-05-21 | 奇景光電股份有限公司 | Image processing circuit |
US11652603B2 (en) | 2019-03-18 | 2023-05-16 | Inait Sa | Homomorphic encryption |
US11569978B2 (en) | 2019-03-18 | 2023-01-31 | Inait Sa | Encrypting and decrypting information |
US20210182655A1 (en) * | 2019-12-11 | 2021-06-17 | Inait Sa | Robust recurrent artificial neural networks |
US11580401B2 (en) | 2019-12-11 | 2023-02-14 | Inait Sa | Distance metrics and clustering in recurrent neural networks |
US11816553B2 (en) | 2019-12-11 | 2023-11-14 | Inait Sa | Output from a recurrent neural network |
US11797827B2 (en) | 2019-12-11 | 2023-10-24 | Inait Sa | Input into a neural network |
US11651210B2 (en) | 2019-12-11 | 2023-05-16 | Inait Sa | Interpreting and improving the processing results of recurrent neural networks |
RU2732201C1 (en) * | 2020-02-17 | 2020-09-14 | Russian Federation, represented by the Advanced Research Foundation | Method for constructing processors for output in convolutional neural networks based on data-flow computing |
TWI722797B (en) | 2020-02-17 | 2021-03-21 | 財團法人工業技術研究院 | Computation operator in memory and operation method thereof |
US11586896B2 (en) | 2020-03-02 | 2023-02-21 | Infineon Technologies LLC | In-memory computing architecture and methods for performing MAC operations |
CN111898752B (en) * | 2020-08-03 | 2024-06-28 | 乐鑫信息科技(上海)股份有限公司 | Apparatus and method for performing LSTM neural network operations |
TWI742802B (en) | 2020-08-18 | 2021-10-11 | 創鑫智慧股份有限公司 | Matrix calculation device and operation method thereof |
TWI746126B (en) | 2020-08-25 | 2021-11-11 | 創鑫智慧股份有限公司 | Matrix multiplication device and operation method thereof |
TWI798798B (en) * | 2020-09-08 | 2023-04-11 | 旺宏電子股份有限公司 | In-memory computing method and in-memory computing apparatus |
TWI775170B (en) * | 2020-09-30 | 2022-08-21 | 新漢股份有限公司 | Method for CPU to execute artificial intelligence related processes |
US11657864B1 (en) * | 2021-12-17 | 2023-05-23 | Winbond Electronics Corp. | In-memory computing apparatus and computing method having a memory array that includes shifted weight storage, shift information storage, and a shift restoration circuit to restore the weight-shifted amount of shifted sums-of-products to generate multiple restored sums-of-products |
TWI830669B (en) * | 2023-02-22 | 2024-01-21 | 旺宏電子股份有限公司 | Encoding method and encoding circuit |
Family Cites Families (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE69132495T2 (en) * | 1990-03-16 | 2001-06-13 | Texas Instruments Inc., Dallas | Distributed processing memory |
TW279231B (en) * | 1995-04-18 | 1996-06-21 | Nat Science Council | This invention is related to a new neural network for prediction |
DE19625569A1 (en) * | 1996-06-26 | 1998-01-02 | Philips Patentverwaltung | Signal processor |
TW337568B (en) * | 1996-10-11 | 1998-08-01 | Apex Semiconductor Inc | Pseudo cache DRAM controller with packet command protocol |
US6216119B1 (en) * | 1997-11-19 | 2001-04-10 | Netuitive, Inc. | Multi-kernel neural network concurrent learning, monitoring, and forecasting system |
US6557096B1 (en) * | 1999-10-25 | 2003-04-29 | Intel Corporation | Processors with data typer and aligner selectively coupling data bits of data buses to adder and multiplier functional blocks to execute instructions with flexible data types |
US8660939B2 (en) * | 2000-05-17 | 2014-02-25 | Timothy D. Allen | Method for mortgage customer retention |
US6581131B2 (en) * | 2001-01-09 | 2003-06-17 | Hewlett-Packard Development Company, L.P. | Method and apparatus for efficient cache mapping of compressed VLIW instructions |
US6782375B2 (en) * | 2001-01-16 | 2004-08-24 | Providian Bancorp Services | Neural network based decision processor and method |
US7146486B1 (en) * | 2003-01-29 | 2006-12-05 | S3 Graphics Co., Ltd. | SIMD processor with scalar arithmetic logic units |
US7689641B2 (en) * | 2003-06-30 | 2010-03-30 | Intel Corporation | SIMD integer multiply high with round and shift |
US7421565B1 (en) * | 2003-08-18 | 2008-09-02 | Cray Inc. | Method and apparatus for indirectly addressed vector load-add-store across multi-processors |
CN1306395C (en) * | 2004-02-13 | 2007-03-21 | 中国科学院计算技术研究所 | Processor extended instruction of MIPS instruction set, encoding method and component thereof |
CN1658153B (en) * | 2004-02-18 | 2010-04-28 | 联发科技股份有限公司 | Compound dynamic preset number representation and algorithm, and its processor structure |
JP2006004042A (en) * | 2004-06-16 | 2006-01-05 | Renesas Technology Corp | Data processor |
CN100383781C (en) * | 2004-11-26 | 2008-04-23 | 北京天碁科技有限公司 | Cholesky decomposition algorithm device |
US7743233B2 (en) * | 2005-04-05 | 2010-06-22 | Intel Corporation | Sequencer address management |
US7512573B2 (en) * | 2006-10-16 | 2009-03-31 | Alcatel-Lucent Usa Inc. | Optical processor for an artificial neural network |
US8145887B2 (en) * | 2007-06-15 | 2012-03-27 | International Business Machines Corporation | Enhanced load lookahead prefetch in single threaded mode for a simultaneous multithreaded microprocessor |
TW200923803A (en) * | 2007-11-26 | 2009-06-01 | Univ Nat Taipei Technology | Hardware neural network learning and recall architecture |
CN101625735A (en) * | 2009-08-13 | 2010-01-13 | 西安理工大学 | FPGA implementation method of a recurrent neural network based on LS-SVM classification and regression learning |
US8380138B2 (en) * | 2009-10-21 | 2013-02-19 | Qualcomm Incorporated | Duty cycle correction circuitry |
US20120066163A1 (en) * | 2010-09-13 | 2012-03-15 | Nottingham Trent University | Time to event data analysis method and system |
EP2508980B1 (en) * | 2011-04-07 | 2018-02-28 | VIA Technologies, Inc. | Conditional ALU instruction pre-shift-generated carry flag propagation between microinstructions in read-port limited register file microprocessor |
US8880851B2 (en) * | 2011-04-07 | 2014-11-04 | Via Technologies, Inc. | Microprocessor that performs X86 ISA and ARM ISA machine language program instructions by hardware translation into microinstructions executed by common execution pipeline |
CN102402415B (en) * | 2011-10-21 | 2013-07-17 | 清华大学 | Device and method for buffering data in dynamic reconfigurable array |
US9251116B2 (en) * | 2011-11-30 | 2016-02-02 | International Business Machines Corporation | Direct interthread communication dataport pack/unpack and load/save |
CN104115115B (en) * | 2011-12-19 | 2017-06-13 | 英特尔公司 | For the SIMD multiplication of integers accumulated instructions of multiple precision arithmetic |
US8669890B2 (en) * | 2012-01-20 | 2014-03-11 | Mediatek Inc. | Method and apparatus of estimating/calibrating TDC mismatch |
TWI602181B (en) * | 2012-02-29 | 2017-10-11 | 三星電子股份有限公司 | Memory system and method for operating test device to transmit fail address to memory device |
US9483263B2 (en) * | 2013-03-26 | 2016-11-01 | Via Technologies, Inc. | Uncore microcode ROM |
US9792121B2 (en) * | 2013-05-21 | 2017-10-17 | Via Technologies, Inc. | Microprocessor that fuses if-then instructions |
CN104216866B (en) * | 2013-05-31 | 2018-01-23 | 深圳市海思半导体有限公司 | A data processing device |
EP2843550B1 (en) * | 2013-08-28 | 2018-09-12 | VIA Technologies, Inc. | Dynamic reconfiguration of multi-core processor |
US9286268B2 (en) * | 2013-12-12 | 2016-03-15 | Brno University of Technology | Method and an apparatus for fast convolution of signals with a one-sided exponential function |
2016
- 2016-09-29 CN CN201610866127.0A patent/CN106599990B/en active Active
- 2016-09-29 CN CN201610866130.2A patent/CN106650923B/en active Active
- 2016-09-29 CN CN201610866030.XA patent/CN106599989B/en active Active
- 2016-09-29 CN CN201610866026.3A patent/CN106598545B/en active Active
- 2016-09-29 CN CN201610866452.7A patent/CN106599992B/en active Active
- 2016-09-29 CN CN201610866129.XA patent/CN106599991B/en active Active
- 2016-10-04 TW TW105132063A patent/TWI591539B/en active
- 2016-10-04 TW TW105132059A patent/TWI650707B/en active
- 2016-10-04 TW TW105132062A patent/TWI601062B/en active
- 2016-10-04 TW TW105132058A patent/TWI608429B/en active
- 2016-10-04 TW TW105132061A patent/TWI626587B/en active
- 2016-10-04 TW TW105132065A patent/TWI579694B/en active
- 2016-10-04 TW TW105132064A patent/TWI616825B/en active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5138695A (en) * | 1989-10-10 | 1992-08-11 | Hnc, Inc. | Systolic array image processing system |
US5563982A (en) * | 1991-01-31 | 1996-10-08 | Ail Systems, Inc. | Apparatus and method for detection of molecular vapors in an atmospheric region |
US5956703A (en) * | 1995-07-28 | 1999-09-21 | Delco Electronics Corporation | Configurable neural network integrated circuit |
CN102665049A (en) * | 2012-03-29 | 2012-09-12 | 中国科学院半导体研究所 | Programmable visual chip-based visual image processing system |
CN103019656A (en) * | 2012-12-04 | 2013-04-03 | 中国科学院半导体研究所 | Dynamically reconfigurable multi-stage parallel single instruction multiple data array processing system |
CN104952448A (en) * | 2015-05-04 | 2015-09-30 | 张爱英 | Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks |
Non-Patent Citations (2)
Title |
---|
Robert W. Means and Layne Lisenbee, "Extensible Linear Floating Point SIMD Neurocomputer Array Processor," IEEE, 2002-08-06, entire document |
H. Esmaeilzadeh et al., "Neural Acceleration for General-Purpose Approximate Programs," 2012 IEEE/ACM 45th Annual International Symposium on Microarchitecture, 2013-04-04, entire document |
Also Published As
Publication number | Publication date |
---|---|
CN106598545A (en) | 2017-04-26 |
TWI579694B (en) | 2017-04-21 |
TWI608429B (en) | 2017-12-11 |
CN106599992A (en) | 2017-04-26 |
CN106599990A (en) | 2017-04-26 |
TW201714120A (en) | 2017-04-16 |
TWI626587B (en) | 2018-06-11 |
CN106599991B (en) | 2019-04-09 |
TWI650707B (en) | 2019-02-11 |
CN106598545B (en) | 2020-04-14 |
TWI616825B (en) | 2018-03-01 |
TW201714078A (en) | 2017-04-16 |
CN106650923B (en) | 2019-04-09 |
TWI601062B (en) | 2017-10-01 |
TW201714091A (en) | 2017-04-16 |
TW201714080A (en) | 2017-04-16 |
CN106599991A (en) | 2017-04-26 |
CN106599989A (en) | 2017-04-26 |
CN106599989B (en) | 2019-04-09 |
TW201714081A (en) | 2017-04-16 |
TW201714119A (en) | 2017-04-16 |
CN106650923A (en) | 2017-05-10 |
CN106599990B (en) | 2019-04-09 |
TWI591539B (en) | 2017-07-11 |
TW201714079A (en) | 2017-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106599992B (en) | The neural network unit operated using processing unit group as time recurrent neural network shot and long term memory cell | |
CN106528047B (en) | A processor, neural network unit, and operation method thereof | |
CN107844830A (en) | Neural network unit with mixed data size and weight size computing capability | |
CN108268932A (en) | Neural network unit | |
CN108268945A (en) | Neural network unit having a rotator with array-width sectioning | |
CN108268944A (en) | Neural network unit with reshapable memory | |
CN108268946A (en) | Neural network unit having a rotator with array-width sectioning | |
CN108133263A (en) | Neural network unit | |
CN108133262A (en) | Neural network unit with memory layout for performing efficient 3-dimensional convolutions | |
CN108133264A (en) | Neural network unit that performs efficient 3-dimensional convolutions | |
CN108564169A (en) | Hardware processing element, neural network unit and computer usable medium | |
CN108804139A (en) | Programmable device and its operating method and computer usable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder | ||
Address after: Room 301, 2537 Jinke Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201203
Patentee after: Shanghai Zhaoxin Semiconductor Co., Ltd.
Address before: Room 301, 2537 Jinke Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201203
Patentee before: VIA ALLIANCE SEMICONDUCTOR Co., Ltd.