CN107861757B - Arithmetic device and related product - Google Patents
- Publication number
- CN107861757B (granted publication); application CN201711244055.7A / CN201711244055A
- Authority
- CN
- China
- Prior art keywords
- instruction
- vector
- calculation
- operation instruction
- arithmetic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30181—Instruction operation extension or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
Abstract
The present invention provides an arithmetic device for executing operations according to an extended instruction. The arithmetic device includes a memory, an arithmetic unit, and a control unit. The extended instruction includes an opcode and an operation field; the memory stores vectors. The control unit is configured to acquire an extended instruction, parse it to obtain a vector operation instruction and a second operation instruction, determine the calculation order of the two instructions, and read the input vector corresponding to the input vector address from the memory. The arithmetic unit is configured to execute the vector operation instruction and the second operation instruction on the input vector in the determined calculation order to obtain the result of the extended instruction. The technical scheme provided by the invention offers low power consumption and low calculation overhead.
Description
Technical Field
The present invention relates to the field of communications technologies, and in particular, to an arithmetic device and a related product.
Background
In modern general-purpose and special-purpose processors, computational instructions (e.g., vector instructions) are increasingly introduced to perform operations. A vector instruction causes the processor to perform a vector or matrix operation, such as vector addition and subtraction, vector inner product, matrix multiplication, or matrix convolution; at least one input of a vector instruction is a vector or matrix, or its result is a vector or matrix. Vector instructions can compute in parallel by invoking the processor's vector processing units, improving operation speed. In existing vector instructions, the vectors or matrices among the operands or results are generally of fixed scale; for example, a vector instruction in the Neon vector extension of an ARM processor can process at a time either a vector of four 32-bit floating-point values or a vector of eight 16-bit fixed-point values.
Conventional vector operation instructions therefore cannot operate on variable-scale vectors or matrices, and each instruction can realize only a single operation: for example, one vector instruction can perform either a multiplication or an addition, but not two or more operations. As a result, conventional vector operation incurs high operation overhead and high energy consumption.
Disclosure of Invention
The embodiments of the present invention provide an arithmetic device and related products that enable a single operation instruction to realize multiple operations, reducing operation overhead and module power consumption.
In a first aspect, an embodiment of the present invention provides an arithmetic device for performing an operation according to an extended instruction, the arithmetic device comprising: a memory, an arithmetic unit, and a control unit;
the extended instruction includes an opcode and an operation field. The opcode includes an identifier that identifies a vector calculation instruction; the operation field includes: the input vector address of the vector calculation instruction, the output vector address of the vector calculation instruction, an identifier of the second calculation instruction, the input data of the second calculation instruction, the data type, and the data length N;
a memory for storing vectors;
the control unit is configured to acquire the extended instruction, parse it to obtain a vector operation instruction and a second operation instruction, determine the calculation order of the vector operation instruction and the second operation instruction, and read the input vector corresponding to the input vector address from the memory;
and the operation unit is configured to execute the vector operation instruction and the second operation instruction on the input vector in the calculation order to obtain the result of the extended instruction.
Optionally, the operation device further includes:
and the register unit is used for storing the extended instruction to be executed.
Optionally, the control unit includes:
the instruction fetching module is used for acquiring an extended instruction from the register unit;
the decoding module is used for decoding the obtained extended instruction to obtain a vector operation instruction, a second operation instruction and a calculation sequence;
and the instruction queue is used for storing the decoded vector operation instruction and the second operation instruction according to the calculation sequence.
Optionally, the operation device further includes:
the dependency relationship processing unit is configured to determine, before the control unit acquires an extended instruction, whether that extended instruction accesses the same vector as the previous extended instruction; if so, it provides the vector operation instruction and the second operation instruction of the current extended instruction to the operation unit only after the previous extended instruction has finished executing; otherwise, it provides them to the operation unit immediately.
Optionally, if the current extended instruction and the previous extended instruction access the same vector, the dependency processing unit stores the current extended instruction in a storage queue, and after the previous extended instruction is executed, provides the current extended instruction in the storage queue to the control unit.
Optionally, the memory is a scratch pad memory.
Optionally, the operation unit includes a vector addition circuit, a vector multiplication circuit, a size comparison circuit, a nonlinear operation circuit, a vector scalar multiplication circuit, and an activation circuit.
Optionally, the operation unit has a multi-stage pipeline structure, wherein the vector multiplication circuit and the vector scalar multiplication circuit are in the first pipeline stage, the size comparison circuit and the vector addition circuit are in the second pipeline stage, and the nonlinear operation circuit and the activation circuit are in the third pipeline stage; the output data of the first pipeline stage is the input data of the second pipeline stage, and the output data of the second pipeline stage is the input data of the third pipeline stage.
Optionally, the control unit is specifically configured to: identify whether the output data of the vector operation instruction is the same as the input data of the second calculation instruction, and if so, determine the calculation order to be forward-order calculation; identify whether the input data of the vector operation instruction is the same as the output data of the second calculation instruction, and if so, determine the calculation order to be reverse-order calculation; and identify whether the input data of the vector operation instruction is associated with the output data of the second calculation instruction, and if not, determine the calculation order to be out-of-order calculation.
In a second aspect, a chip is provided, where the chip integrates the arithmetic device provided in the first aspect.
In a third aspect, an electronic device is provided, which includes the chip provided in the second aspect.
It can be seen that the extended instruction provided by the embodiments of the present invention strengthens the functionality of a single instruction, replacing several original instructions with one. This reduces the number of instructions needed for complex vector and matrix operations and simplifies the use of vector instructions; compared with using multiple instructions, it avoids storing intermediate results, saving storage space and avoiding additional read/write overhead.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description are some embodiments of the present invention.
Fig. 1A is a schematic structural diagram of an arithmetic device provided in the present invention.
FIG. 1 is a flow chart of a method of implementing an extended instruction of the present invention.
FIG. 2 is a schematic diagram of a structure of an arithmetic unit according to the present invention.
Fig. 3A is a schematic structural diagram of a control unit provided in the present invention.
Fig. 3 is a schematic structural diagram of a neural network processor board card according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a neural network chip package structure according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a neural network chip according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of a neural network chip package structure according to an embodiment of the present disclosure;
fig. 6A is a schematic diagram of another neural network chip package structure according to an embodiment of the present application.
The dashed components in the drawings represent optional.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of the invention and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. The "/" herein may mean "or".
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The following describes a vector dot-product method, taking a CPU as an example. The vector dot product computes the inner product of two vectors. Function description: given vectors x, y of length n and a scalar r, the following vector-vector operation is performed:

r := x1*y1 + x2*y2 + ... + xn*yn
for vector DOT product, the instruction of the vector DOT product may be "DOT TYPE, N, X, Y, R"; where DOT represents a vector DOT product instruction, type represents the type of data that can be manipulated (e.g., real or complex), N represents the length of the vector, X represents the first address of vector X, Y represents the first address of vector Y, and R is a scalar. As described above, the vector dot product instruction can only implement one type of operation, i.e., the operation of implementing the vector dot product, and cannot implement multiple operations, e.g., the two operations of reading the vector dot product and the discrete data.
As shown in fig. 1A, the arithmetic device includes: a memory 111, a register 112 (optional), an arithmetic unit 114, a control unit 115, and a dependency processing unit 116 (optional);
as shown in FIG. 2, the operation unit 114 includes: conversion circuitry (optional), vector addition circuitry, vector multiplication circuitry, size comparison circuitry, non-linear operation circuitry, vector scalar multiplication circuitry and activation circuitry.
The arithmetic unit has a multi-stage pipeline structure, as shown in fig. 2. The first pipeline stage includes, but is not limited to, vector multiplication circuits, vector scalar multiplication circuits, and the like.
The second pipeline stage includes, but is not limited to: magnitude comparison calculators (e.g., comparators), vector addition circuits, and the like.
The third pipeline stage includes, but is not limited to, nonlinear operation components (specifically, an activation circuit or a transcendental-function calculation circuit, etc.), and the like.
If the arithmetic unit includes a conversion circuit, the conversion circuit may be placed in the first or the third pipeline stage.
The output data of the first pipeline stage is the input data of the second pipeline stage, and the output data of the second pipeline stage is the input data of the third pipeline stage. The input of the first pipeline stage may be input data (e.g., an input vector), and the output of the third pipeline stage may be the calculation result.
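The three-stage dataflow above can be sketched as follows. The stage functions and the choice of ReLU as the nonlinear operation are illustrative assumptions; the patent's circuits operate on hardware vectors, not Python lists.

```python
# Illustrative sketch of the three-stage pipeline: multiplication feeds
# addition/comparison, which feeds the nonlinear/activation stage.

def stage1_multiply(x, y):
    # first pipeline stage: elementwise vector multiplication
    return [a * b for a, b in zip(x, y)]

def stage2_add(v, w):
    # second pipeline stage: vector addition
    return [a + b for a, b in zip(v, w)]

def stage3_activate(v):
    # third pipeline stage: a nonlinear operation (ReLU as a stand-in)
    return [max(0.0, a) for a in v]

# Each stage's output is the next stage's input, as in the text.
x, y, bias = [1.0, -2.0], [3.0, 4.0], [0.5, 0.5]
result = stage3_activate(stage2_add(stage1_multiply(x, y), bias))
print(result)  # [3.5, 0.0]
```

This composition is why a single extended instruction such as "multiply, add, then activate" maps naturally onto the hardware without writing intermediate results back to memory.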
The present invention also provides an extended instruction including an opcode and an operation field. The opcode includes an identifier (e.g., ROT) that identifies the first operation instruction; the operation field includes: the input data address of the first calculation instruction, the output data address of the first calculation instruction, an identifier of the second calculation instruction, the input data of the second calculation instruction, the data type, and the data length N.
Optionally, the extended instruction may further include: a third computation instruction and input data of the third computation instruction.
It should be noted that the calculation instruction may be a vector operation instruction or a matrix instruction, and the embodiments of the present invention do not limit the specific expression of the calculation instruction.
The arithmetic device may be configured to execute an extended instruction, and specifically includes:
a memory for storing vectors;
the control unit is configured to acquire the extended instruction, parse it to obtain a vector operation instruction and a second operation instruction, determine the calculation order of the vector operation instruction and the second operation instruction, and read the input vector corresponding to the input vector address from the memory;
the operation unit is configured to execute the vector operation instruction and the second operation instruction on the input vector in the calculation order to obtain the result of the extended instruction.
And the register unit is used for storing the extended instruction to be executed.
Optionally, as shown in fig. 3A, the control unit 115 may include:
the instruction fetching module is used for acquiring an extended instruction from the register unit;
the decoding module is used for decoding the obtained extended instruction to obtain a vector operation instruction, a second operation instruction and a calculation sequence;
and the instruction queue is used for storing the decoded vector operation instruction and the second operation instruction according to the calculation sequence.
The dependency relationship processing unit 116 is configured to determine, before the control unit acquires an extended instruction, whether that extended instruction accesses the same vector as the previous extended instruction. If so, it waits for the previous extended instruction to finish executing before providing the vector operation instruction and the second operation instruction of the current extended instruction to the operation unit; otherwise, it provides them to the operation unit immediately.
The dependency processing unit 116 is further configured to store the current extended instruction in a storage queue when the current extended instruction and a previous extended instruction access the same vector, and provide the current extended instruction in the storage queue to the control unit after the previous extended instruction is executed.
Optionally, the memory is a scratch pad memory.
Referring to fig. 1, fig. 1 provides an implementation method of an extended instruction. The extended instruction in this method may include an opcode and an operation field; the first arithmetic instruction may be a vector arithmetic instruction, such as AXPY. The opcode includes an identifier (e.g., AXPY) identifying the first arithmetic instruction; the operation field includes: the input data address of the first calculation instruction, the output data address of the first calculation instruction, an identifier of the second calculation instruction, the input data of the second calculation instruction, the data type, and the data length N (a user-set value; the invention does not limit the specific form of N). The method is executed by an arithmetic device or a computing chip, the arithmetic device being shown in fig. 1A. The method, shown in fig. 1, includes the following steps:
Step S101: the arithmetic device acquires an extended instruction and parses it to obtain a first calculation instruction and a second calculation instruction;
Step S102: the arithmetic device determines a calculation order according to the first calculation instruction and the second calculation instruction, and executes the first calculation instruction and the second calculation instruction in that order to obtain the result of the extended instruction.
The technical scheme provided by the invention provides an implementation method of the extended instruction, so that an arithmetic device can execute the calculation of two calculation instructions on the extended instruction, and a single extended instruction can realize two types of calculation, thereby reducing the calculation overhead and reducing the power consumption.
Optionally, the calculation order may be out-of-order, forward-order, or reverse-order calculation. Out-of-order calculation means the first calculation instruction and the second calculation instruction have no required execution order; forward-order calculation executes the first calculation instruction and then the second; reverse-order calculation executes the second calculation instruction and then the first.
The arithmetic device may determine the calculation order as follows: it identifies whether the output data of the first calculation instruction is the same as the input data of the second calculation instruction; if so, the order is forward-order calculation. Otherwise, it identifies whether the input data of the first calculation instruction is the same as the output data of the second calculation instruction; if so, the order is reverse-order calculation. Finally, if the input data of the first calculation instruction is not associated with the output data of the second calculation instruction, the order is out-of-order calculation.
Specifically, for example, for F = A × B + C, the first calculation instruction is a matrix multiplication instruction and the second calculation instruction is a matrix addition instruction. Since the matrix addition requires the result (i.e., the output data) of the first calculation instruction, the calculation order is determined to be forward-order. In another example, if the first operation instruction is a matrix multiplication instruction and the second is a transformation such as transposition or conjugation, the first operation instruction uses the output of the second, so the order is reverse. If no such association exists, that is, the output data of the first calculation instruction differs from the input data of the second, and the input data of the first differs from the output data of the second, then there is no dependency and the order is out-of-order.
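The order-determination rule above can be sketched as follows. Representing each instruction by named input and output operands is an illustrative assumption; the device compares addresses rather than names.

```python
# Minimal sketch of the calculation-order rule: forward if the second
# instruction consumes the first's output, reverse if the first consumes
# the second's output, out-of-order if there is no data dependency.

def calc_order(first, second):
    """first/second: dicts with 'inputs' (a set of names) and 'output'."""
    if first["output"] in second["inputs"]:
        return "forward"       # second consumes the first's result
    if second["output"] in first["inputs"]:
        return "reverse"       # first consumes the second's result
    return "out-of-order"      # no dependency in either direction

# F = A*B + C: the multiply's result T feeds the add -> forward order.
mul = {"inputs": {"A", "B"}, "output": "T"}
add = {"inputs": {"T", "C"}, "output": "F"}
print(calc_order(mul, add))  # forward
```

Running the same check with a transpose instruction whose output feeds the multiply would return "reverse", matching the second example in the text.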
The vector instruction provided by the invention is expanded, the function of the instruction is strengthened, and one instruction replaces a plurality of original instructions. Therefore, the number of instructions required by complex vector and matrix operation is reduced, and the use of vector instructions is simplified; compared with a plurality of instructions, the method does not need to store intermediate results, saves storage space, and avoids additional reading and writing expenses.
If the first calculation instruction is a vector instruction, for an input vector or matrix in the vector instruction, the instruction adds a function of scaling the input vector or matrix, i.e. adds an operand representing a scaling coefficient in the operation domain, and when the vector is read in, the vector is firstly scaled according to the scaling coefficient (i.e. the second calculation instruction is a scaling instruction). If there are operations in a vector instruction that multiply multiple input vectors or matrices, the scaling coefficients corresponding to these input vectors or matrices may be combined into one.
If the first calculation instruction is a vector instruction, the instruction adds a function of transposing an input matrix in the vector instruction (i.e., the second calculation instruction is a transpose instruction). An operand indicating whether to transpose the matrix is added to the instruction, and the operand indicates whether to transpose the matrix before operation.
If the first calculation instruction is a vector instruction, for the output vector or matrix the instruction adds a function of accumulating onto the original output vector or matrix (i.e., the second calculation instruction is an addition instruction). A coefficient for scaling the original output vector or matrix is also added to the instruction (i.e., a third calculation instruction is added, which may be a scaling instruction), indicating that after the vector or matrix operation is performed, the result is added to the scaled original output to form the new output.
If the first computing instruction is a vector instruction, the instruction adds a function to read in fixed steps to the input vector in the vector instruction. An operand representing the read step size of the input vector is added to the instruction (i.e. the second calculation instruction is a read vector in a fixed step size), representing the difference between the addresses of two adjacent elements in the vector.
If the first computing instruction is a vector instruction, the instruction adds a function of writing the result in fixed steps to the result vector in the vector instruction (i.e., the second computing instruction writes the vector in fixed steps). An operand representing the read step size of the result vector is added to the instruction, representing the difference between the addresses of two adjacent elements in the vector. If a vector is both an input and a result, the same step size is used for the vector as an input and as a result.
If the first computing instruction is a vector instruction, the instruction adds the function of reading row or column vectors in fixed steps to the input matrix in the vector instruction (i.e., the second computing instruction is reading multiple vectors in fixed steps). An operand representing a matrix read step size is added to the instruction, representing the difference in the first address between matrix row or column vectors.
If the first calculation instruction is a vector instruction, for the result matrix the instruction adds a function of writing row or column vectors at a fixed stride (i.e., the second calculation instruction writes multiple vectors at a fixed stride). An operand representing the matrix write stride is added to the instruction, representing the difference between the first addresses of the matrix's row or column vectors. If a matrix is both an input and a result matrix, the same stride is used for it as input and as result.
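The fixed-stride read and write extensions above can be sketched as follows: the stride operand gives the address difference between adjacent elements (or between row/column first addresses for matrices). The function names and flat-memory model are illustrative assumptions.

```python
# Sketch of fixed-stride vector access: gather n elements starting at
# addr, stride apart, and scatter a result vector back the same way.

def read_strided(memory, addr, n, stride):
    # e.g. stride == row length reads one column of a row-major matrix
    return [memory[addr + i * stride] for i in range(n)]

def write_strided(memory, addr, stride, values):
    # write the result vector with the same stride convention
    for i, v in enumerate(values):
        memory[addr + i * stride] = v

mem = list(range(10))
col = read_strided(mem, 1, 3, 3)    # elements at addresses 1, 4, 7
print(col)                          # [1, 4, 7]
write_strided(mem, 0, 2, [9, 9, 9]) # writes to addresses 0, 2, 4
print(mem[:6])                      # [9, 1, 9, 3, 9, 5]
```

Folding this addressing into the extended instruction is what avoids a separate gather/scatter pass and the intermediate buffer it would require.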
The concrete structure of the extended instructions described above is illustrated below with several actual extended instructions.
Vector multiply add
Calculating the product of a vector and a scalar and adding the result to another vector
Description of the function:
Given vectors x and y and a scalar a, the instruction performs the following vector-vector operation:
y := a*x + y
The instruction format is shown in Table 1-1:
Table 1-1:

The length of the vectors in the instruction format shown in Table 1-1 is variable, which reduces the number of instructions and simplifies their use.
Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
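The vector multiply-add operation y := a*x + y can be sketched as a reference model (illustrative only; the hardware executes it as a single extended instruction):

```python
def vector_multiply_add(a, x, y):
    """y := a*x + y, computed elementwise over vectors of equal length."""
    return [a * xi + yi for xi, yi in zip(x, y)]

vector_multiply_add(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# → [12.0, 24.0, 36.0]
```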
Vector dot product
Computing a dot product of a vector and a vector
Description of the function: given vectors x and y of fixed length n, the instruction computes the scalar r by the following vector-vector operation:
r := x[0]*y[0] + x[1]*y[1] + ... + x[n-1]*y[n-1]
The instruction format is shown in Table 1-2:
Table 1-2:

The variable vector length in the instruction format shown in Table 1-2 reduces the number of instructions and simplifies their use.
Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
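A reference model of the dot-product reduction above (illustrative only; function name is hypothetical):

```python
def vector_dot(x, y):
    """r := x[0]*y[0] + x[1]*y[1] + ... + x[n-1]*y[n-1]"""
    return sum(xi * yi for xi, yi in zip(x, y))

vector_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # → 32.0
```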
Norm of vector
Computing Euclidean norms of vectors
Description of the function:
The instruction performs the following vector reduction operation:
r := sqrt(x[0]^2 + x[1]^2 + ... + x[n-1]^2)
The instruction format is shown in Table 1-3:
Table 1-3:

The variable vector length in the instruction format shown in Table 1-3 reduces the number of instructions and simplifies their use. Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
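The Euclidean-norm reduction can be sketched as follows (illustrative reference model, not the hardware implementation):

```python
import math

def vector_norm(x):
    """Euclidean (L2) norm: the square root of the sum of squared elements."""
    return math.sqrt(sum(xi * xi for xi in x))

vector_norm([3.0, 4.0])  # → 5.0
```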
Vector addition
Computing the sum of the addition of all elements of a vector
Description of the function:
The instruction performs the following vector reduction operation:
r := x[0] + x[1] + ... + x[n-1]
The instruction format is shown in Table 1-4:
Table 1-4:

The variable vector length in the instruction format shown in Table 1-4 reduces the number of instructions and simplifies their use.
Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
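The vector-addition reduction above sums all elements into one scalar; a minimal reference model (illustrative only):

```python
def vector_sum(x):
    """r := x[0] + x[1] + ... + x[n-1], a reduction over the whole vector."""
    r = 0.0
    for xi in x:
        r += xi
    return r

vector_sum([1.0, 2.0, 3.0, 4.0])  # → 10.0
```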
Maximum value of vector
Calculating the position of the largest element among all elements of the vector
Description of the function:
For a vector x of length n, the instruction writes the position (index) of the largest element of the vector into the scalar i.
The instruction format is shown in Table 1-5:
Table 1-5:

The variable vector length in the instruction format shown in Table 1-5 reduces the number of instructions and simplifies their use. Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
Vector minimum
Calculating the position of the smallest element among all elements of the vector
Description of the function:
For a vector x of length n, the instruction writes the position (index) of the smallest element of the vector into the scalar i.
The instruction format is shown in Table 1-6:
Table 1-6:

The variable vector length in the instruction format shown in Table 1-6 reduces the number of instructions and simplifies their use. Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
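The vector-maximum and vector-minimum instructions above both return an element position rather than a value. Their behavior can be sketched together as follows (an illustrative reference model; tie-breaking to the first occurrence is an assumption, since the text does not specify it):

```python
def vector_argmax(x):
    """Index of the largest element of x (first occurrence on ties)."""
    i_best = 0
    for i in range(1, len(x)):
        if x[i] > x[i_best]:
            i_best = i
    return i_best

def vector_argmin(x):
    """Index of the smallest element of x (first occurrence on ties)."""
    i_best = 0
    for i in range(1, len(x)):
        if x[i] < x[i_best]:
            i_best = i
    return i_best

vector_argmax([3.0, 7.0, 5.0])  # → 1
vector_argmin([3.0, 7.0, 5.0])  # → 0
```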
Vector outer product
Computing the tensor product of two vectors (outer product)
Description of the function:
the instruction performs the following matrix vector operations
A := α * x * y^T + A
The instruction format is shown in Table 1-7:
Table 1-7:

In the instruction format shown in Table 1-7, scaling the result matrix by the scalar α increases the flexibility of the instruction and avoids the extra overhead of a separate scaling instruction. The sizes of the vectors and the matrix are variable, which reduces the number of instructions and simplifies their use. Matrices in different storage formats (row-major and column-major order) can be processed, avoiding the overhead of converting between them. Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
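The vector outer-product instruction performs a scaled rank-1 update, A := α * x * yᵀ + A; a minimal reference model (illustrative only; the function name is hypothetical, and the hardware may operate on strided row-major or column-major storage rather than nested lists):

```python
def outer_product_update(alpha, x, y, a):
    """A := alpha * x * y^T + A, a rank-1 update on an m-by-n matrix."""
    m, n = len(x), len(y)
    return [[alpha * x[i] * y[j] + a[i][j] for j in range(n)]
            for i in range(m)]

a = [[0.0, 0.0], [1.0, 1.0]]
outer_product_update(2.0, [1.0, 2.0], [3.0, 4.0], a)
# → [[6.0, 8.0], [13.0, 17.0]]
```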
The arithmetic device shown in fig. 1 implements a specific structure for computing the extended instruction when it is executed; that is, it achieves the effect of a combination of multiple calculation instructions by executing a single extended instruction. It should be noted that the arithmetic device does not split the extended instruction into multiple calculation instructions when executing it.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a neural network processor board card according to an embodiment of the present disclosure. As shown in fig. 3, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate (substrate) 13.
The specific structure of the neural network chip package structure 11 is not limited in this application, and optionally, as shown in fig. 4, the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.
The specific form of the neural network chip 111 is not limited in the present application; the neural network chip 111 includes, but is not limited to, a neural network wafer integrating a neural network processor, and the wafer may be made of silicon, germanium, quantum, or molecular materials. The neural network chip can be packaged according to practical conditions (such as a harsh operating environment) and different application requirements, so that most of the chip is encapsulated while the pins on the chip are connected to the outside of the package structure through conductors such as gold wires for circuit connection with the next outer layer.
The specific structure of the neural network chip 111 is not limited in the present application, and optionally, please refer to fig. 1A, where fig. 1A is a schematic structural diagram of a computing device in the neural network chip according to an embodiment of the present application. As shown in fig. 1A, the computing apparatus includes: memory 111, register 112 (optional), arithmetic unit 114, control unit 115, and dependency processing unit 116 (optional). The specific functions or structures of the above units can be seen from the embodiment shown in fig. 1A.
The type of the first substrate 13 and the second substrate 113 is not limited in this application, and may be a Printed Circuit Board (PCB) or a Printed Wiring Board (PWB), and may be other circuit boards. The material of the PCB is not limited.
The second substrate 113 according to the present invention is used for carrying the neural network chip 111, and the neural network chip package structure 11 obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112 is used for protecting the neural network chip 111 and facilitating further packaging of the neural network chip package structure 11 and the first substrate 13.
The specific packaging method and the corresponding structure of the second electrical and non-electrical connection device 112 are not limited; an appropriate packaging method can be selected, and simply adapted, according to actual conditions and different application requirements, for example: Flip Chip Ball Grid Array (FCBGA) packages, Low-profile Quad Flat Packages (LQFP), Quad Flat Packages with Heat sink (HQFP), Quad Flat No-lead Packages (QFN), or Fine-pitch Ball Grid Array (FBGA) packages.
The Flip Chip (Flip Chip) is suitable for the conditions of high requirements on the area after packaging or sensitivity to the inductance of a lead and the transmission time of a signal. In addition, a Wire Bonding (Wire Bonding) packaging mode can be used, so that the cost is reduced, and the flexibility of a packaging structure is improved.
A Ball Grid Array (BGA) can provide more pins, with a short average lead length and the capability of transmitting signals at high speed. The package may alternatively be a Pin Grid Array (PGA), Zero Insertion Force (ZIF) socket, Single Edge Contact Connection (SECC), Land Grid Array (LGA), or the like.
Optionally, the neural network Chip 111 and the second substrate 113 are packaged in a Flip Chip Ball Grid Array (Flip Chip Ball Grid Array) packaging manner, and a schematic diagram of a specific neural network Chip packaging structure may refer to fig. 6. As shown in fig. 6, the neural network chip package structure includes: the neural network chip 21, the bonding pad 22, the solder ball 23, the second substrate 24, the connection point 25 on the second substrate 24, and the pin 26.
The bonding pads 22 are connected to the neural network chip 21, and the solder balls 23 are formed between the bonding pads 22 and the connection points 25 on the second substrate 24 by soldering, so that the neural network chip 21 and the second substrate 24 are connected, that is, the package of the neural network chip 21 is realized.
The pins 26 are used for connecting with an external circuit of the package structure (for example, the first substrate 13 on the neural network processor board 10), so as to realize transmission of external data and internal data, and facilitate processing of data by the neural network chip 21 or a neural network processor corresponding to the neural network chip 21. The type and number of the pins are not limited in the present application, and different pin forms can be selected according to different packaging technologies and arranged according to certain rules.
Optionally, the neural network chip package structure further includes an insulating filler disposed in the gaps between the pads 22, the solder balls 23, and the connection points 25, for preventing interference between adjacent solder balls.
Wherein, the material of the insulating filler can be silicon nitride, silicon oxide or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
Optionally, the neural network chip package structure further includes a heat dissipation device for dissipating the heat generated during operation of the neural network chip 21. The heat dissipation device may be a metal plate with good thermal conductivity, a heat sink, or a cooling device such as a fan.
For example, as shown in fig. 6A, the neural network chip package structure 11 includes: the neural network chip 21, the bonding pad 22, the solder ball 23, the second substrate 24, the connection point 25 on the second substrate 24, the pin 26, the insulating filler 27, the thermal grease 28 and the metal housing heat sink 29. The heat dissipation paste 28 and the metal case heat dissipation sheet 29 are used to dissipate heat generated during operation of the neural network chip 21.
Optionally, the neural network chip package structure 11 further includes a reinforcing structure connected to the bonding pad 22 and embedded in the solder ball 23 to enhance the connection strength between the solder ball 23 and the bonding pad 22.
The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited herein.
The specific form of the first electrical and non-electrical connection device 12 is not limited in the present application, and reference may be made to the description of the second electrical and non-electrical connection device 112. That is, the neural network chip package structure 11 may be attached by soldering, or a connecting-wire or plug-in connection may be adopted between the second substrate 113 and the first substrate 13, to facilitate subsequent replacement of the first substrate 13 or the neural network chip package structure 11.
Optionally, the first substrate 13 includes an interface for a memory unit for expanding the storage capacity, for example: Synchronous Dynamic Random Access Memory (SDRAM) or Double Data Rate SDRAM (DDR SDRAM), which improves the processing capability of the neural network processor by expanding the memory.
The first substrate 13 may further include a Peripheral Component Interconnect Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) interface, and the like, for data transmission between the package structure and external circuits, which can improve the operation speed and the convenience of operation.
The neural network processor is packaged into the neural network chip 111, the neural network chip 111 is packaged into the neural network chip package structure 11, and the neural network chip package structure 11 is packaged into the neural network processor board card 10, which exchanges data with external circuits (for example, a computer motherboard) through an interface (a slot or ferrule) on the board card. That is, the neural network processor board card 10 directly realizes the function of the neural network processor while protecting the neural network chip 111. Other modules can also be added to the neural network processor board card 10, which broadens the application range and improves the operation efficiency of the neural network processor.
In one embodiment, the present disclosure discloses an electronic device comprising the above neural network processor board card 10 or the neural network chip package 11.
Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, cell phones, tachographs, navigators, sensors, cameras, servers, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, home appliances, and/or medical devices.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules illustrated are not necessarily required to practice the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The above embodiments of the present invention are described in detail, and the principle and the implementation of the present invention are explained by applying specific embodiments, and the above description of the embodiments is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (11)
1. An arithmetic device for performing an operation in accordance with an extended instruction, the arithmetic device comprising: a memory, an arithmetic unit and a control unit;
the extended instruction includes: an opcode and an operation domain, the opcode comprising: identifying an identification of a vector computation instruction; the operation domain includes: an input vector address of the vector calculation instruction, an output vector address of the vector calculation instruction, an identifier of the second calculation instruction, input data of the second calculation instruction, a data type and a data length N;
a memory for storing vectors;
the control unit is used for acquiring an extended instruction, analyzing the extended instruction to obtain a vector operation instruction and a second operation instruction, determining the calculation sequence of the vector operation instruction and the second operation instruction according to the vector operation instruction and the second operation instruction, and reading an input vector corresponding to the input vector address from a memory;
the control unit is specifically configured to identify whether output data of the vector operation instruction is the same as input data of the second calculation instruction, and if so, determine that the calculation order is a positive order calculation; identifying whether the input data of the vector operation instruction is the same as the output data of the second calculation instruction, and if so, determining that the calculation sequence is reverse calculation; identifying whether input data of the vector operation instruction is associated with output data of the second calculation instruction or not, and if not, determining that the calculation sequence is unordered calculation;
the arithmetic unit is used for executing the vector operation instruction and the second operation instruction on the input vector according to the calculation order to obtain the result of the extended instruction, wherein if the calculation order is positive-order calculation, the vector operation instruction is executed first and then the second operation instruction; if the calculation order is reverse-order calculation, the second operation instruction is executed first and then the vector operation instruction; and if the calculation order is out-of-order calculation, the vector operation instruction and the second operation instruction have no required execution order.
2. The arithmetic device of claim 1, further comprising:
and the register unit is used for storing the extended instruction to be executed.
3. The arithmetic device according to claim 2, wherein the control unit includes:
the instruction fetching module is used for acquiring an extended instruction from the register unit;
the decoding module is used for decoding the obtained extended instruction to obtain a vector operation instruction, a second operation instruction and a calculation sequence;
and the instruction queue is used for storing the decoded vector operation instruction and the second operation instruction according to the calculation sequence.
4. The arithmetic device of claim 3, further comprising:
the dependency relationship processing unit is used for judging, before the control unit acquires the extended instruction, whether the extended instruction accesses the same vector as a previous extended instruction; if so, the vector operation instruction and the second operation instruction of the current extended instruction are provided to the operation unit only after the previous extended instruction has been completely executed; otherwise, the vector operation instruction and the second operation instruction are provided to the operation unit directly.
5. The computing device of claim 4, wherein the dependency processing unit is configured to store a current extended instruction in a store queue when the current extended instruction and a previous extended instruction access the same vector, and to provide the current extended instruction in the store queue to the control unit after the previous extended instruction is executed.
6. The computing device of any of claims 1-5, wherein the memory is a scratch pad memory.
7. The arithmetic device of claim 1, wherein the arithmetic unit comprises a vector addition circuit, a vector multiplication circuit, a size comparison circuit, a non-linear arithmetic circuit, and a vector scalar multiplication circuit.
8. The arithmetic device according to claim 7, wherein the arithmetic unit has a multi-stage pipeline structure, wherein the vector multiplication circuit and the vector scalar multiplication circuit are in a first pipeline stage, the size comparison circuit and the vector addition circuit are in a second pipeline stage, and the non-linear arithmetic circuit is in a third pipeline stage, wherein output data of the first pipeline stage is input data of the second pipeline stage, and output data of the second pipeline stage is input data of the third pipeline stage.
9. The arithmetic device according to claim 8, wherein the arithmetic unit further comprises a conversion circuit, and the conversion circuit is located at the first pipeline stage and the third pipeline stage, or at the first pipeline stage only, or at the third pipeline stage only.
10. A chip incorporating an arithmetic device as claimed in any one of claims 1 to 9.
11. An electronic device, characterized in that the electronic device comprises a chip according to claim 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711244055.7A CN107861757B (en) | 2017-11-30 | 2017-11-30 | Arithmetic device and related product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107861757A CN107861757A (en) | 2018-03-30 |
CN107861757B true CN107861757B (en) | 2020-08-25 |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108388446A (en) * | 2018-02-05 | 2018-08-10 | 上海寒武纪信息科技有限公司 | Computing module and method |
CN110413561B (en) * | 2018-04-28 | 2021-03-30 | 中科寒武纪科技股份有限公司 | Data acceleration processing system |
EP3796189A4 (en) | 2018-05-18 | 2022-03-02 | Cambricon Technologies Corporation Limited | Video retrieval method, and method and apparatus for generating video retrieval mapping relationship |
CN110147872B (en) * | 2018-05-18 | 2020-07-17 | 中科寒武纪科技股份有限公司 | Code storage device and method, processor and training method |
CN109032670B (en) * | 2018-08-08 | 2021-10-19 | 上海寒武纪信息科技有限公司 | Neural network processing device and method for executing vector copy instruction |
CN110929855B (en) * | 2018-09-20 | 2023-12-12 | 合肥君正科技有限公司 | Data interaction method and device |
CN111061507A (en) * | 2018-10-16 | 2020-04-24 | 上海寒武纪信息科技有限公司 | Operation method, operation device, computer equipment and storage medium |
CN110096310B (en) * | 2018-11-14 | 2021-09-03 | 上海寒武纪信息科技有限公司 | Operation method, operation device, computer equipment and storage medium |
CN111353124A (en) * | 2018-12-20 | 2020-06-30 | 上海寒武纪信息科技有限公司 | Operation method, operation device, computer equipment and storage medium |
CN111290788B (en) * | 2018-12-07 | 2022-05-31 | 上海寒武纪信息科技有限公司 | Operation method, operation device, computer equipment and storage medium |
CN111290789B (en) * | 2018-12-06 | 2022-05-27 | 上海寒武纪信息科技有限公司 | Operation method, operation device, computer equipment and storage medium |
CN111275197B (en) * | 2018-12-05 | 2023-11-10 | 上海寒武纪信息科技有限公司 | Operation method, device, computer equipment and storage medium |
CN109711539B (en) * | 2018-12-17 | 2020-05-29 | 中科寒武纪科技股份有限公司 | Operation method, device and related product |
US11841822B2 (en) | 2019-04-27 | 2023-12-12 | Cambricon Technologies Corporation Limited | Fractal calculating device and method, integrated circuit and board card |
CN111860799A (en) * | 2019-04-27 | 2020-10-30 | 中科寒武纪科技股份有限公司 | Arithmetic device |
CN117707992A (en) * | 2022-09-13 | 2024-03-15 | 华为技术有限公司 | Data operation method, device and equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5337395A (en) * | 1991-04-08 | 1994-08-09 | International Business Machines Corporation | SPIN: a sequential pipeline neurocomputer |
US6304963B1 (en) * | 1998-05-14 | 2001-10-16 | Arm Limited | Handling exceptions occuring during processing of vector instructions |
CN1349159A (en) * | 2001-11-28 | 2002-05-15 | 中国人民解放军国防科学技术大学 | Vector processing method of microprocessor |
CN105359052A (en) * | 2012-12-28 | 2016-02-24 | 英特尔公司 | Method and apparatus for integral image computation instructions |
CN106990940A (en) * | 2016-01-20 | 2017-07-28 | 南京艾溪信息科技有限公司 | A kind of vector calculation device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2069947A4 (en) * | 2006-09-26 | 2013-10-09 | Qualcomm Inc | Software implementation of matrix inversion in a wireless communication system |
US9250916B2 (en) * | 2013-03-12 | 2016-02-02 | International Business Machines Corporation | Chaining between exposed vector pipelines |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||