CN107861757B - Arithmetic device and related product - Google Patents


Info

Publication number
CN107861757B
Authority
CN
China
Prior art keywords
instruction
vector
calculation
operation instruction
arithmetic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711244055.7A
Other languages
Chinese (zh)
Other versions
CN107861757A (en)
Inventor
陈天石
王秉睿
张潇
刘少礼
陈云霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201711244055.7A
Publication of CN107861757A
Application granted
Publication of CN107861757B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181 Instruction operation extension or modification
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867 Concurrent instruction execution using instruction pipelines

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention provides an arithmetic device for executing an operation according to an extended instruction, the arithmetic device including: a memory, an arithmetic unit and a control unit. The extended instruction includes an opcode and an operation field. The memory is used to store vectors. The control unit is used to acquire an extended instruction, analyze it to obtain a vector operation instruction and a second operation instruction, determine the calculation order of the vector operation instruction and the second operation instruction, and read the input vector corresponding to the input vector address from the memory. The operation unit is used to execute the vector operation instruction and the second operation instruction on the input vector according to the calculation order to obtain the result of the extended instruction. The technical scheme provided by the invention has the advantages of low power consumption and low calculation overhead.

Description

Arithmetic device and related product
Technical Field
The present invention relates to the field of communications technologies, and in particular, to an arithmetic device and a related product.
Background
In modern general-purpose and special-purpose processors, computation instructions (e.g., vector instructions) are increasingly introduced to perform operations. A vector instruction is an instruction that causes the processor to perform a vector or matrix operation; examples include vector addition and subtraction, vector inner product, matrix multiplication, and matrix convolution. At least one input of a vector instruction is a vector or a matrix, or the result of the operation is a vector or a matrix. Vector instructions can perform parallel calculation by invoking a vector processing component in the processor, thereby improving operation speed. In existing vector instructions, the vectors or matrices in the operands or results are generally of fixed size; for example, a vector instruction in the Neon vector extension of an ARM processor can process, at one time, a 32-bit floating-point vector of length 4 or a 16-bit fixed-point vector of length 8.
Therefore, conventional vector operation instructions cannot operate on variable-size vectors or matrices, and each such instruction can realize only one operation; for example, one vector instruction can implement a multiplication or an addition, but not two or more operations. As a result, conventional vector operations have high operation overhead and high energy consumption.
Disclosure of Invention
The embodiments of the present invention provide an arithmetic device and related products, which allow a single operation instruction to realize multiple operations, thereby reducing operation overhead and module power consumption.
In a first aspect, an embodiment of the present invention provides an arithmetic device for performing an operation in accordance with an extended instruction, the arithmetic device comprising: a memory, an arithmetic unit and a control unit;
the extended instruction includes: an opcode and an operation field, the opcode comprising: an identification identifying a vector calculation instruction; the operation field includes: an input vector address of the vector calculation instruction, an output vector address of the vector calculation instruction, an identifier of the second calculation instruction, input data of the second calculation instruction, a data type and a data length N;
a memory for storing vectors;
the control unit is used for acquiring an extended instruction, analyzing the extended instruction to obtain a vector operation instruction and a second operation instruction, determining the calculation sequence of the vector operation instruction and the second operation instruction according to the vector operation instruction and the second operation instruction, and reading an input vector corresponding to the input vector address from a memory;
and the operation unit is used for executing the vector operation instruction and the second operation instruction to the input vector according to the calculation sequence to obtain the result of the extended instruction.
Optionally, the operation device further includes:
and the register unit is used for storing the extended instruction to be executed.
Optionally, the control unit includes:
the instruction fetching module is used for acquiring an extended instruction from the register unit;
the decoding module is used for decoding the obtained extended instruction to obtain a vector operation instruction, a second operation instruction and a calculation sequence;
and the instruction queue is used for storing the decoded vector operation instruction and the second operation instruction according to the calculation sequence.
Optionally, the operation device further includes:
the dependency relationship processing unit is used for judging, before the control unit acquires the extended instruction, whether the extended instruction and a previous extended instruction access the same vector; if so, the vector operation instruction and the second operation instruction of the current extended instruction are provided to the operation unit only after the previous extended instruction has finished executing; otherwise, the vector operation instruction and the second operation instruction of the current extended instruction are provided to the operation unit directly.
Optionally, if the current extended instruction and the previous extended instruction access the same vector, the dependency processing unit stores the current extended instruction in a storage queue, and after the previous extended instruction is executed, provides the current extended instruction in the storage queue to the control unit.
Optionally, the memory is a scratch pad memory.
Optionally, the operation unit includes a vector addition circuit, a vector multiplication circuit, a size comparison circuit, a nonlinear operation circuit, a vector scalar multiplication circuit, and an activation circuit.
Optionally, the operation unit is of a multi-stage pipeline structure, wherein the vector multiplication circuit and the vector scalar multiplication circuit are in a first pipeline stage, the size comparison circuit and the vector addition circuit are in a second pipeline stage, the nonlinear operation circuit and the activation circuit are in a third pipeline stage, output data of the first pipeline stage is input data of the second pipeline stage, and output data of the second pipeline stage is input data of the third pipeline stage.
Optionally, the control unit is specifically configured to identify whether the output data of the vector operation instruction is the same as the input data of the second calculation instruction, and if so, determine that the calculation order is forward-order calculation; identify whether the input data of the vector operation instruction is the same as the output data of the second calculation instruction, and if so, determine that the calculation order is reverse-order calculation; and identify whether the input data of the vector operation instruction is associated with the output data of the second calculation instruction, and if not, determine that the calculation order is out-of-order calculation.
In a second aspect, a chip is provided, where the chip integrates the arithmetic device provided in the first aspect.
In a third aspect, an electronic device is provided, which includes the chip provided in the second aspect.
It can be seen that the extended instruction provided by the embodiment of the present invention strengthens the function of the instruction, and replaces a plurality of original instructions with one instruction. Therefore, the number of instructions required by complex vector and matrix operation is reduced, and the use of vector instructions is simplified; compared with a plurality of instructions, the method does not need to store intermediate results, saves storage space, and avoids additional reading and writing expenses.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description are some embodiments of the present invention.
Fig. 1A is a schematic structural diagram of an arithmetic device provided in the present invention.
FIG. 1 is a flow chart of a method of implementing an extended instruction according to the present invention.
FIG. 2 is a schematic diagram of a structure of an arithmetic unit according to the present invention.
Fig. 3A is a schematic structural diagram of a control unit provided in the present invention.
Fig. 3 is a schematic structural diagram of a neural network processor board card according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a neural network chip package structure according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a neural network chip according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of a neural network chip package structure according to an embodiment of the present disclosure;
fig. 6A is a schematic diagram of another neural network chip package structure according to an embodiment of the present application.
The dashed components in the drawings represent optional.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of the invention and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. The "/" herein may mean "or".
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The following describes a vector dot product operation, taking a CPU as an example. The vector dot product computes the dot product of two vectors. Description of the function: given vectors x, y of length n and a scalar r, the following vector-vector operation is performed:
r := Σ_{i=1}^{n} x_i * y_i
For the vector dot product, the instruction may be "DOT TYPE, N, X, Y, R", where DOT represents a vector dot product instruction, TYPE represents the type of data that can be operated on (e.g., real or complex), N represents the length of the vectors, X represents the first address of vector x, Y represents the first address of vector y, and R is the scalar result. As described above, such a vector dot product instruction can implement only one kind of operation, namely the vector dot product itself; it cannot implement multiple operations, e.g., combining the vector dot product with reading discretely stored data.
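As a hedged illustration, the semantics of such a "DOT TYPE, N, X, Y, R" instruction can be modeled in a few lines; the flat-memory model and function name below are assumptions for exposition, not an actual processor implementation:

```python
# Hypothetical model of the "DOT TYPE, N, X, Y, R" instruction semantics.
# Memory is a flat list; X and Y are first addresses, N is the vector length.
def execute_dot(memory, n, x_addr, y_addr):
    """Return the scalar r, the dot product of the vectors at x_addr and y_addr."""
    x = memory[x_addr:x_addr + n]
    y = memory[y_addr:y_addr + n]
    return sum(a * b for a, b in zip(x, y))

mem = [0.0] * 16
mem[0:4] = [1.0, 2.0, 3.0, 4.0]  # vector x at first address 0
mem[4:8] = [5.0, 6.0, 7.0, 8.0]  # vector y at first address 4
r = execute_dot(mem, 4, 0, 4)    # 1*5 + 2*6 + 3*7 + 4*8 = 70.0
```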
As shown in fig. 1A, the arithmetic device includes: a memory 111, a register 112 (optional), an arithmetic unit 114, a control unit 115, and a dependency processing unit 116 (optional);
as shown in FIG. 2, the operation unit 114 includes: conversion circuitry (optional), vector addition circuitry, vector multiplication circuitry, size comparison circuitry, non-linear operation circuitry, vector scalar multiplication circuitry and activation circuitry.
The arithmetic unit has a multi-stage pipeline structure. Specifically, as shown in fig. 2, the first pipeline stage includes, but is not limited to: vector multiplication circuits, vector scalar multiplication circuits, and the like.
The second pipeline stage includes, but is not limited to: size comparison calculators (e.g., comparators), vector addition circuits, and the like.
The third pipeline stage includes, but is not limited to: nonlinear operation components (specifically, an activation circuit or a transcendental-function calculation circuit, etc.), and the like.
If the arithmetic unit comprises a conversion circuit, the conversion circuit can be in the first or third pipeline stage.
The output data of the first pipeline stage is the input data of the second pipeline stage, and the output data of the second pipeline stage is the input data of the third pipeline stage. The input to the first pipeline stage may be input data (e.g., an input vector), and the output of the third pipeline stage may be the result of the computation.
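The stage-to-stage data flow described above can be sketched as a functional model; the choice of a ReLU-style activation for the third stage is an assumption for illustration, not the patented circuit itself:

```python
# Illustrative model of the three pipeline stages (an assumption for
# exposition): each stage's output feeds the next stage's input.
def stage1_multiply(x, y):           # vector multiplication circuit
    return [a * b for a, b in zip(x, y)]

def stage2_add(v, w):                # vector addition circuit
    return [a + b for a, b in zip(v, w)]

def stage3_activate(v):              # activation circuit (ReLU chosen as an example)
    return [max(0.0, a) for a in v]

# Chained stages compute an activated multiply-accumulate: act(x*y + c)
x, y, c = [1.0, -2.0], [3.0, 4.0], [0.5, -1.0]
result = stage3_activate(stage2_add(stage1_multiply(x, y), c))
```

Chaining the three functions mirrors how a fused instruction can complete without writing intermediate results back to memory.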
The present invention also provides an extended instruction, including an opcode and an operation field. The opcode includes: an identification (e.g., ROT) identifying the first calculation instruction. The operation field includes: an input data address of the first calculation instruction, an output data address of the first calculation instruction, an identifier of the second calculation instruction, input data of the second calculation instruction, a data type, and a data length N.
Optionally, the extended instruction may further include: a third computation instruction and input data of the third computation instruction.
It should be noted that the calculation instruction may be a vector operation instruction or a matrix instruction, and the embodiments of the present invention do not limit the specific expression of the calculation instruction.
The arithmetic device may be configured to execute an extended instruction, and specifically includes:
a memory for storing vectors;
the control unit is used for acquiring an extended instruction, analyzing the extended instruction to obtain a vector operation instruction and a second operation instruction, determining the calculation sequence of the vector operation instruction and the second operation instruction according to the vector operation instruction and the second operation instruction, and reading an input vector corresponding to the input vector address from a memory;
and the operation unit is used for executing the vector operation instruction and the second operation instruction to the input vector according to the calculation sequence to obtain the result of the extended instruction.
And the register unit is used for storing the extended instruction to be executed.
Optionally, as shown in fig. 3A, the control unit 115 may include:
the instruction fetching module is used for acquiring an extended instruction from the register unit;
the decoding module is used for decoding the obtained extended instruction to obtain a vector operation instruction, a second operation instruction and a calculation sequence;
and the instruction queue is used for storing the decoded vector operation instruction and the second operation instruction according to the calculation sequence.
The dependency relationship processing unit 116 is configured to, before the control unit obtains the extended instruction, determine whether the extended instruction and a previous extended instruction access the same vector; if so, it waits for the previous extended instruction to finish executing and then provides the vector operation instruction and the second operation instruction of the current extended instruction to the operation unit; otherwise, it provides the vector operation instruction and the second operation instruction of the current extended instruction to the operation unit directly.
The dependency processing unit 116 is further configured to store the current extended instruction in a storage queue when the current extended instruction and a previous extended instruction access the same vector, and provide the current extended instruction in the storage queue to the control unit after the previous extended instruction is executed.
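A minimal sketch of this dependency check, assuming each instruction's vector accesses are described by (first address, length) pairs; the field names and range-overlap criterion are assumptions:

```python
# Hedged sketch of the dependency check: two extended instructions conflict
# if any of the vector address ranges they access overlap.
def ranges_overlap(addr_a, len_a, addr_b, len_b):
    return addr_a < addr_b + len_b and addr_b < addr_a + len_a

def has_dependency(curr, prev):
    """curr/prev: dicts with 'ins' (list of (addr, len)) and 'out' ((addr, len))."""
    regions_prev = [prev["out"]] + prev["ins"]
    regions_curr = [curr["out"]] + curr["ins"]
    return any(ranges_overlap(a, la, b, lb)
               for (a, la) in regions_curr for (b, lb) in regions_prev)

prev = {"ins": [(0, 4)], "out": (8, 4)}     # writes addresses 8..11
curr = {"ins": [(8, 4)], "out": (16, 4)}    # reads addresses 8..11 -> conflict
curr2 = {"ins": [(32, 4)], "out": (40, 4)}  # disjoint ranges -> no conflict
```

A conflicting instruction would be held in the storage queue until the previous extended instruction completes, as described above.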
Optionally, the memory is a scratch pad memory.
Referring to fig. 1, fig. 1 provides an implementation method of an extended instruction. The extended instruction in the method may include an opcode and an operation field; the first calculation instruction may be a vector operation instruction, such as AXPY. The opcode includes: an identification (e.g., AXPY) identifying the first calculation instruction. The operation field includes: an input data address of the first calculation instruction, an output data address of the first calculation instruction, an identifier of the second calculation instruction, input data of the second calculation instruction, a data type, and a data length N (a user-set value; the invention is not limited to the specific form of N). The method is executed by an arithmetic device or a computing chip; the arithmetic device is shown in fig. 1A. The method, shown in fig. 1, comprises the following steps:
Step S101, the arithmetic device acquires an extended instruction and analyzes the extended instruction to obtain a first calculation instruction and a second calculation instruction;
Step S102, the arithmetic device determines a calculation order according to the first calculation instruction and the second calculation instruction, and executes the first calculation instruction and the second calculation instruction according to the calculation order to obtain the result of the extended instruction.
The technical scheme provided by the invention provides an implementation method of the extended instruction, so that the arithmetic device can execute two calculation instructions from a single extended instruction; one extended instruction can thus realize two types of calculation, thereby reducing calculation overhead and power consumption.
Optionally, the calculation order may specifically include: out-of-order calculation, forward-order calculation, or reverse-order calculation. Out-of-order calculation means that the first calculation instruction and the second calculation instruction have no required order; forward-order calculation means executing the first calculation instruction first and then the second calculation instruction; reverse-order calculation means executing the second calculation instruction first and then the first calculation instruction.
The specific implementation by which the arithmetic device determines the calculation order from the first calculation instruction and the second calculation instruction may be as follows: the arithmetic device identifies whether the output data of the first calculation instruction is the same as the input data of the second calculation instruction; if so, the calculation order is determined to be forward-order calculation. Otherwise, the arithmetic device identifies whether the input data of the first calculation instruction is the same as the output data of the second calculation instruction; if so, the calculation order is determined to be reverse-order calculation. The arithmetic device identifies whether the input data of the first calculation instruction is associated with the output data of the second calculation instruction; if not, the calculation order is determined to be out-of-order calculation.
Specifically, for example, for F = A × B + C, the first calculation instruction is a matrix multiplication instruction and the second calculation instruction is a matrix addition instruction; since the matrix addition instruction requires the result (i.e., the output data) of the first calculation instruction, the calculation order is determined to be forward-order calculation. In another example, if the first calculation instruction is a matrix multiplication instruction and the second calculation instruction is a transformation, such as transposition or conjugation, the first calculation instruction uses the output of the second calculation instruction, and the calculation order is therefore reverse order. If no such association exists, that is, the output data of the first calculation instruction is different from the input data of the second calculation instruction, and the input data of the first calculation instruction is also different from the output data of the second calculation instruction, it is determined that no association exists and the calculation order is out-of-order.
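The order-determination rule from the examples above can be sketched as follows; the operand names ("T", "Bt", etc.) are hypothetical placeholders for intermediate results:

```python
# Sketch of the order-determination rule (names are assumptions):
# forward if the first instruction's output feeds the second's input,
# reverse if the second's output feeds the first's input, else out-of-order.
def calc_order(first_in, first_out, second_in, second_out):
    if first_out in second_in:
        return "forward"
    if second_out in first_in:
        return "reverse"
    return "out-of-order"

# F = A*B + C: the multiply writes T, the add reads T -> forward order
order1 = calc_order(first_in={"A", "B"}, first_out="T",
                    second_in={"T", "C"}, second_out="F")
# the multiply reads the matrix Bt produced by a transpose -> reverse order
order2 = calc_order(first_in={"Bt", "A"}, first_out="F",
                    second_in={"B"}, second_out="Bt")
```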
The vector instruction provided by the invention is expanded, the function of the instruction is strengthened, and one instruction replaces a plurality of original instructions. Therefore, the number of instructions required by complex vector and matrix operation is reduced, and the use of vector instructions is simplified; compared with a plurality of instructions, the method does not need to store intermediate results, saves storage space, and avoids additional reading and writing expenses.
If the first calculation instruction is a vector instruction, for an input vector or matrix in the vector instruction, the instruction adds a function of scaling the input vector or matrix, i.e. adds an operand representing a scaling coefficient in the operation domain, and when the vector is read in, the vector is firstly scaled according to the scaling coefficient (i.e. the second calculation instruction is a scaling instruction). If there are operations in a vector instruction that multiply multiple input vectors or matrices, the scaling coefficients corresponding to these input vectors or matrices may be combined into one.
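The coefficient merging can be illustrated with a dot product of two scaled vectors, using the identity (α·x)·(β·y) = α·β·(x·y); this is an illustrative sketch, not the device's implementation:

```python
# Illustration of merging scaling coefficients: when an instruction multiplies
# several scaled inputs, the per-input coefficients fuse into one coefficient,
# so the scaling is applied only once.
def scaled_dot(x, y, alpha, beta):
    # naive form: scale each input vector, then take the dot product
    return sum(a * b for a, b in zip([alpha * v for v in x],
                                     [beta * v for v in y]))

def fused_dot(x, y, alpha, beta):
    # fused form: a single combined coefficient applied to the unscaled result
    return alpha * beta * sum(a * b for a, b in zip(x, y))

x, y = [1.0, 2.0], [3.0, 4.0]
```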
If the first calculation instruction is a vector instruction, the instruction adds a function of transposing an input matrix in the vector instruction (i.e., the second calculation instruction is a transpose instruction). An operand indicating whether to transpose the matrix is added to the instruction, and the operand indicates whether to transpose the matrix before operation.
If the first calculation instruction is a vector instruction, for the output vector or matrix in the vector instruction, the instruction adds a function of accumulating onto the original output vector or matrix (i.e., the second calculation instruction is an addition instruction). A coefficient for scaling the original output vector or matrix is added to the instruction (i.e., a third calculation instruction, which may be a scaling instruction), indicating that after the vector or matrix operation is performed, the result is added to the scaled original output to serve as the new output.
If the first computing instruction is a vector instruction, the instruction adds a function to read in fixed steps to the input vector in the vector instruction. An operand representing the read step size of the input vector is added to the instruction (i.e. the second calculation instruction is a read vector in a fixed step size), representing the difference between the addresses of two adjacent elements in the vector.
If the first computing instruction is a vector instruction, the instruction adds a function of writing the result in fixed steps to the result vector in the vector instruction (i.e., the second computing instruction writes the vector in fixed steps). An operand representing the read step size of the result vector is added to the instruction, representing the difference between the addresses of two adjacent elements in the vector. If a vector is both an input and a result, the same step size is used for the vector as an input and as a result.
If the first computing instruction is a vector instruction, the instruction adds the function of reading row or column vectors in fixed steps to the input matrix in the vector instruction (i.e., the second computing instruction is reading multiple vectors in fixed steps). An operand representing a matrix read step size is added to the instruction, representing the difference in the first address between matrix row or column vectors.
If the first computing instruction is a vector instruction, the instruction adds the function of reading row or column vectors at a fixed step size to the result matrix in the vector instruction (i.e., the second computing instruction is writing multiple vectors at a fixed step size). An operand representing a matrix read step size is added to the instruction, representing the difference in the first address between matrix row or column vectors. If a matrix is both an input and a result matrix, the same step size is used as input and as result.
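Fixed-step (strided) reads, as used above for both vectors and matrix rows or columns, can be sketched under a flat-memory assumption; the function name and layout are illustrative:

```python
# Sketch of a fixed-stride read: the stride is the address difference between
# two adjacent elements, so a column of a row-major matrix can be read as a
# strided vector without converting the storage format first.
def read_strided(memory, base, n, stride):
    return [memory[base + i * stride] for i in range(n)]

# 2x3 row-major matrix stored flat: row elements have stride 1, columns stride 3
mem = [1, 2, 3,
       4, 5, 6]
row0 = read_strided(mem, 0, 3, 1)   # first row
col1 = read_strided(mem, 1, 2, 3)   # second column
```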
The actual structure of the above-described extension instruction is described below with respect to some actual extension instructions.
Vector multiply add
Calculating the product of a vector and a scalar and adding the result to another vector
Description of the function:
given a vector x, y and a scalar a, the following vector-vector operation is performed
y := a*x + y
The instruction format is shown in Table 1-1:
Table 1-1: (instruction format table not reproduced)
The length of the vector in the instruction format shown in Table 1-1 is variable, which can reduce the number of instructions and simplify the use of instructions.
The vector format stored at certain intervals is supported, and the execution overhead of converting the vector format and the space occupation of storing intermediate results are avoided.
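A minimal sketch of the multiply-add operation y := a*x + y with variable length (the function name is an assumption):

```python
# Sketch of the AXPY-style vector multiply-add: y := a*x + y,
# where a is a scalar and x, y are vectors of variable length n.
def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

y = axpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])  # [12.0, 24.0, 36.0]
```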
Vector dot product
Computing a dot product of a vector and a vector
Description of the function: given vectors x, y of length n and a scalar r, the following vector-vector operation is performed
r := Σ_{i=1}^{n} x_i * y_i
The instruction format is shown in Table 1-2:
Table 1-2: (instruction format table not reproduced)
The variable length of the vectors in the instruction format shown in Table 1-2 can reduce the number of instructions and simplify the use of instructions.
The vector format stored at certain intervals is supported, and the execution overhead of converting the vector format and the space occupation of storing intermediate results are avoided.
Norm of vector
Computing Euclidean norms of vectors
Description of the function:
The instruction performs the following vector reduction operation:
r := ||x||_2 = sqrt(Σ_{i=1}^{n} x_i^2)
The instruction format is shown in Table 1-3:
Table 1-3: (instruction format table not reproduced)
The variable length of the vectors in the instruction formats shown in tables 1-3 can reduce the number of instructions and simplify the use of instructions. The vector format stored at certain intervals is supported, and the execution overhead of converting the vector format and the space occupation of storing intermediate results are avoided.
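The Euclidean-norm reduction can be sketched as follows (a functional model, not the hardware implementation):

```python
# Sketch of the reduction r := ||x||_2 performed by the norm instruction.
import math

def euclidean_norm(x):
    return math.sqrt(sum(v * v for v in x))

r = euclidean_norm([3.0, 4.0])  # 5.0
```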
Vector addition
Computing the sum of all elements of a vector
Description of the function:
The instruction performs the following vector reduction operation:
r := Σ_{i=1}^{n} x_i
The instruction format is shown in Table 1-4:
Table 1-4: (instruction format table not reproduced)
The variable length of the vectors in the instruction format shown in Table 1-4 can reduce the number of instructions and simplify the use of instructions.
The vector format stored at certain intervals is supported, and the execution overhead of converting the vector format and the space occupation of storing intermediate results are avoided.
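A minimal software sketch of this reduction (illustrative only; a hardware unit would typically reduce in a tree, while the serial loop here models only the architectural result):

```python
def vsum(mem, addr, n, stride=1):
    """Sketch of the vector-sum reduction instruction: r := sum over i of x_i."""
    r = 0
    for i in range(n):
        r += mem[addr + i * stride]
    return r

r = vsum([1, 2, 3, 4, 5, 6], 0, 3, stride=2)  # 1 + 3 + 5 = 9
```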
Maximum value of vector
Calculating the position of the largest element among all elements of the vector
Description of the function:
For a vector x of length n, the instruction writes the position of the largest element of the vector into the scalar i.
The instruction format is shown in tables 1-5:
tables 1 to 5:
Because the vector length in the instruction format shown in Tables 1-5 is variable, the number of instructions can be reduced and their use simplified. Vectors stored at a fixed stride are also supported, which avoids both the execution overhead of converting the vector format and the space occupied by storing intermediate results.
Vector minimum
Calculating the position of the smallest element among all elements of the vector
Description of the function:
For a vector x of length n, the instruction writes the position of the smallest element of the vector into the scalar i.
The instruction format is shown in tables 1-6:
tables 1 to 6
Because the vector length in the instruction format shown in Tables 1-6 is variable, the number of instructions can be reduced and their use simplified. Vectors stored at a fixed stride are also supported, which avoids both the execution overhead of converting the vector format and the space occupied by storing intermediate results.
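The max-position and min-position instructions above can be sketched as follows (illustrative only; the tie-breaking rule, first occurrence, is an assumption, since the source does not state one):

```python
def vmax_pos(x):
    """Sketch of the vector-max instruction: write the position of the
    largest element into scalar i (first occurrence on ties, assumed)."""
    i = 0
    for k in range(1, len(x)):
        if x[k] > x[i]:
            i = k
    return i

def vmin_pos(x):
    """Counterpart sketch for the vector-min instruction."""
    i = 0
    for k in range(1, len(x)):
        if x[k] < x[i]:
            i = k
    return i

imax = vmax_pos([2, 9, 4, 9])  # position 1 (first maximum)
imin = vmin_pos([2, 9, 4, 1])  # position 3
```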
Vector outer product
Computing the tensor product of two vectors (outer product)
Description of the function:
The instruction performs the following matrix-vector operation:
A := α * x * y^T + A
The instruction format is shown in tables 1-7:
tables 1 to 7
Scaling the result matrix by the scalar α in the instruction format shown in Tables 1-7 increases the flexibility of the instruction and avoids the extra overhead of a separate scaling instruction. The sizes of the vector and the matrix are variable, which reduces the number of instructions and simplifies their use. Matrices in either storage format (row-major or column-major order) can be processed, avoiding the overhead of transposing the matrix. Vectors stored at a fixed stride are also supported, which avoids both the execution overhead of converting the vector format and the space occupied by storing intermediate results.
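The rank-1 update A := α·x·yᵀ + A, including the scalar scaling and the two storage orders described above, can be sketched as follows (illustrative; the flat-array memory model and parameter names are assumptions of this sketch):

```python
def vger(alpha, x, y, A, row_major=True):
    """Sketch of the vector outer-product instruction A := alpha * x * y^T + A.

    A is a flat list of len(x) * len(y) elements; row_major selects between
    the two storage orders the instruction supports, so no separate
    transpose pass over the matrix is needed.
    """
    m, n = len(x), len(y)
    for i in range(m):
        for j in range(n):
            idx = i * n + j if row_major else j * m + i
            A[idx] += alpha * x[i] * y[j]
    return A

# alpha scales the update inside the same instruction (no extra scaling pass)
A = vger(2, [1, 2], [3, 4], [0, 0, 0, 0])  # row-major result [[6, 8], [12, 16]]
```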
When the arithmetic device shown in fig. 1 executes an extended instruction, it implements a specific structure for computing that instruction; that is, executing one extended instruction achieves the combined effect of executing a plurality of calculation instructions. It should be noted that the extended instruction is not split into a plurality of calculation instructions when the arithmetic device executes it.
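The ordering decision the control unit makes for the two component operations of an extended instruction (positive order, reverse order, or unordered, as recited in claim 1) can be sketched as follows; representing each instruction's operands as sets of register names is an assumption of this illustration, not the patent's encoding:

```python
def calc_order(vec_in, vec_out, second_in, second_out):
    """Sketch of the control unit's ordering rule for one extended instruction.

    positive order: the vector instruction's output feeds the second instruction
    reverse order:  the second instruction's output feeds the vector instruction
    unordered:      no data dependence, so either order (or parallel) is valid
    """
    if vec_out & second_in:
        return "positive order"
    if second_out & vec_in:
        return "reverse order"
    return "unordered"

# The vector instruction writes R1 and the second instruction reads R1,
# so the vector instruction must execute first.
order = calc_order({"V0"}, {"R1"}, {"R1"}, {"R2"})
```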
Referring to fig. 3, fig. 3 is a schematic structural diagram of a neural network processor board card according to an embodiment of the present disclosure. As shown in fig. 3, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate (substrate) 13.
The specific structure of the neural network chip package structure 11 is not limited in this application, and optionally, as shown in fig. 4, the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.
The specific form of the neural network chip 111 related to the present application is not limited. The neural network chip 111 includes, but is not limited to, a neural network wafer integrating a neural network processor, and the wafer may be made of a silicon, germanium, quantum, or molecular material, among others. The neural network chip can be packaged according to practical conditions (such as a harsher environment) and different application requirements, so that most of the neural network chip is enclosed, while the pins on the chip are connected to the outer side of the package structure through conductors such as gold wires for circuit connection with the next outer layer.
The specific structure of the neural network chip 111 is not limited in the present application. Optionally, refer to fig. 1A, which is a schematic structural diagram of a computing device in the neural network chip according to an embodiment of the present application. As shown in fig. 1A, the computing apparatus includes: a memory 111, a register 112 (optional), an arithmetic unit 114, a control unit 115, and a dependency processing unit 116 (optional). The specific functions and structures of these units are described in the embodiment shown in fig. 1A.
The types of the first substrate 13 and the second substrate 113 are not limited in this application; each may be a Printed Circuit Board (PCB), a Printed Wiring Board (PWB), or another type of circuit board. The material of the PCB is not limited.
The second substrate 113 according to the present invention is used for carrying the neural network chip 111, and the neural network chip package structure 11 obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112 is used for protecting the neural network chip 111 and facilitating further packaging of the neural network chip package structure 11 and the first substrate 13.
The specific packaging method and the corresponding structure of the second electrical and non-electrical connection device 112 are not limited; an appropriate packaging method can be selected, and simply adapted, according to actual conditions and different application requirements, for example: Flip Chip Ball Grid Array (FCBGA) packages, Low-profile Quad Flat Packages (LQFP), Quad Flat Packages with Heat sink (HQFP), Quad Flat No-lead (QFN) packages, or Fine-pitch Ball Grid Array (FBGA) packages.
Flip Chip packaging is suitable when the area after packaging must be small or when the lead inductance and signal transmission time are critical. Alternatively, Wire Bonding packaging can be used, which reduces cost and increases the flexibility of the package structure.
A Ball Grid Array provides more pins with a short average lead length, enabling high-speed signal transmission. The package may alternatively be a Pin Grid Array (PGA), Zero Insertion Force (ZIF) socket, Single Edge Contact Connection (SECC), Land Grid Array (LGA), or the like.
Optionally, the neural network Chip 111 and the second substrate 113 are packaged in a Flip Chip Ball Grid Array (Flip Chip Ball Grid Array) packaging manner, and a schematic diagram of a specific neural network Chip packaging structure may refer to fig. 6. As shown in fig. 6, the neural network chip package structure includes: the neural network chip 21, the bonding pad 22, the solder ball 23, the second substrate 24, the connection point 25 on the second substrate 24, and the pin 26.
The bonding pads 22 are connected to the neural network chip 21, and the solder balls 23 are formed between the bonding pads 22 and the connection points 25 on the second substrate 24 by soldering, so that the neural network chip 21 and the second substrate 24 are connected, that is, the package of the neural network chip 21 is realized.
The pins 26 are used for connecting with an external circuit of the package structure (for example, the first substrate 13 on the neural network processor board 10), so as to realize transmission of external data and internal data, and facilitate processing of data by the neural network chip 21 or a neural network processor corresponding to the neural network chip 21. The type and number of the pins are not limited in the present application, and different pin forms can be selected according to different packaging technologies and arranged according to certain rules.
Optionally, the neural network chip package structure further includes an insulating filler disposed in the gaps between the pads 22, the solder balls 23, and the connection points 25, for preventing interference between adjacent solder balls.
Wherein, the material of the insulating filler can be silicon nitride, silicon oxide or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
Optionally, the neural network chip package structure further includes a heat dissipation device for dissipating the heat generated when the neural network chip 21 operates. The heat dissipation device may be a metal plate with good thermal conductivity, a heat sink, or a cooling device such as a fan.
For example, as shown in fig. 6A, the neural network chip package structure 11 includes: the neural network chip 21, the bonding pad 22, the solder ball 23, the second substrate 24, the connection point 25 on the second substrate 24, the pin 26, the insulating filler 27, the thermal grease 28 and the metal housing heat sink 29. The heat dissipation paste 28 and the metal case heat dissipation sheet 29 are used to dissipate heat generated during operation of the neural network chip 21.
Optionally, the neural network chip package structure 11 further includes a reinforcing structure connected to the bonding pad 22 and embedded in the solder ball 23 to enhance the connection strength between the solder ball 23 and the bonding pad 22.
The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited herein.
The specific form of the first electrical and non-electrical connection device 12 is not limited in the present application; reference may be made to the description of the second electrical and non-electrical connection device 112. That is, the neural network chip package structure 11 may be packaged by soldering, or the second substrate 113 and the first substrate 13 may be joined by connection wires or plug connections, which facilitates subsequent replacement of the first substrate 13 or of the neural network chip package structure 11.
Optionally, the first substrate 13 includes an interface for a memory unit used to expand the storage capacity, for example: Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate SDRAM (DDR SDRAM), and the like. Expanding the memory improves the processing capability of the neural network processor.
The first substrate 13 may further include a Peripheral Component Interconnect Express (PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) interface, and the like, used for data transmission between the package structure and external circuits, which can improve the operation speed and convenience of operation.
The neural network processor is packaged into a neural network chip 111, the neural network chip 111 is packaged into a neural network chip packaging structure 11, the neural network chip packaging structure 11 is packaged into a neural network processor board card 10, and data interaction is performed with an external circuit (for example, a computer motherboard) through an interface (a slot or a plug core) on the board card, that is, the function of the neural network processor is directly realized by using the neural network processor board card 10, and the neural network chip 111 is protected. And other modules can be added to the neural network processor board card 10, so that the application range and the operation efficiency of the neural network processor are improved.
In one embodiment, the present disclosure discloses an electronic device comprising the above neural network processor board card 10 or the neural network chip package 11.
Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, cell phones, tachographs, navigators, sensors, cameras, servers, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, home appliances, and/or medical devices.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules illustrated are not necessarily required to practice the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The above embodiments of the present invention are described in detail, and the principle and the implementation of the present invention are explained by applying specific embodiments, and the above description of the embodiments is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (11)

1. An arithmetic device for performing an operation in accordance with an extended instruction, the arithmetic device comprising: a memory, an arithmetic unit and a control unit;
the extended instruction includes: an opcode and an operation domain, the opcode comprising: an identifier of a vector calculation instruction; the operation domain comprising: an input vector address of the vector calculation instruction, an output vector address of the vector calculation instruction, an identifier of the second calculation instruction, input data of the second calculation instruction, a data type, and a data length N;
a memory for storing vectors;
the control unit is used for acquiring an extended instruction, analyzing the extended instruction to obtain a vector operation instruction and a second operation instruction, determining the calculation sequence of the vector operation instruction and the second operation instruction according to the vector operation instruction and the second operation instruction, and reading an input vector corresponding to the input vector address from a memory;
the control unit is specifically configured to identify whether output data of the vector operation instruction is the same as input data of the second calculation instruction, and if so, determine that the calculation order is a positive order calculation; identifying whether the input data of the vector operation instruction is the same as the output data of the second calculation instruction, and if so, determining that the calculation sequence is reverse calculation; identifying whether input data of the vector operation instruction is associated with output data of the second calculation instruction or not, and if not, determining that the calculation sequence is unordered calculation;
the arithmetic unit is used for executing the vector operation instruction and the second operation instruction on the input vector according to the calculation order to obtain the result of the extended instruction, wherein: when the calculation order is positive order calculation, the vector operation instruction is executed first and then the second operation instruction; when the calculation order is reverse order calculation, the second operation instruction is executed first and then the vector operation instruction; and when the calculation order is unordered calculation, there is no required order between the vector operation instruction and the second operation instruction.
2. The arithmetic device of claim 1, further comprising:
and the register unit is used for storing the extended instruction to be executed.
3. The arithmetic device according to claim 2, wherein the control unit includes:
the instruction fetching module is used for acquiring an extended instruction from the register unit;
the decoding module is used for decoding the obtained extended instruction to obtain a vector operation instruction, a second operation instruction and a calculation sequence;
and the instruction queue is used for storing the decoded vector operation instruction and the second operation instruction according to the calculation sequence.
4. The arithmetic device of claim 3, further comprising:
the dependency relationship processing unit is used for judging whether the expansion instruction and a previous expansion instruction access the same vector or not before the control unit acquires the expansion instruction, and if so, after the previous expansion instruction is completely executed, providing the vector operation instruction and a second operation instruction of the current expansion instruction to the operation unit; otherwise, the vector operation instruction and the second operation instruction of the vector operation instruction are provided to the operation unit.
5. The computing device of claim 4, wherein the dependency processing unit is configured to store a current extended instruction in a store queue when the current extended instruction and a previous extended instruction access the same vector, and to provide the current extended instruction in the store queue to the control unit after the previous extended instruction is executed.
6. The computing device of any of claims 1-5, wherein the memory is a scratch pad memory.
7. The arithmetic device of claim 1, wherein the arithmetic unit comprises a vector addition circuit, a vector multiplication circuit, a size comparison circuit, a non-linear arithmetic circuit, and a vector scalar multiplication circuit.
8. The arithmetic device according to claim 7, wherein the arithmetic unit has a multi-stage pipeline structure, wherein the vector multiplication circuit and the vector scalar multiplication circuit are in a first pipeline stage, the size comparison circuit and the vector addition circuit are in a second pipeline stage, and the non-linear arithmetic circuit is in a third pipeline stage, wherein output data of the first pipeline stage is input data of the second pipeline stage, and output data of the second pipeline stage is input data of the third pipeline stage.
9. The arithmetic device according to claim 8, wherein the arithmetic unit further comprises: a conversion circuit, wherein the conversion circuit is located in the first pipeline stage and the third pipeline stage, or the conversion circuit is located in the first pipeline stage, or the conversion circuit is located in the third pipeline stage.
10. A chip incorporating an arithmetic device as claimed in any one of claims 1 to 9.
11. An electronic device, characterized in that the electronic device comprises a chip according to claim 10.
CN201711244055.7A 2017-11-30 2017-11-30 Arithmetic device and related product Active CN107861757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711244055.7A CN107861757B (en) 2017-11-30 2017-11-30 Arithmetic device and related product

Publications (2)

Publication Number Publication Date
CN107861757A CN107861757A (en) 2018-03-30
CN107861757B true CN107861757B (en) 2020-08-25

Family

ID=61704370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711244055.7A Active CN107861757B (en) 2017-11-30 2017-11-30 Arithmetic device and related product

Country Status (1)

Country Link
CN (1) CN107861757B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388446A (en) * 2018-02-05 2018-08-10 上海寒武纪信息科技有限公司 Computing module and method
CN110413561B (en) * 2018-04-28 2021-03-30 中科寒武纪科技股份有限公司 Data acceleration processing system
EP3796189A4 (en) 2018-05-18 2022-03-02 Cambricon Technologies Corporation Limited Video retrieval method, and method and apparatus for generating video retrieval mapping relationship
CN110147872B (en) * 2018-05-18 2020-07-17 中科寒武纪科技股份有限公司 Code storage device and method, processor and training method
CN109032670B (en) * 2018-08-08 2021-10-19 上海寒武纪信息科技有限公司 Neural network processing device and method for executing vector copy instruction
CN110929855B (en) * 2018-09-20 2023-12-12 合肥君正科技有限公司 Data interaction method and device
CN111061507A (en) * 2018-10-16 2020-04-24 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN110096310B (en) * 2018-11-14 2021-09-03 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111353124A (en) * 2018-12-20 2020-06-30 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111290788B (en) * 2018-12-07 2022-05-31 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111290789B (en) * 2018-12-06 2022-05-27 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111275197B (en) * 2018-12-05 2023-11-10 上海寒武纪信息科技有限公司 Operation method, device, computer equipment and storage medium
CN109711539B (en) * 2018-12-17 2020-05-29 中科寒武纪科技股份有限公司 Operation method, device and related product
US11841822B2 (en) 2019-04-27 2023-12-12 Cambricon Technologies Corporation Limited Fractal calculating device and method, integrated circuit and board card
CN111860799A (en) * 2019-04-27 2020-10-30 中科寒武纪科技股份有限公司 Arithmetic device
CN117707992A (en) * 2022-09-13 2024-03-15 华为技术有限公司 Data operation method, device and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5337395A (en) * 1991-04-08 1994-08-09 International Business Machines Corporation SPIN: a sequential pipeline neurocomputer
US6304963B1 (en) * 1998-05-14 2001-10-16 Arm Limited Handling exceptions occuring during processing of vector instructions
CN1349159A (en) * 2001-11-28 2002-05-15 中国人民解放军国防科学技术大学 Vector processing method of microprocessor
CN105359052A (en) * 2012-12-28 2016-02-24 英特尔公司 Method and apparatus for integral image computation instructions
CN106990940A (en) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 A kind of vector calculation device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2069947A4 (en) * 2006-09-26 2013-10-09 Qualcomm Inc Software implementation of matrix inversion in a wireless communication system
US9250916B2 (en) * 2013-03-12 2016-02-02 International Business Machines Corporation Chaining between exposed vector pipelines



Similar Documents

Publication Publication Date Title
CN107861757B (en) Arithmetic device and related product
CN109725936B (en) Method for implementing extended computing instruction and related product
US11748601B2 (en) Integrated circuit chip device
CN109961138B (en) Neural network training method and related product
CN109961136B (en) Integrated circuit chip device and related product
CN109978131B (en) Integrated circuit chip apparatus, method and related product
CN111105033B (en) Neural network processor board card and related products
TWI793224B (en) Integrated circuit chip apparatus and related product
CN112308198A (en) Calculation method of recurrent neural network and related product
CN109977446B (en) Integrated circuit chip device and related product
CN109961131B (en) Neural network forward operation method and related product
CN109978156B (en) Integrated circuit chip device and related product
CN109978148B (en) Integrated circuit chip device and related product
CN109978157B (en) Integrated circuit chip device and related product
CN109960673B (en) Integrated circuit chip device and related product
TWI795482B (en) Integrated circuit chip apparatus and related product
CN109978158B (en) Integrated circuit chip device and related product
CN109978153B (en) Integrated circuit chip device and related product
CN111832712A (en) Method for quantizing operation data and related product
CN111832711A (en) Method for quantizing operation data and related product
CN111832696A (en) Neural network operation method and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant