CN107861757B - Arithmetic device and related product - Google Patents
- Publication number
- CN107861757B (granted publication); application CN201711244055.7A / CN201711244055A
- Authority
- CN
- China
- Prior art keywords
- instruction
- vector
- calculation
- operation instruction
- arithmetic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30181—Instruction operation extension or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
Abstract
The present invention provides an arithmetic device for executing operations according to an extended instruction. The arithmetic device includes a memory, an arithmetic unit, and a control unit. The extended instruction includes an opcode and an operation field; the memory stores vectors. The control unit is configured to acquire an extended instruction, parse it to obtain a vector operation instruction and a second operation instruction, determine the calculation order of the two instructions, and read the input vector corresponding to the input vector address from the memory. The arithmetic unit is configured to execute the vector operation instruction and the second operation instruction on the input vector in the determined calculation order to obtain the result of the extended instruction. The technical scheme provided by the invention offers low power consumption and low calculation overhead.
Description
Technical Field
The present invention relates to the field of communications technologies, and in particular, to an arithmetic device and a related product.
Background
In modern general-purpose and special-purpose processors, computational instructions (e.g., vector instructions) are increasingly introduced to perform operations. A vector instruction causes the processor to perform a vector or matrix operation, such as vector addition and subtraction, vector inner product, matrix multiplication, or matrix convolution; at least one input of a vector instruction is a vector or matrix, or its result is a vector or matrix. Vector instructions can compute in parallel by invoking the processor's vector processing units, improving operation speed. In existing vector instructions, the vectors or matrices among the operands or results are generally of fixed scale; for example, a vector instruction in the Neon vector extension of an ARM processor can process at a time either a vector of four 32-bit floating-point values or a vector of eight 16-bit fixed-point values.
Conventional vector operation instructions therefore cannot operate on variable-scale vectors or matrices, and each instruction can realize only a single operation: for example, one vector instruction can perform either a multiplication or an addition, but not two or more operations. As a result, conventional vector operation incurs high operation overhead and high energy consumption.
Disclosure of Invention
The embodiments of the present invention provide an arithmetic device and related products that enable a single operation instruction to realize multiple operations, reducing operation overhead and module power consumption.
In a first aspect, an embodiment of the present invention provides an arithmetic device for performing an operation according to an extended instruction, the arithmetic device comprising: a memory, an arithmetic unit, and a control unit;
the extended instruction includes an opcode and an operation field. The opcode includes an identifier that identifies a vector calculation instruction; the operation field includes: the input vector address of the vector calculation instruction, the output vector address of the vector calculation instruction, an identifier of the second calculation instruction, the input data of the second calculation instruction, the data type, and the data length N;
a memory for storing vectors;
the control unit is configured to acquire the extended instruction, parse it to obtain a vector operation instruction and a second operation instruction, determine the calculation order of the vector operation instruction and the second operation instruction, and read the input vector corresponding to the input vector address from the memory;
and the operation unit is configured to execute the vector operation instruction and the second operation instruction on the input vector in the calculation order to obtain the result of the extended instruction.
Optionally, the operation device further includes:
and the register unit is used for storing the extended instruction to be executed.
Optionally, the control unit includes:
the instruction fetching module is used for acquiring an extended instruction from the register unit;
the decoding module is used for decoding the obtained extended instruction to obtain a vector operation instruction, a second operation instruction and a calculation sequence;
and the instruction queue is used for storing the decoded vector operation instruction and the second operation instruction according to the calculation sequence.
Optionally, the operation device further includes:
the dependency relationship processing unit is configured to determine, before the control unit acquires an extended instruction, whether that extended instruction accesses the same vector as the previous extended instruction; if so, it provides the vector operation instruction and the second operation instruction of the current extended instruction to the operation unit only after the previous extended instruction has finished executing; otherwise, it provides them to the operation unit immediately.
Optionally, if the current extended instruction and the previous extended instruction access the same vector, the dependency processing unit stores the current extended instruction in a storage queue, and after the previous extended instruction is executed, provides the current extended instruction in the storage queue to the control unit.
Optionally, the memory is a scratch pad memory.
Optionally, the operation unit includes a vector addition circuit, a vector multiplication circuit, a size comparison circuit, a nonlinear operation circuit, a vector scalar multiplication circuit, and an activation circuit.
Optionally, the operation unit has a multi-stage pipeline structure, wherein the vector multiplication circuit and the vector scalar multiplication circuit are in the first pipeline stage, the size comparison circuit and the vector addition circuit are in the second pipeline stage, and the nonlinear operation circuit and the activation circuit are in the third pipeline stage; the output data of the first pipeline stage is the input data of the second pipeline stage, and the output data of the second pipeline stage is the input data of the third pipeline stage.
Optionally, the control unit is specifically configured to: identify whether the output data of the vector operation instruction is the same as the input data of the second calculation instruction, and if so, determine the calculation order to be forward-order calculation; identify whether the input data of the vector operation instruction is the same as the output data of the second calculation instruction, and if so, determine the calculation order to be reverse-order calculation; and identify whether the input data of the vector operation instruction is associated with the output data of the second calculation instruction, and if not, determine the calculation order to be out-of-order calculation.
In a second aspect, a chip is provided, where the chip integrates the arithmetic device provided in the first aspect.
In a third aspect, an electronic device is provided, which includes the chip provided in the second aspect.
It can be seen that the extended instruction provided by the embodiments of the present invention strengthens the functionality of a single instruction, replacing several original instructions with one. This reduces the number of instructions needed for complex vector and matrix operations and simplifies the use of vector instructions; compared with using multiple instructions, it avoids storing intermediate results, saving storage space and avoiding additional read/write overhead.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description are some embodiments of the present invention.
Fig. 1A is a schematic structural diagram of an arithmetic device provided in the present invention.
FIG. 1 is a flow chart of a method of implementing an extended instruction of the present invention.
FIG. 2 is a schematic diagram of a structure of an arithmetic unit according to the present invention.
Fig. 3A is a schematic structural diagram of a control unit provided in the present invention.
Fig. 3 is a schematic structural diagram of a neural network processor board card according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a neural network chip package structure according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a neural network chip according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of a neural network chip package structure according to an embodiment of the present disclosure;
fig. 6A is a schematic diagram of another neural network chip package structure according to an embodiment of the present application.
The dashed components in the drawings represent optional.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of the invention and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. The "/" herein may mean "or".
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The following describes a vector dot-product method, taking a CPU as an example. The vector dot product computes the inner product of two vectors. Function description: given vectors x, y of length n and a scalar r, the following vector-vector operation is performed:

r := x1*y1 + x2*y2 + ... + xn*yn
for vector DOT product, the instruction of the vector DOT product may be "DOT TYPE, N, X, Y, R"; where DOT represents a vector DOT product instruction, type represents the type of data that can be manipulated (e.g., real or complex), N represents the length of the vector, X represents the first address of vector X, Y represents the first address of vector Y, and R is a scalar. As described above, the vector dot product instruction can only implement one type of operation, i.e., the operation of implementing the vector dot product, and cannot implement multiple operations, e.g., the two operations of reading the vector dot product and the discrete data.
As shown in fig. 1A, the arithmetic device includes: a memory 111, a register 112 (optional), an arithmetic unit 114, a control unit 115, and a dependency processing unit 116 (optional);
as shown in FIG. 2, the operation unit 114 includes: conversion circuitry (optional), vector addition circuitry, vector multiplication circuitry, size comparison circuitry, non-linear operation circuitry, vector scalar multiplication circuitry and activation circuitry.
The arithmetic unit has a multi-stage pipeline structure, as shown in fig. 2. The first pipeline stage includes, but is not limited to, vector multiplication circuits, vector scalar multiplication circuits, and the like.
The second pipeline stage includes, but is not limited to: magnitude comparison calculators (e.g., comparators), vector addition circuits, and the like.
The third pipeline stage includes, but is not limited to, nonlinear operation components (specifically, an activation circuit or a transcendental-function calculation circuit, etc.), and the like.
If the arithmetic unit includes a conversion circuit, the conversion circuit may be placed in the first or the third pipeline stage.
The output data of the first pipeline stage is the input data of the second pipeline stage, and the output data of the second pipeline stage is the input data of the third pipeline stage. The input of the first pipeline stage may be input data (e.g., an input vector), and the output of the third pipeline stage may be the calculation result.
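The three-stage dataflow above can be sketched as follows. The stage functions and the choice of ReLU as the nonlinear operation are illustrative assumptions; the patent's circuits operate on hardware vectors, not Python lists.

```python
# Illustrative sketch of the three-stage pipeline: multiplication feeds
# addition/comparison, which feeds the nonlinear/activation stage.

def stage1_multiply(x, y):
    # first pipeline stage: elementwise vector multiplication
    return [a * b for a, b in zip(x, y)]

def stage2_add(v, w):
    # second pipeline stage: vector addition
    return [a + b for a, b in zip(v, w)]

def stage3_activate(v):
    # third pipeline stage: a nonlinear operation (ReLU as a stand-in)
    return [max(0.0, a) for a in v]

# Each stage's output is the next stage's input, as in the text.
x, y, bias = [1.0, -2.0], [3.0, 4.0], [0.5, 0.5]
result = stage3_activate(stage2_add(stage1_multiply(x, y), bias))
print(result)  # [3.5, 0.0]
```

This composition is why a single extended instruction such as "multiply, add, then activate" maps naturally onto the hardware without writing intermediate results back to memory.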
The present invention also provides an extended instruction including an opcode and an operation field. The opcode includes an identifier (e.g., ROT) that identifies the first operation instruction; the operation field includes: the input data address of the first calculation instruction, the output data address of the first calculation instruction, an identifier of the second calculation instruction, the input data of the second calculation instruction, the data type, and the data length N.
Optionally, the extended instruction may further include: a third computation instruction and input data of the third computation instruction.
It should be noted that the calculation instruction may be a vector operation instruction or a matrix instruction, and the embodiments of the present invention do not limit the specific expression of the calculation instruction.
The arithmetic device may be configured to execute an extended instruction, and specifically includes:
a memory for storing vectors;
the control unit is configured to acquire the extended instruction, parse it to obtain a vector operation instruction and a second operation instruction, determine the calculation order of the vector operation instruction and the second operation instruction, and read the input vector corresponding to the input vector address from the memory;
the operation unit is configured to execute the vector operation instruction and the second operation instruction on the input vector in the calculation order to obtain the result of the extended instruction.
And the register unit is used for storing the extended instruction to be executed.
Optionally, as shown in fig. 3A, the control unit 115 may include:
the instruction fetching module is used for acquiring an extended instruction from the register unit;
the decoding module is used for decoding the obtained extended instruction to obtain a vector operation instruction, a second operation instruction and a calculation sequence;
and the instruction queue is used for storing the decoded vector operation instruction and the second operation instruction according to the calculation sequence.
The dependency relationship processing unit 116 is configured to determine, before the control unit acquires an extended instruction, whether that extended instruction accesses the same vector as the previous extended instruction. If so, it waits for the previous extended instruction to finish executing before providing the vector operation instruction and the second operation instruction of the current extended instruction to the operation unit; otherwise, it provides them to the operation unit immediately.
The dependency processing unit 116 is further configured to store the current extended instruction in a storage queue when the current extended instruction and a previous extended instruction access the same vector, and provide the current extended instruction in the storage queue to the control unit after the previous extended instruction is executed.
Optionally, the memory is a scratch pad memory.
Referring to fig. 1, fig. 1 provides an implementation method of an extended instruction. The extended instruction in this method may include an opcode and an operation field; the first arithmetic instruction may be a vector arithmetic instruction, such as AXPY. The opcode includes an identifier (e.g., AXPY) identifying the first arithmetic instruction; the operation field includes: the input data address of the first calculation instruction, the output data address of the first calculation instruction, an identifier of the second calculation instruction, the input data of the second calculation instruction, the data type, and the data length N (a user-set value; the invention does not limit the specific form of N). The method is executed by an arithmetic device or a computing chip, the arithmetic device being shown in fig. 1A. The method, shown in fig. 1, includes the following steps:
Step S101: the arithmetic device acquires an extended instruction and parses it to obtain a first calculation instruction and a second calculation instruction;
Step S102: the arithmetic device determines a calculation order according to the first calculation instruction and the second calculation instruction, and executes the first calculation instruction and the second calculation instruction in that order to obtain the result of the extended instruction.
The technical scheme provided by the invention provides an implementation method of the extended instruction, so that an arithmetic device can execute the calculation of two calculation instructions on the extended instruction, and a single extended instruction can realize two types of calculation, thereby reducing the calculation overhead and reducing the power consumption.
Optionally, the calculation order may be out-of-order, forward-order, or reverse-order calculation. Out-of-order calculation means the first calculation instruction and the second calculation instruction have no required execution order; forward-order calculation executes the first calculation instruction and then the second; reverse-order calculation executes the second calculation instruction and then the first.
The arithmetic device may determine the calculation order as follows: it identifies whether the output data of the first calculation instruction is the same as the input data of the second calculation instruction; if so, the order is forward-order calculation. Otherwise, it identifies whether the input data of the first calculation instruction is the same as the output data of the second calculation instruction; if so, the order is reverse-order calculation. Finally, if the input data of the first calculation instruction is not associated with the output data of the second calculation instruction, the order is out-of-order calculation.
Specifically, for example, for F = A × B + C, the first calculation instruction is a matrix multiplication instruction and the second calculation instruction is a matrix addition instruction. Since the matrix addition requires the result (i.e., the output data) of the first calculation instruction, the calculation order is determined to be forward-order. In another example, if the first operation instruction is a matrix multiplication instruction and the second is a transformation such as transposition or conjugation, the first operation instruction uses the output of the second, so the order is reverse. If no such association exists, that is, the output data of the first calculation instruction differs from the input data of the second, and the input data of the first differs from the output data of the second, then there is no dependency and the order is out-of-order.
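The order-determination rule above can be sketched as follows. Representing each instruction by named input and output operands is an illustrative assumption; the device compares addresses rather than names.

```python
# Minimal sketch of the calculation-order rule: forward if the second
# instruction consumes the first's output, reverse if the first consumes
# the second's output, out-of-order if there is no data dependency.

def calc_order(first, second):
    """first/second: dicts with 'inputs' (a set of names) and 'output'."""
    if first["output"] in second["inputs"]:
        return "forward"       # second consumes the first's result
    if second["output"] in first["inputs"]:
        return "reverse"       # first consumes the second's result
    return "out-of-order"      # no dependency in either direction

# F = A*B + C: the multiply's result T feeds the add -> forward order.
mul = {"inputs": {"A", "B"}, "output": "T"}
add = {"inputs": {"T", "C"}, "output": "F"}
print(calc_order(mul, add))  # forward
```

Running the same check with a transpose instruction whose output feeds the multiply would return "reverse", matching the second example in the text.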
The vector instruction provided by the invention is expanded, the function of the instruction is strengthened, and one instruction replaces a plurality of original instructions. Therefore, the number of instructions required by complex vector and matrix operation is reduced, and the use of vector instructions is simplified; compared with a plurality of instructions, the method does not need to store intermediate results, saves storage space, and avoids additional reading and writing expenses.
If the first calculation instruction is a vector instruction, for an input vector or matrix in the vector instruction, the instruction adds a function of scaling the input vector or matrix, i.e. adds an operand representing a scaling coefficient in the operation domain, and when the vector is read in, the vector is firstly scaled according to the scaling coefficient (i.e. the second calculation instruction is a scaling instruction). If there are operations in a vector instruction that multiply multiple input vectors or matrices, the scaling coefficients corresponding to these input vectors or matrices may be combined into one.
If the first calculation instruction is a vector instruction, the instruction adds a function of transposing an input matrix in the vector instruction (i.e., the second calculation instruction is a transpose instruction). An operand indicating whether to transpose the matrix is added to the instruction, and the operand indicates whether to transpose the matrix before operation.
If the first calculation instruction is a vector instruction, for the output vector or matrix the instruction adds a function of accumulating onto the original output vector or matrix (i.e., the second calculation instruction is an addition instruction). A coefficient for scaling the original output vector or matrix is also added to the instruction (i.e., a third calculation instruction is added, which may be a scaling instruction), indicating that after the vector or matrix operation is performed, the result is added to the scaled original output to form the new output.
If the first computing instruction is a vector instruction, the instruction adds a function to read in fixed steps to the input vector in the vector instruction. An operand representing the read step size of the input vector is added to the instruction (i.e. the second calculation instruction is a read vector in a fixed step size), representing the difference between the addresses of two adjacent elements in the vector.
If the first computing instruction is a vector instruction, the instruction adds a function of writing the result in fixed steps to the result vector in the vector instruction (i.e., the second computing instruction writes the vector in fixed steps). An operand representing the read step size of the result vector is added to the instruction, representing the difference between the addresses of two adjacent elements in the vector. If a vector is both an input and a result, the same step size is used for the vector as an input and as a result.
If the first computing instruction is a vector instruction, the instruction adds the function of reading row or column vectors in fixed steps to the input matrix in the vector instruction (i.e., the second computing instruction is reading multiple vectors in fixed steps). An operand representing a matrix read step size is added to the instruction, representing the difference in the first address between matrix row or column vectors.
If the first calculation instruction is a vector instruction, for the result matrix the instruction adds a function of writing row or column vectors at a fixed stride (i.e., the second calculation instruction writes multiple vectors at a fixed stride). An operand representing the matrix write stride is added to the instruction, representing the difference between the first addresses of the matrix's row or column vectors. If a matrix is both an input and a result matrix, the same stride is used for it as input and as result.
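The fixed-stride read and write extensions above can be sketched as follows: the stride operand gives the address difference between adjacent elements (or between row/column first addresses for matrices). The function names and flat-memory model are illustrative assumptions.

```python
# Sketch of fixed-stride vector access: gather n elements starting at
# addr, stride apart, and scatter a result vector back the same way.

def read_strided(memory, addr, n, stride):
    # e.g. stride == row length reads one column of a row-major matrix
    return [memory[addr + i * stride] for i in range(n)]

def write_strided(memory, addr, stride, values):
    # write the result vector with the same stride convention
    for i, v in enumerate(values):
        memory[addr + i * stride] = v

mem = list(range(10))
col = read_strided(mem, 1, 3, 3)    # elements at addresses 1, 4, 7
print(col)                          # [1, 4, 7]
write_strided(mem, 0, 2, [9, 9, 9]) # writes to addresses 0, 2, 4
print(mem[:6])                      # [9, 1, 9, 3, 9, 5]
```

Folding this addressing into the extended instruction is what avoids a separate gather/scatter pass and the intermediate buffer it would require.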
The concrete structure of the extended instructions described above is illustrated below with several actual extended instructions.
Vector multiply add
Calculating the product of a vector and a scalar and adding the result to another vector
Description of the function:
Given vectors x and y and a scalar a, the instruction performs the following vector-vector operation:
y := a*x + y
The instruction format is shown in Table 1-1:
Table 1-1:

The length of the vectors in the instruction format shown in Table 1-1 is variable, which reduces the number of instructions and simplifies their use.
Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
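The vector multiply-add operation y := a*x + y can be sketched as a reference model (illustrative only; the hardware executes it as a single extended instruction):

```python
def vector_multiply_add(a, x, y):
    """y := a*x + y, computed elementwise over vectors of equal length."""
    return [a * xi + yi for xi, yi in zip(x, y)]

vector_multiply_add(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# → [12.0, 24.0, 36.0]
```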
Vector dot product
Computing a dot product of a vector and a vector
Description of the function: given vectors x and y of fixed length n, the instruction computes the scalar r by the following vector-vector operation:
r := x[0]*y[0] + x[1]*y[1] + ... + x[n-1]*y[n-1]
The instruction format is shown in Table 1-2:
Table 1-2:

The variable vector length in the instruction format shown in Table 1-2 reduces the number of instructions and simplifies their use.
Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
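A reference model of the dot-product reduction above (illustrative only; function name is hypothetical):

```python
def vector_dot(x, y):
    """r := x[0]*y[0] + x[1]*y[1] + ... + x[n-1]*y[n-1]"""
    return sum(xi * yi for xi, yi in zip(x, y))

vector_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # → 32.0
```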
Norm of vector
Computing Euclidean norms of vectors
Description of the function:
The instruction performs the following vector reduction operation:
r := sqrt(x[0]^2 + x[1]^2 + ... + x[n-1]^2)
The instruction format is shown in Table 1-3:
Table 1-3:

The variable vector length in the instruction format shown in Table 1-3 reduces the number of instructions and simplifies their use. Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
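The Euclidean-norm reduction can be sketched as follows (illustrative reference model, not the hardware implementation):

```python
import math

def vector_norm(x):
    """Euclidean (L2) norm: the square root of the sum of squared elements."""
    return math.sqrt(sum(xi * xi for xi in x))

vector_norm([3.0, 4.0])  # → 5.0
```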
Vector addition
Computing the sum of the addition of all elements of a vector
Description of the function:
The instruction performs the following vector reduction operation:
r := x[0] + x[1] + ... + x[n-1]
The instruction format is shown in Table 1-4:
Table 1-4:

The variable vector length in the instruction format shown in Table 1-4 reduces the number of instructions and simplifies their use.
Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
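The vector-addition reduction above sums all elements into one scalar; a minimal reference model (illustrative only):

```python
def vector_sum(x):
    """r := x[0] + x[1] + ... + x[n-1], a reduction over the whole vector."""
    r = 0.0
    for xi in x:
        r += xi
    return r

vector_sum([1.0, 2.0, 3.0, 4.0])  # → 10.0
```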
Maximum value of vector
Calculating the position of the largest element among all elements of the vector
Description of the function:
For a vector x of length n, the instruction writes the position (index) of the largest element of the vector into the scalar i.
The instruction format is shown in Table 1-5:
Table 1-5:

The variable vector length in the instruction format shown in Table 1-5 reduces the number of instructions and simplifies their use. Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
Vector minimum
Calculating the position of the smallest element among all elements of the vector
Description of the function:
For a vector x of length n, the instruction writes the position (index) of the smallest element of the vector into the scalar i.
The instruction format is shown in Table 1-6:
Table 1-6:

The variable vector length in the instruction format shown in Table 1-6 reduces the number of instructions and simplifies their use. Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
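The vector-maximum and vector-minimum instructions above both return an element position rather than a value. Their behavior can be sketched together as follows (an illustrative reference model; tie-breaking to the first occurrence is an assumption, since the text does not specify it):

```python
def vector_argmax(x):
    """Index of the largest element of x (first occurrence on ties)."""
    i_best = 0
    for i in range(1, len(x)):
        if x[i] > x[i_best]:
            i_best = i
    return i_best

def vector_argmin(x):
    """Index of the smallest element of x (first occurrence on ties)."""
    i_best = 0
    for i in range(1, len(x)):
        if x[i] < x[i_best]:
            i_best = i
    return i_best

vector_argmax([3.0, 7.0, 5.0])  # → 1
vector_argmin([3.0, 7.0, 5.0])  # → 0
```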
Vector outer product
Computing the tensor product of two vectors (outer product)
Description of the function:
the instruction performs the following matrix vector operations
A := α * x * y^T + A
The instruction format is shown in Table 1-7:
Table 1-7:

In the instruction format shown in Table 1-7, scaling the result matrix by the scalar α increases the flexibility of the instruction and avoids the extra overhead of a separate scaling instruction. The sizes of the vectors and the matrix are variable, which reduces the number of instructions and simplifies their use. Matrices in different storage formats (row-major and column-major order) can be processed, avoiding the overhead of converting between them. Vectors stored at fixed address intervals (strided vectors) are supported, avoiding the execution overhead of converting the vector format and the space overhead of storing intermediate results.
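The vector outer-product instruction performs a scaled rank-1 update, A := α * x * yᵀ + A; a minimal reference model (illustrative only; the function name is hypothetical, and the hardware may operate on strided row-major or column-major storage rather than nested lists):

```python
def outer_product_update(alpha, x, y, a):
    """A := alpha * x * y^T + A, a rank-1 update on an m-by-n matrix."""
    m, n = len(x), len(y)
    return [[alpha * x[i] * y[j] + a[i][j] for j in range(n)]
            for i in range(m)]

a = [[0.0, 0.0], [1.0, 1.0]]
outer_product_update(2.0, [1.0, 2.0], [3.0, 4.0], a)
# → [[6.0, 8.0], [13.0, 17.0]]
```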
The arithmetic device shown in fig. 1 implements a specific structure for computing the extended instruction when it is executed; that is, it achieves the effect of a combination of multiple calculation instructions by executing a single extended instruction. It should be noted that the arithmetic device does not split the extended instruction into multiple calculation instructions when executing it.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a neural network processor board card according to an embodiment of the present disclosure. As shown in fig. 3, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate (substrate) 13.
The specific structure of the neural network chip package structure 11 is not limited in this application, and optionally, as shown in fig. 4, the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.
The specific form of the neural network chip 111 is not limited in the present application; the neural network chip 111 includes, but is not limited to, a neural network wafer integrating a neural network processor, and the wafer may be made of silicon, germanium, quantum, or molecular materials. The neural network chip can be packaged according to practical conditions (such as a harsh operating environment) and different application requirements, so that most of the chip is encapsulated while the pins on the chip are connected to the outside of the package structure through conductors such as gold wires for circuit connection with the next outer layer.
The specific structure of the neural network chip 111 is not limited in the present application, and optionally, please refer to fig. 1A, where fig. 1A is a schematic structural diagram of a computing device in the neural network chip according to an embodiment of the present application. As shown in fig. 1A, the computing apparatus includes: memory 111, register 112 (optional), arithmetic unit 114, control unit 115, and dependency processing unit 116 (optional). The specific functions or structures of the above units can be seen from the embodiment shown in fig. 1A.
The type of the first substrate 13 and the second substrate 113 is not limited in this application, and may be a Printed Circuit Board (PCB) or a Printed Wiring Board (PWB), and may be other circuit boards. The material of the PCB is not limited.
The second substrate 113 according to the present invention is used for carrying the neural network chip 111, and the neural network chip package structure 11 obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112 is used for protecting the neural network chip 111 and facilitating further packaging of the neural network chip package structure 11 and the first substrate 13.
The specific packaging method and the corresponding structure of the second electrical and non-electrical connection device 112 are not limited; an appropriate packaging method can be selected, and simply adapted, according to actual conditions and different application requirements, for example: Flip Chip Ball Grid Array (FCBGA) packages, Low-profile Quad Flat Packages (LQFP), Quad Flat Packages with Heat sink (HQFP), Quad Flat No-lead Packages (QFN), or Fine-pitch Ball Grid Array (FBGA) packages.
The Flip Chip (Flip Chip) is suitable for the conditions of high requirements on the area after packaging or sensitivity to the inductance of a lead and the transmission time of a signal. In addition, a Wire Bonding (Wire Bonding) packaging mode can be used, so that the cost is reduced, and the flexibility of a packaging structure is improved.
A Ball Grid Array (BGA) can provide more pins, with a short average lead length and the capability of transmitting signals at high speed. The package may alternatively be a Pin Grid Array (PGA), Zero Insertion Force (ZIF) socket, Single Edge Contact Connection (SECC), Land Grid Array (LGA), or the like.
Optionally, the neural network Chip 111 and the second substrate 113 are packaged in a Flip Chip Ball Grid Array (Flip Chip Ball Grid Array) packaging manner, and a schematic diagram of a specific neural network Chip packaging structure may refer to fig. 6. As shown in fig. 6, the neural network chip package structure includes: the neural network chip 21, the bonding pad 22, the solder ball 23, the second substrate 24, the connection point 25 on the second substrate 24, and the pin 26.
The bonding pads 22 are connected to the neural network chip 21, and the solder balls 23 are formed between the bonding pads 22 and the connection points 25 on the second substrate 24 by soldering, so that the neural network chip 21 and the second substrate 24 are connected, that is, the package of the neural network chip 21 is realized.
The pins 26 are used for connecting with an external circuit of the package structure (for example, the first substrate 13 on the neural network processor board 10), so as to realize transmission of external data and internal data, and facilitate processing of data by the neural network chip 21 or a neural network processor corresponding to the neural network chip 21. The type and number of the pins are not limited in the present application, and different pin forms can be selected according to different packaging technologies and arranged according to certain rules.
Optionally, the neural network chip package structure further includes an insulating filler disposed in the gaps between the pads 22, the solder balls 23, and the connection points 25, for preventing interference between adjacent solder balls.
Wherein, the material of the insulating filler can be silicon nitride, silicon oxide or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
Optionally, the neural network chip package structure further includes a heat dissipation device for dissipating the heat generated during operation of the neural network chip 21. The heat dissipation device may be a metal plate with good thermal conductivity, a heat sink, or a cooling device such as a fan.
For example, as shown in fig. 6A, the neural network chip package structure 11 includes: the neural network chip 21, the bonding pad 22, the solder ball 23, the second substrate 24, the connection point 25 on the second substrate 24, the pin 26, the insulating filler 27, the thermal grease 28 and the metal housing heat sink 29. The heat dissipation paste 28 and the metal case heat dissipation sheet 29 are used to dissipate heat generated during operation of the neural network chip 21.
Optionally, the neural network chip package structure 11 further includes a reinforcing structure connected to the bonding pad 22 and embedded in the solder ball 23 to enhance the connection strength between the solder ball 23 and the bonding pad 22.
The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited herein.
The specific form of the first electrical and non-electrical connection device 12 is not limited in the present application, and reference may be made to the description of the second electrical and non-electrical connection device 112. That is, the neural network chip package structure 11 may be attached by soldering, or a connecting-wire or plug-in connection may be adopted between the second substrate 113 and the first substrate 13, to facilitate subsequent replacement of the first substrate 13 or the neural network chip package structure 11.
Optionally, the first substrate 13 includes an interface for a memory unit for expanding the storage capacity, for example: Synchronous Dynamic Random Access Memory (SDRAM) or Double Data Rate SDRAM (DDR SDRAM), which improves the processing capability of the neural network processor by expanding the memory.
The first substrate 13 may further include a Peripheral Component Interconnect Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) interface, and the like, for data transmission between the package structure and external circuits, which can improve the operation speed and the convenience of operation.
The neural network processor is packaged into the neural network chip 111, the neural network chip 111 is packaged into the neural network chip package structure 11, and the neural network chip package structure 11 is packaged into the neural network processor board card 10, which exchanges data with external circuits (for example, a computer motherboard) through an interface (a slot or ferrule) on the board card. That is, the neural network processor board card 10 directly realizes the function of the neural network processor while protecting the neural network chip 111. Other modules can also be added to the neural network processor board card 10, which broadens the application range and improves the operation efficiency of the neural network processor.
In one embodiment, the present disclosure discloses an electronic device comprising the above neural network processor board card 10 or the neural network chip package 11.
Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, cell phones, tachographs, navigators, sensors, cameras, servers, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, home appliances, and/or medical devices.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules illustrated are not necessarily required to practice the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The above embodiments of the present invention are described in detail, and the principle and the implementation of the present invention are explained by applying specific embodiments, and the above description of the embodiments is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (11)
1. An arithmetic device for performing an operation in accordance with an extended instruction, the arithmetic device comprising: a memory, an arithmetic unit and a control unit;
the extended instruction includes: an opcode and an operation domain, the opcode comprising: identifying an identification of a vector computation instruction; the operation domain includes: an input vector address of the vector calculation instruction, an output vector address of the vector calculation instruction, an identifier of the second calculation instruction, input data of the second calculation instruction, a data type and a data length N;
a memory for storing vectors;
the control unit is used for acquiring an extended instruction, analyzing the extended instruction to obtain a vector operation instruction and a second operation instruction, determining the calculation sequence of the vector operation instruction and the second operation instruction according to the vector operation instruction and the second operation instruction, and reading an input vector corresponding to the input vector address from a memory;
the control unit is specifically configured to identify whether output data of the vector operation instruction is the same as input data of the second calculation instruction, and if so, determine that the calculation order is a positive order calculation; identifying whether the input data of the vector operation instruction is the same as the output data of the second calculation instruction, and if so, determining that the calculation sequence is reverse calculation; identifying whether input data of the vector operation instruction is associated with output data of the second calculation instruction or not, and if not, determining that the calculation sequence is unordered calculation;
the arithmetic unit is used for executing the vector operation instruction and the second operation instruction on the input vector according to the calculation order to obtain the result of the extended instruction, wherein if the calculation order is positive-order calculation, the vector operation instruction is executed first and then the second operation instruction; if the calculation order is reverse-order calculation, the second operation instruction is executed first and then the vector operation instruction; and if the calculation order is out-of-order calculation, the vector operation instruction and the second operation instruction have no required execution order.
2. The arithmetic device of claim 1, further comprising:
and the register unit is used for storing the extended instruction to be executed.
3. The arithmetic device according to claim 2, wherein the control unit includes:
the instruction fetching module is used for acquiring an extended instruction from the register unit;
the decoding module is used for decoding the obtained extended instruction to obtain a vector operation instruction, a second operation instruction and a calculation sequence;
and the instruction queue is used for storing the decoded vector operation instruction and the second operation instruction according to the calculation sequence.
4. The arithmetic device of claim 3, further comprising:
the dependency relationship processing unit is used for judging, before the control unit acquires the extended instruction, whether the extended instruction accesses the same vector as a previous extended instruction; if so, the vector operation instruction and the second operation instruction of the current extended instruction are provided to the operation unit only after the previous extended instruction has been completely executed; otherwise, the vector operation instruction and the second operation instruction are provided to the operation unit directly.
5. The computing device of claim 4, wherein the dependency processing unit is configured to store a current extended instruction in a store queue when the current extended instruction and a previous extended instruction access the same vector, and to provide the current extended instruction in the store queue to the control unit after the previous extended instruction is executed.
6. The computing device of any of claims 1-5, wherein the memory is a scratch pad memory.
7. The arithmetic device of claim 1, wherein the arithmetic unit comprises a vector addition circuit, a vector multiplication circuit, a size comparison circuit, a non-linear arithmetic circuit, and a vector scalar multiplication circuit.
8. The arithmetic device according to claim 7, wherein the arithmetic unit has a multi-stage pipeline structure, wherein the vector multiplication circuit and the vector scalar multiplication circuit are in a first pipeline stage, the size comparison circuit and the vector addition circuit are in a second pipeline stage, and the non-linear arithmetic circuit is in a third pipeline stage, wherein output data of the first pipeline stage is input data of the second pipeline stage, and output data of the second pipeline stage is input data of the third pipeline stage.
9. The arithmetic device according to claim 8, wherein the arithmetic unit further comprises a conversion circuit, and the conversion circuit is located at the first pipeline stage and the third pipeline stage, or at the first pipeline stage only, or at the third pipeline stage only.
10. A chip incorporating an arithmetic device as claimed in any one of claims 1 to 9.
11. An electronic device, characterized in that the electronic device comprises a chip according to claim 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711244055.7A CN107861757B (en) | 2017-11-30 | 2017-11-30 | Arithmetic device and related product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107861757A CN107861757A (en) | 2018-03-30 |
CN107861757B true CN107861757B (en) | 2020-08-25 |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108388446A (en) * | 2018-02-05 | 2018-08-10 | 上海寒武纪信息科技有限公司 | Computing module and method |
CN110413561B (en) * | 2018-04-28 | 2021-03-30 | 中科寒武纪科技股份有限公司 | Data acceleration processing system |
EP3796189A4 (en) | 2018-05-18 | 2022-03-02 | Cambricon Technologies Corporation Limited | Video retrieval method, and method and apparatus for generating video retrieval mapping relationship |
CN110147872B (en) * | 2018-05-18 | 2020-07-17 | 中科寒武纪科技股份有限公司 | Code storage device and method, processor and training method |
CN109032670B (en) * | 2018-08-08 | 2021-10-19 | 上海寒武纪信息科技有限公司 | Neural network processing device and method for executing vector copy instruction |
CN110929855B (en) * | 2018-09-20 | 2023-12-12 | 合肥君正科技有限公司 | Data interaction method and device |
CN111061507A (en) * | 2018-10-16 | 2020-04-24 | 上海寒武纪信息科技有限公司 | Operation method, operation device, computer equipment and storage medium |
CN110096310B (en) * | 2018-11-14 | 2021-09-03 | 上海寒武纪信息科技有限公司 | Operation method, operation device, computer equipment and storage medium |
CN111353124A (en) * | 2018-12-20 | 2020-06-30 | 上海寒武纪信息科技有限公司 | Operation method, operation device, computer equipment and storage medium |
CN111290788B (en) * | 2018-12-07 | 2022-05-31 | 上海寒武纪信息科技有限公司 | Operation method, operation device, computer equipment and storage medium |
CN111290789B (en) * | 2018-12-06 | 2022-05-27 | 上海寒武纪信息科技有限公司 | Operation method, operation device, computer equipment and storage medium |
CN111275197B (en) * | 2018-12-05 | 2023-11-10 | 上海寒武纪信息科技有限公司 | Operation method, device, computer equipment and storage medium |
CN109711539B (en) * | 2018-12-17 | 2020-05-29 | 中科寒武纪科技股份有限公司 | Operation method, device and related product |
US11841822B2 (en) | 2019-04-27 | 2023-12-12 | Cambricon Technologies Corporation Limited | Fractal calculating device and method, integrated circuit and board card |
CN111860799A (en) * | 2019-04-27 | 2020-10-30 | 中科寒武纪科技股份有限公司 | Arithmetic device |
CN117707992A (en) * | 2022-09-13 | 2024-03-15 | 华为技术有限公司 | Data operation method, device and equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5337395A (en) * | 1991-04-08 | 1994-08-09 | International Business Machines Corporation | SPIN: a sequential pipeline neurocomputer |
US6304963B1 (en) * | 1998-05-14 | 2001-10-16 | Arm Limited | Handling exceptions occuring during processing of vector instructions |
CN1349159A (en) * | 2001-11-28 | 2002-05-15 | 中国人民解放军国防科学技术大学 | Vector processing method of microprocessor |
CN105359052A (en) * | 2012-12-28 | 2016-02-24 | 英特尔公司 | Method and apparatus for integral image computation instructions |
CN106990940A (en) * | 2016-01-20 | 2017-07-28 | 南京艾溪信息科技有限公司 | A kind of vector calculation device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2069947A4 (en) * | 2006-09-26 | 2013-10-09 | Qualcomm Inc | Software implementation of matrix inversion in a wireless communication system |
US9250916B2 (en) * | 2013-03-12 | 2016-02-02 | International Business Machines Corporation | Chaining between exposed vector pipelines |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||