CN108037908B - Calculation method and related product - Google Patents

Info

Publication number
CN108037908B
Authority
CN
China
Prior art keywords
matrix
instruction
result
unit
operation instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711362568.8A
Other languages
Chinese (zh)
Other versions
CN108037908A (en
Inventor
胡帅
刘恩赫
张尧
孟小甫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN201711362568.8A priority Critical patent/CN108037908B/en
Publication of CN108037908A publication Critical patent/CN108037908A/en
Application granted granted Critical
Publication of CN108037908B publication Critical patent/CN108037908B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/76Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
    • G06F7/78Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data for changing the order of data flow, e.g. matrix transposition or LIFO buffers; Overflow or underflow handling therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

The present disclosure provides an information processing method applied in a computing device that comprises a storage medium, a register unit and a matrix operation unit. The method comprises the following steps: the computing device controls the matrix operation unit to obtain a first operation instruction, wherein the first operation instruction comprises a matrix read indication identifying the matrix required to execute the instruction; the computing device controls the matrix operation unit to send a read command to the storage medium according to the matrix read indication; and the computing device controls the matrix operation unit to read the matrix corresponding to the matrix read indication in a batch reading mode and to execute the first operation instruction on the matrix. The technical scheme provided by the application has the advantages of high calculation speed and high efficiency.

Description

Calculation method and related product
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a computing method and a related product.
Background
Data processing is a step or stage that most algorithms must carry out. Since computers were introduced into the field of data processing, more and more data processing has been performed by computers. With existing algorithms, however, computing devices are slow and inefficient when performing matrix data computation.
Content of application
The embodiment of the application provides a computing method and a related product, which can improve the processing speed of a computing device and improve the efficiency.
In a first aspect, a computing method is provided, which is applied in a computing device, where the computing device includes a storage medium, a register unit, and a matrix operation unit, and the method includes:
the computing device controls the matrix operation unit to obtain a first operation instruction, where the first operation instruction is used for realizing a matrix-to-matrix operation and comprises a matrix read indication identifying the matrix required to execute the instruction, the required matrix is at least one matrix, and the matrices may have the same length or different lengths;
the computing device controls the matrix operation unit to send a read command to the storage medium according to the matrix read indication;
and the computing device controls the matrix operation unit to read the matrix corresponding to the matrix read indication from the storage medium in a batch reading mode and to execute the first operation instruction on the matrix.
In some possible embodiments, the executing the first operation instruction on the matrix comprises:
and the computing device controls the matrix operation unit to execute the first operation instruction on the matrix by adopting a multi-stage pipeline computing mode.
In some possible embodiments, each of the multiple pipeline stages includes at least one operator,
the computing device controlling the matrix operation unit to execute the first operation instruction on the matrix by adopting a multi-stage pipeline computing mode comprises the following steps:
the computing device controls the matrix operation unit to calculate the matrix by using a first selection operator in a first pipeline stage according to the selection of the multiplexer to obtain a first result, inputs the first result into a second selection operator in a second pipeline stage to perform calculation to obtain a second result, and so on until the (i-1)th result is input into an ith selection operator in an ith pipeline stage to perform calculation to obtain an ith result;
inputting the ith result into the storage medium for storage;
wherein the number i of the multiple pipeline stages is determined according to a computation topology of the first operation instruction, and i is a positive integer.
In some possible embodiments, each of the multiple pipeline stages is configured with a corresponding multiplexer, and the multiplexer is provided with an empty option indicating that no calculation operation is performed in the k-th pipeline stage connected to the multiplexer or in the subsequent pipeline stages from the (k+1)-th onward, where k is a positive integer less than or equal to i.
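The multiplexer selection and its empty option can be illustrated with a minimal software sketch. This is only a model under stated assumptions, not the hardware design: the operator names (`scale`, `negate`, `add_one`), the dictionary-per-stage layout and the function `run_pipeline` are hypothetical, and `None` stands in for the empty option.

```python
from typing import Callable, Dict, List, Optional, Sequence

Matrix = List[List[float]]
Operator = Callable[[Matrix], Matrix]


def scale(m: Matrix) -> Matrix:
    """Hypothetical operator: multiply every element by 2."""
    return [[2 * x for x in row] for row in m]


def negate(m: Matrix) -> Matrix:
    """Hypothetical operator: negate every element."""
    return [[-x for x in row] for row in m]


def add_one(m: Matrix) -> Matrix:
    """Hypothetical operator: add 1 to every element."""
    return [[x + 1 for x in row] for row in m]


# One dictionary of selectable operators per pipeline stage (assumed layout).
STAGES: List[Dict[str, Operator]] = [
    {"scale": scale, "negate": negate},
    {"add_one": add_one},
    {"negate": negate},
]


def run_pipeline(matrix: Matrix,
                 stages: Sequence[Dict[str, Operator]],
                 selections: Sequence[Optional[str]]) -> Matrix:
    """Feed the result of each stage into the next one.

    selections[k] models the multiplexer choice for stage k. None models
    the empty option: that stage and every later stage are skipped.
    """
    result = matrix
    for operators, choice in zip(stages, selections):
        if choice is None:
            break  # empty option: no calculation in this or later stages
        result = operators[choice](result)
    return result
```

For example, `run_pipeline([[1, 2], [3, 4]], STAGES, ["scale", "add_one", None])` uses only the first two stages, so the second result is what would be written back to the storage medium.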
In some possible embodiments, the operators included in each of the multiple pipeline stages and the number of those operators are custom set by a user side or by the computing device side.
In some possible embodiments, each of the multiple pipeline stages includes a preset fixed operator, the fixed operators in each pipeline stage are different,
the computing device controls the matrix operation unit to adopt a multi-level pipeline computing mode, and the executing of the first operation instruction on the matrix comprises the following steps:
the computing device controls the matrix operation unit to compute the matrix by using the fixed operator in a first pipeline stage to obtain a first result, inputs the first result into the fixed operator in a second pipeline stage to perform computation to obtain a second result, and so on until the (i-1)th result is input into the fixed operator in an ith pipeline stage to perform computation to obtain an ith result;
inputting the ith result into the storage medium for storage;
wherein the number i of the multiple pipeline stages is determined according to a computation topology of the first operation instruction, and i is a positive integer.
In some possible embodiments, the operators in each of the multiple pipeline stages comprise any one or a combination of more of: a matrix addition operator, a matrix multiplication operator, a matrix scalar multiplication operator, a nonlinear operator, and a matrix comparison operator.
In some possible embodiments, the first operation instruction comprises any one of: a matrix inversion instruction MINV, a matrix exponent instruction MEXP, a matrix logarithm instruction MLOG, a matrix square-root instruction MSQRT, a matrix up-down flip instruction MUPDO, a matrix left-right flip instruction MLERI, a matrix ninety-degree left rotation instruction MFORTU, a matrix ninety-degree right rotation instruction MBACTU, and an instruction for solving the adjoint matrix.
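As a rough illustration of the data-rearranging instructions in this list, the following sketch models them on lists of rows. The text does not define the exact semantics, so the mapping is an assumption: MUPDO is taken to reverse the row order, MLERI to reverse the column order, MFORTU to rotate counterclockwise and MBACTU to rotate clockwise. The function names and the `REARRANGE` table are hypothetical.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]


def flip_up_down(m: Matrix) -> Matrix:
    """Reverse the order of the rows."""
    return [list(row) for row in reversed(m)]


def flip_left_right(m: Matrix) -> Matrix:
    """Reverse the order of the elements within each row."""
    return [list(reversed(row)) for row in m]


def rotate_left_90(m: Matrix) -> Matrix:
    """Rotate counterclockwise: transpose, then reverse the row order."""
    return [list(col) for col in zip(*m)][::-1]


def rotate_right_90(m: Matrix) -> Matrix:
    """Rotate clockwise: transpose, then reverse each row."""
    return [list(col)[::-1] for col in zip(*m)]


# Assumed opcode-to-operation mapping, for illustration only.
REARRANGE: Dict[str, Callable[[Matrix], Matrix]] = {
    "MUPDO": flip_up_down,
    "MLERI": flip_left_right,
    "MFORTU": rotate_left_90,
    "MBACTU": rotate_right_90,
}
```

Under these assumptions, applying MFORTU and then MBACTU returns the original matrix, which is a convenient sanity check for the two rotations.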
In some possible embodiments, the first operation instruction is a matrix inversion instruction MINV,
the computing device controlling the matrix operation unit to execute the first operation instruction on the matrix by adopting a multi-stage pipeline computing mode comprises the following steps:
the computing device controls the matrix operation unit to perform matrix augmentation on the matrix by using a nonlinear operator in a first pipeline stage according to the selection of a multiplexer to obtain a first result, inputs the first result into a second pipeline stage to perform elementary row transformation on the first result by using a matrix addition operator in the second pipeline stage to obtain a second result, and inputs the second result into a third pipeline stage to perform matrix truncation on the second result by using the nonlinear operator in the third pipeline stage to obtain a third result; and the third result is input to the storage medium for storage.
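The three stages described for MINV correspond to the textbook Gauss-Jordan procedure: augment the matrix with the identity, reduce by elementary row transformations, then keep the right-hand half. The sketch below is a software illustration only, assuming a square non-singular input; the function names are hypothetical, it uses exact `Fraction` arithmetic, and it does not model which hardware operator carries out each step.

```python
from fractions import Fraction
from typing import List, Sequence

Rows = List[List[Fraction]]


def augment(a: Sequence[Sequence[int]]) -> Rows:
    """Stage 1: build the augmented matrix [A | I]."""
    n = len(a)
    return [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
            for i, row in enumerate(a)]


def row_reduce(aug: Rows) -> Rows:
    """Stage 2: elementary row transformations until the left half is I."""
    n = len(aug)
    m = [row[:] for row in aug]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [x / pivot_value for x in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return m


def truncate(reduced: Rows) -> Rows:
    """Stage 3: keep only the right half, which now holds the inverse."""
    n = len(reduced)
    return [row[n:] for row in reduced]


def minv(a: Sequence[Sequence[int]]) -> Rows:
    """Chain the three stages; each result feeds the next stage directly."""
    return truncate(row_reduce(augment(a)))
```

Each stage consumes the previous stage's result directly, which mirrors the way the first and second results are passed between pipeline stages without being written back to the storage medium.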
In some possible embodiments, the first operation instruction is any one of: a matrix exponent instruction MEXP, a matrix logarithm instruction MLOG and a matrix square-root instruction MSQRT,
the computing device controlling the matrix operation unit to execute the first operation instruction on the matrix by adopting a multi-stage pipeline computing mode comprises the following steps:
the computing device controls the matrix operation unit to correspondingly carry out an element-by-element exponential, logarithm or square-root operation on the matrix by using a nonlinear operator in a first pipeline stage according to the selection of the multiplexer to obtain a first result; and the first result is input to the storage medium for storage.
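A single-stage, element-by-element execution of these three instructions might be sketched as follows. The dispatch table `NONLINEAR` and the function `execute_elementwise` are hypothetical names; the sketch assumes natural logarithm for MLOG and base-e exponential for MEXP, which the text does not specify.

```python
import math
from typing import Callable, Dict, List

Matrix = List[List[float]]

# Assumed mapping from opcode to the scalar function applied to each element.
NONLINEAR: Dict[str, Callable[[float], float]] = {
    "MEXP": math.exp,
    "MLOG": math.log,    # natural logarithm is an assumption
    "MSQRT": math.sqrt,
}


def execute_elementwise(opcode: str, matrix: Matrix) -> Matrix:
    """Apply the scalar function selected by the opcode to every element."""
    fn = NONLINEAR[opcode]
    return [[fn(x) for x in row] for row in matrix]
```

Because only one pipeline stage is involved, the first result is also the final result that is stored.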
In some possible embodiments, the instruction format of the first operation instruction includes an operation code and at least one operation field. The operation code indicates the function of the operation instruction, and the operation unit can perform different matrix operations by identifying the operation code. The operation field indicates the data information of the operation instruction, where the data information may be an immediate value or a register number. For example, when a matrix is to be obtained, the matrix start address and the matrix length may be obtained from the corresponding register according to the register number, and the matrix stored at the corresponding address is then obtained from the storage medium according to the matrix start address and the matrix length. Optionally, any one or a combination of the following information about the matrix required by the instruction may be obtained from the respective registers: the number of rows, the number of columns, the data type, the identifier, the storage address (start address), and the dimension length, where the dimension length refers to the length of the matrix rows and/or the length of the matrix columns.
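The lookup described above (operation field, then register, then storage medium) can be sketched as below. The classes `Operand` and `Instruction`, the helper names, and the convention that the first two operation fields hold the start address and the length are all assumptions made for illustration; the storage medium is modelled as a flat Python list.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Operand:
    kind: str   # "imm" for an immediate value, "reg" for a register number
    value: int


@dataclass(frozen=True)
class Instruction:
    opcode: str
    operands: Tuple[Operand, ...]


def resolve(operand: Operand, registers: Dict[int, int]) -> int:
    """Return the immediate itself, or the scalar held in the named register."""
    if operand.kind == "imm":
        return operand.value
    if operand.kind == "reg":
        return registers[operand.value]
    raise ValueError(f"unknown operand kind: {operand.kind}")


def fetch_matrix(instr: Instruction,
                 registers: Dict[int, int],
                 storage: List[int]) -> List[int]:
    """Read `length` elements starting at `start` from the storage medium.

    Assumed field layout: operands[0] is the start address and
    operands[1] is the length.
    """
    start = resolve(instr.operands[0], registers)
    length = resolve(instr.operands[1], registers)
    return storage[start:start + length]
```

For instance, with register 0 holding a start address and register 1 holding a length, `Instruction("MLOG", (Operand("reg", 0), Operand("reg", 1)))` would fetch the matrix data through two register lookups followed by one read of the storage medium.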
In some possible embodiments, the matrix read indication comprises: the memory address of the matrix required by the instruction or the identification of the matrix required by the instruction.
In some possible embodiments, when the matrix read indication is the identifier of the matrix required by the instruction,
the control of the matrix arithmetic unit by the computing device to send a read command to the storage medium according to the matrix read instruction comprises:
the computing device controls the matrix operation unit to read the storage address corresponding to the identifier from the register unit in a unit reading mode according to the identifier;
and the computing device controls the matrix arithmetic unit to send a reading command for reading the storage address to the storage medium and acquires the matrix in a batch reading mode.
In some possible embodiments, the computing device further comprises: a cache unit, the method further comprising:
the computing device caches operation instructions to be executed in the cache unit.
In some possible embodiments, before the computing device controls the matrix operation unit to obtain the first operation instruction, the method further comprises:
the computing device determines whether the first operation instruction is associated with a second operation instruction before the first operation instruction, if so, the first operation instruction is cached in the cache unit, and after the second operation instruction is executed, the first operation instruction is extracted from the cache unit and transmitted to the operation unit;
the determining whether the first operation instruction and a second operation instruction before the first operation instruction have an association relationship includes:
extracting a first storage address interval of a matrix required in the first operation instruction according to the first operation instruction, extracting a second storage address interval of the matrix required in the second operation instruction according to the second operation instruction, determining that the first operation instruction and the second operation instruction have an association relationship if the first storage address interval and the second storage address interval have an overlapped area, and determining that the first operation instruction and the second operation instruction do not have an association relationship if the first storage address interval and the second storage address interval do not have an overlapped area.
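The association check amounts to an interval-overlap test on storage addresses. A minimal sketch follows, assuming each interval is an inclusive `(start, end)` pair; the function names are hypothetical.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) storage addresses


def intervals_overlap(first: Interval, second: Interval) -> bool:
    """True if the two inclusive address intervals share at least one address."""
    (start1, end1), (start2, end2) = first, second
    return start1 <= end2 and start2 <= end1


def has_association(first: Interval, earlier: Iterable[Interval]) -> bool:
    """True if the new instruction's interval overlaps any earlier, unfinished one.

    In that case the new instruction would stay in the cache unit until the
    earlier instruction has finished executing.
    """
    return any(intervals_overlap(first, other) for other in earlier)
```

With the example addresses used later in the description, a matrix at 0000-0FFF and a matrix at 1000-1FFF do not overlap, so instructions touching only one of them each would have no association relationship.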
In a second aspect, a computing device is provided, comprising functional units for performing the method of the first aspect described above.
In a third aspect, a computer-readable storage medium is provided, which stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method provided in the first aspect.
In a fourth aspect, there is provided a computer program product comprising a non-transitory computer readable storage medium having a computer program stored thereon, the computer program being operable to cause a computer to perform the method provided by the first aspect.
In a fifth aspect, there is provided a chip comprising a computing device as provided in the second aspect above.
In a sixth aspect, a chip packaging structure is provided, which includes the chip provided in the fifth aspect.
In a seventh aspect, a board card is provided, where the board card includes the chip packaging structure provided in the sixth aspect.
In an eighth aspect, an electronic device is provided, which includes the board card provided in the seventh aspect.
In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a cell phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a camcorder, a projector, a watch, a headset, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle comprises an aircraft, a ship, and/or a land vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.
The embodiment of the application has the following beneficial effects:
It can be seen that, in the embodiments of the present application, the computing device is provided with a register unit and a storage medium, which store scalar data and matrix data respectively. The present application allocates a unit reading mode and a batch reading mode to these two memories; by allocating to the matrix data a reading mode that matches its characteristics, the bandwidth can be well utilized, and the influence of a bandwidth bottleneck on matrix computation speed is avoided.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of an arithmetic unit according to an embodiment of the present application.
Fig. 3 is a schematic flowchart of a calculation method according to an embodiment of the present invention.
Fig. 4A and 4B are schematic diagrams of architectures of two pipeline stages according to an embodiment of the present disclosure.
Fig. 5 is a schematic structural diagram of a pipeline stage according to an embodiment of the present application.
Fig. 6A and fig. 6B are schematic diagrams of formats of two instruction sets provided by an embodiment of the present application.
Fig. 7 is a schematic structural diagram of another computing device according to an embodiment of the present application.
Fig. 8 is a flowchart illustrating a computing device executing a matrix logarithm instruction according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The matrix referred to in the present application may specifically be an m × n matrix, where m and n are integers greater than or equal to 1. When m or n is 1, it is a 1 × n matrix or an m × 1 matrix, which may also be referred to as a vector; when m and n are both 1, it can be regarded as a special 1 × 1 matrix. The matrix may be any of these three types, which is not described in detail below.
The embodiment of the application provides a computing method which can be applied to a computing device. Fig. 1 is a schematic structural diagram of a possible computing device according to an embodiment of the present invention. The computing device shown in fig. 1 includes:
Storage medium 201, for storing a matrix. The storage medium may be a scratchpad memory that supports matrix data of different lengths; the application temporarily stores the necessary calculation data in the scratchpad memory (Scratchpad Memory), so that the computing device can support data of different lengths more flexibly and effectively during matrix operations. The storage medium may also be an off-chip database, another database, or any other medium capable of storage.
Register unit 202, for storing scalar data, where the scalar data includes, but is not limited to: the storage address of the matrix data (also referred to herein as a matrix) in the storage medium 201, and the scalar used when a matrix and a scalar are operated on. In one embodiment, the register unit may be a scalar register file that provides the scalar registers needed during operations; the scalar registers store not only matrix addresses but also other scalar data. It should be understood that the matrix address (i.e., the storage address of the matrix, such as its start address) is also a scalar. When a matrix-to-matrix operation is involved, the arithmetic unit needs to obtain from the register unit not only the matrix address but also the corresponding scalars, such as the number of rows, the number of columns, the type of the matrix data (also referred to as the data type), and the length of the matrix dimensions (specifically, the length of the matrix rows, the length of the matrix columns, etc.).
The arithmetic unit 203 (also referred to as the matrix arithmetic unit 203 in this application) is configured to obtain and execute a first operation instruction. As shown in fig. 2, the arithmetic unit includes a plurality of operators, which include but are not limited to: a matrix addition operator 2031, a matrix multiplication operator 2032, a size comparison operator 2033 (which may also be a matrix comparison operator), a nonlinear operator 2034, and a matrix scalar multiplication operator 2035.
The method, as shown in fig. 3, includes the following steps:
step S301, the arithmetic unit 203 obtains a first arithmetic instruction, where the first arithmetic instruction is used to implement matrix-to-matrix arithmetic, and the first arithmetic instruction includes: the matrix read indication required to execute the instruction.
In step S301, the matrix read indication required for executing the instruction may take various forms. For example, in an optional technical solution of the present application, the matrix read indication may be the storage address of the required matrix. As another example, in another optional technical solution of the present application, the matrix read indication may be an identifier of the required matrix, and the identifier may be represented in various forms, for example, the name of the matrix, an identification number of the matrix, or the register number or storage address of the matrix in the register unit.
The matrix read indication required for executing the first operation instruction is described below using a practical example. Assume the matrix operation formula is f(x) = A + B, where A and B are both matrices. In addition to the matrix operation formula, the first operation instruction may carry the storage addresses of the matrices required by the formula, for example, the storage address of A is 0000-0FFF and the storage address of B is 1000-1FFF. As another example, it may carry the identifiers of A and B, for example, the identifier of A is 0101 and the identifier of B is 1010.
In step S302, the arithmetic unit 203 sends a read command to the storage medium 201 according to the matrix read indication.
The implementation method of the step S302 may specifically be:
if the matrix read indication is the storage address of the required matrix, the arithmetic unit 203 sends a read command for reading the storage address to the storage medium 201 and acquires the corresponding matrix in a batch reading mode.
If the matrix read indication is the identifier of the required matrix, the arithmetic unit 203 reads the storage address corresponding to the identifier from the register unit in a unit reading mode according to the identifier, and then sends a read command for reading the storage address to the storage medium 201 and acquires the corresponding matrix in a batch reading mode.
The unit reading mode may specifically read one unit of data at a time, that is, 1 bit of data. The unit reading mode, i.e., the 1-bit reading mode, is provided because scalar data occupies very little capacity; if the batch reading mode were used, the amount of data read would easily exceed the amount required, wasting bandwidth. The unit reading mode is therefore used for scalar data to reduce bandwidth waste.
Step S303, the operation unit 203 reads the matrix corresponding to the instruction in a batch reading manner, and executes the first operation instruction on the matrix.
The batch reading mode in step S303 may specifically be that each reading is performed on data of multiple bits, for example, the number of bits of the data read each time is 16 bits, 32 bits, or 64 bits, that is, the data read each time is data of a fixed number of bits regardless of the required data amount, and this batch reading mode is very suitable for reading large data.
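Steps S302 and S303 can be sketched together: resolve the read indication to a storage address, then fetch the matrix in fixed-width transfers. This is a simplified model under stated assumptions: the storage medium is a Python list of elements rather than bits, the register unit is a dictionary keyed by identifier, and the names `resolve_address` and `batch_read` are hypothetical.

```python
from typing import Dict, List, Tuple

ReadIndication = Tuple[str, int]  # ("address", addr) or ("identifier", ident)


def resolve_address(indication: ReadIndication,
                    register_unit: Dict[int, int]) -> int:
    """Return the storage address named by the read indication.

    An identifier costs one small (unit) read of the register unit;
    an address is used directly.
    """
    kind, value = indication
    if kind == "address":
        return value
    if kind == "identifier":
        return register_unit[value]
    raise ValueError(f"unknown read indication: {kind}")


def batch_read(storage: List[int], start: int, length: int,
               batch_width: int) -> Tuple[List[int], int]:
    """Read `length` elements in transfers of exactly `batch_width` elements.

    Returns the data and the number of transfers. Every transfer is
    full width regardless of how much data is still needed, so the
    last one may fetch more than required; the surplus is discarded.
    """
    data: List[int] = []
    transfers = 0
    for offset in range(0, length, batch_width):
        begin = start + offset
        data.extend(storage[begin:begin + batch_width])
        transfers += 1
    return data[:length], transfers
```

The surplus fetched by the last transfer is small relative to a large matrix, but it would dominate if the same mode were used for a single scalar, which is the motivation given above for the separate unit reading mode.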
In the technical scheme provided by this application, the computing device is provided with a register unit and a storage medium, which store scalar data and matrix data respectively, and a unit reading mode and a batch reading mode are allocated to these two memories. Allocating to the matrix data a reading mode that matches its characteristics makes good use of the bandwidth and avoids the influence of a bandwidth bottleneck on matrix computation speed. In addition, because the register unit stores scalar data, a reading mode suited to scalar data is set for it, which improves bandwidth utilization. The technical scheme therefore has the advantages of high computation speed and high efficiency.
Optionally, the executing the first operation instruction on the matrix may specifically be:
the operation unit 203 may adopt a multi-stage pipeline-level calculation manner, and in the embodiment of the present application, the first operation instruction may be executed on the matrix by adopting an i-stage pipeline-level calculation manner. The method specifically comprises the following embodiments.
In a first embodiment, the computing device may further be provided with at least one multiplexer MMUX to implement multi-stage pipeline computation. Specifically, fig. 4A and fig. 4B show two implementation architectures of a multi-stage pipeline. As shown in fig. 4A, the computing device may provide one multiplexer MMUX for each pipeline stage; in fig. 4B, one multiplexer MMUX is provided for a plurality of pipeline stages. Each multiplexer is connected to the pipeline stage or stages it controls and selects the operator in that stage used for the related calculation. It should be understood that the operator selected by the multiplexer from the pipeline stage is determined according to the computing network topology corresponding to the first operation instruction, which will be described in detail below.
In a specific implementation, the operation unit 203 may calculate the matrix by using a first selection operator selected by a first (stage) pipeline stage according to the selection of the multiplexer to obtain a first result, then input the first result to a second pipeline stage, perform the calculation by using a second selection operator selected by the second pipeline stage to obtain a second result, and so on, input the i-1 th result to the ith pipeline stage, and perform the calculation by using the selected ith selection operator to obtain the ith result. Here, the ith result is an output result (specifically, an output matrix). Further, the arithmetic unit 203 may store the output result to the storage medium 201.
The number i of the multiple pipeline stages is specifically determined according to a calculation topology of the first operation instruction, and i is a positive integer. Typically, i = 3. Respective operators may be provided in each pipeline stage, including, but not limited to, any one or a combination of: matrix addition operators, matrix scalar multiplication operators, non-linear operators, matrix comparison operators, and other matrix operators. That is, the operators included in each pipeline stage and the number of those operators may be custom set by a user side or by the computing device side, and are not limited.
Taking i = 3, that is, a three-stage pipeline, as an example, the arithmetic unit may select, through three multiplexers, the operator (also referred to as an arithmetic unit) to be used in each of the first to third pipeline stages. It performs the first pipeline stage calculation on the matrix to obtain a first result, (optionally) inputs the first result to the second pipeline stage to perform the second pipeline stage calculation to obtain a second result, (optionally) inputs the second result to the third pipeline stage to perform the third pipeline stage calculation to obtain a third result, and (optionally) stores the third result in the storage medium 201. Fig. 5 shows a flow chart of the operation of a pipeline stage.
The first pipeline stage includes, but is not limited to: matrix multiplication operators, etc.
The second pipeline stage includes, but is not limited to: matrix addition operators, magnitude comparison operators, and the like.
The third pipeline stage includes, but is not limited to: nonlinear operators, matrix scalar multiplication operators, and the like.
By contrast, when a general-purpose processor performs the matrix calculation, the steps may be as follows: the processor performs a calculation on the matrix to obtain a first result and stores the first result in memory; reads the first result from memory to perform a second calculation, obtains a second result, and stores it in memory; then reads the second result from memory to perform a third calculation, obtains a third result, and stores it in memory. These steps show that a general-purpose processor does not divide the matrix calculation into pipeline stages, so the data must be stored after each calculation and read again before the next one. This scheme therefore requires repeatedly storing and reading data.
In another embodiment of the present application, the pipeline components may be freely combined, or a single pipeline stage may be adopted. For example, the second pipeline stage may be merged with the third pipeline stage, or the first, second, and third pipeline stages may all be merged, or each pipeline stage may be responsible for different operations. For example, the first pipeline stage is responsible for comparison operations and part of the multiplication operations, and the second pipeline stage is responsible for a combination of nonlinear operations and matrix scalar multiplication. That is, the i pipeline stages designed in the present application support parallel connection, serial connection, and any combination of multiple pipeline stages to form different arrangements, which is not limited in the present application.
It should be noted that each multiplexer may also be provided with a null option. The null option indicates that the pipeline stage k connected to that multiplexer and the subsequent pipeline stages k+1 to i do not perform the calculation, where k is a positive integer less than or equal to i.
Taking i = 3, a three-stage pipeline, as an example, if the multiplexer connected to the third pipeline stage selects the null option, the third pipeline stage does not participate in the operation, and fewer than three pipeline stages execute the operation. For example, if an operation instruction involves only two stages of operations, the multiplexer corresponding to the third pipeline stage selects the null option.
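The multiplexer-based selection and the null option described above can be illustrated with a minimal Python sketch. The operator names (`double`, `add_one`, `relu`, `negate`), the `STAGES` table, and the `run_pipeline` function are hypothetical and chosen only for illustration; the patent does not prescribe this software form, and a `None` selection stands in for the null option.

```python
from typing import Callable, Dict, List, Optional, Sequence

Matrix = List[List[float]]
Operator = Callable[[Matrix], Matrix]


def elementwise(fn: Callable[[float], float]) -> Operator:
    """Lift a scalar function to an operator applied to every matrix element."""
    def op(m: Matrix) -> Matrix:
        return [[fn(x) for x in row] for row in m]
    return op


# Hypothetical operators available in each of the three pipeline stages.
STAGES: List[Dict[str, Operator]] = [
    {"double": elementwise(lambda x: 2 * x)},
    {"add_one": elementwise(lambda x: x + 1)},
    {"relu": elementwise(lambda x: max(x, 0)),
     "negate": elementwise(lambda x: -x)},
]


def run_pipeline(matrix: Matrix,
                 stages: Sequence[Dict[str, Operator]],
                 selections: Sequence[Optional[str]]) -> Matrix:
    """Feed each stage's result to the next stage.

    selections[k] models the multiplexer of stage k+1; None models the null
    option, so that stage and all later stages do not participate.
    """
    result = matrix
    for operators, choice in zip(stages, selections):
        if choice is None:
            break
        result = operators[choice](result)
    return result
```

For instance, `run_pipeline([[1, -2]], STAGES, ["double", "add_one", None])` uses only the first two stages, which corresponds to an operation instruction that needs two stages of calculation.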
Adopting this computing device (that is, using a multiplexer to select the operator or operation component needed in each pipeline stage) has the following beneficial effects: besides improving the bandwidth, the logic is clear, there is a single input interface and a single output interface, operability is good, and the output result of an operation component never needs to jump from a later pipeline stage back to an earlier one.
In a second embodiment, the computing device may design a corresponding fixed pipeline-stage implementation architecture for each type of operation instruction. Fig. 5 shows an implementation architecture with three pipeline stages corresponding to one operation instruction. That is, for a given operation instruction, the operator included in each pipeline stage is fixed in advance on the user side or the computing device side, and is referred to in the present application as a fixed operator. The fixed operators in different pipeline stages may be the same or different, and are usually different. For example, the first pipeline stage is a matrix addition operator, the second pipeline stage is a matrix multiplication operator, and the third pipeline stage is a nonlinear operator; as another example, the first pipeline stage is a matrix addition operator, the second pipeline stage is a matrix addition operator, and the third pipeline stage is a matrix multiplication operator; and so on. In other words, different operation instructions correspond to different pipeline-stage devices (implementation architectures). It should be understood that, depending on the implementation requirements of different operation instructions, the number i of pipeline stages involved may differ and may be increased or decreased accordingly, which is not limited in the present application.
In a specific implementation, the operation unit 203 calculates the matrix using the fixed operator in the first pipeline stage to obtain a first result, inputs the first result into the second pipeline stage and calculates using the fixed operator therein to obtain a second result, and so on, until the (i-1)-th result is input into the i-th pipeline stage and calculated using the fixed operator therein to obtain the i-th result. Here, the i-th result is the output result (specifically, an output matrix). Further, the operation unit 203 may store the output result to the storage medium 201. For the number i of pipeline stages and the fixed operators designed in each pipeline stage, reference may be made to the related explanations in the foregoing embodiments, and details are not repeated here.
Take i = 3, a three-stage pipeline, as an example. Fig. 5 shows a fixed pipeline-stage implementation architecture corresponding to an operation instruction. Specifically, the operation unit performs the multiplication calculation of the first pipeline stage on the matrix to obtain a first result, inputs the first result to the second pipeline stage to perform the addition calculation of the second pipeline stage and obtain a second result, inputs the second result to the third pipeline stage to perform the nonlinear calculation of the third pipeline stage and obtain a third result, and stores the third result (i.e., the output result) in the storage medium 201.
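As an illustration of the fixed multiply, add, nonlinear arrangement just described, the following Python sketch chains the three stages without storing intermediate results to an external memory. The second operands (`weights`, `bias`) and the choice of ReLU as the nonlinear operator are assumptions made only for this example; the text does not specify them.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Stage 1 (assumed): ordinary matrix multiplication a x b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matadd(a: Matrix, b: Matrix) -> Matrix:
    """Stage 2 (assumed): element-by-element matrix addition."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def relu(m: Matrix) -> Matrix:
    """Stage 3 (assumed): ReLU as one possible nonlinear operator."""
    return [[max(x, 0) for x in row] for row in m]


def fixed_pipeline(x: Matrix, weights: Matrix, bias: Matrix) -> Matrix:
    """Each stage consumes the previous stage's result directly."""
    first = matmul(x, weights)    # first result
    second = matadd(first, bias)  # second result
    third = relu(second)          # third result, i.e. the output matrix
    return third
```

Because the operators are fixed, no selection logic runs between stages, which is the property the second embodiment relies on.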
It should be noted that the operators in each pipeline stage of the computing device are custom set in advance and cannot be changed once determined. That is, the i pipeline stages can be designed as any permutation and combination of operators, but are not changed once fixed; different operation instructions can be designed with different i-stage pipeline devices. The computing device may adaptively increase or decrease the number of pipeline stages as required by a particular instruction. Finally, the pipeline devices designed for different instructions may be combined together to form the computing device.
By adopting the computing device (namely the arithmetic unit/arithmetic part in each level of the pipeline is designed and fixed), the following beneficial effects are achieved: besides improving the bandwidth, the method has the characteristics of high specificity, no redundant logic judgment, further improved operation performance and high operation speed.
Optionally, the computing device may further include a cache unit 204 configured to cache the first operation instruction. While an instruction is being executed, if it is also the earliest of the uncommitted instructions in the instruction cache unit, it is committed; once committed, the changes to the device state caused by the instruction's operation cannot be undone. In one embodiment, the instruction cache unit may be a reorder buffer.
Optionally, before step S301, the method may further include:
and determining whether the first operation instruction is associated with a second operation instruction before the first operation instruction, if so, extracting the first operation instruction from the cache unit and transmitting the first operation instruction to the operation unit 203 after the second operation instruction is completely executed. If the first operation instruction is not related to the instruction before the first operation instruction, the first operation instruction is directly transmitted to the operation unit.
The specific implementation method for determining whether the first operation instruction and the second operation instruction before the first operation instruction have an association relationship may be:
extracting a first storage address interval of the matrix required by the first operation instruction according to the first operation instruction, extracting a second storage address interval of the matrix required by the second operation instruction according to the second operation instruction, and, if the first storage address interval and the second storage address interval have an overlapping area, determining that the first operation instruction and the second operation instruction have an association relationship; if the two intervals do not overlap, determining that the first operation instruction and the second operation instruction have no association relationship.
An overlapping area between the storage address intervals indicates that the first operation instruction and the second operation instruction access the same matrix. Because a matrix occupies a relatively large storage space, using identical storage areas as the condition for an association relationship can fail: the storage area accessed by the second operation instruction may contain the storage area accessed by the first operation instruction. For example, suppose the second operation instruction accesses the storage areas of the A matrix, the B matrix, and the C matrix. If the A and B storage areas are adjacent, or the A and C storage areas are adjacent, the storage areas accessed by the second operation instruction are the merged A, B storage area plus the C storage area, or the merged A, C storage area plus the B storage area. If the first operation instruction then accesses the storage areas of the A matrix and the D matrix, the storage areas accessed by the first operation instruction cannot be identical to those accessed by the second operation instruction. Under the identical-area condition, the two instructions would be judged unassociated, although in practice they are associated. The present application therefore uses the existence of an overlapping area as the condition for an association relationship, which avoids this misjudgment.
The following describes, by way of a practical example, which cases are association relationships and which are not. Assume that the matrices required by the first operation instruction are an A matrix and a D matrix, where the storage area of the A matrix is [0001, 0FFF] and the storage area of the D matrix is [A000, AFFF], and that the matrices required by the second operation instruction are the A matrix, a B matrix, and a C matrix, whose storage areas are [0001, 0FFF], [1000, 1FFF], and [B000, BFFF], respectively. For the first operation instruction, the corresponding storage areas are [0001, 0FFF] and [A000, AFFF]; for the second operation instruction, the corresponding storage areas are [0001, 1FFF] and [B000, BFFF]. The storage area of the second operation instruction therefore overlaps that of the first operation instruction in [0001, 0FFF], so the first operation instruction and the second operation instruction have an association relationship.
Now assume that the matrices required by the first operation instruction are an E matrix and a D matrix, where the storage area of the E matrix is [C000, CFFF] and the storage area of the D matrix is [A000, AFFF], and that the matrices required by the second operation instruction are the A matrix, the B matrix, and the C matrix, whose storage areas are [0001, 0FFF], [1000, 1FFF], and [B000, BFFF], respectively. For the first operation instruction, the corresponding storage areas are [C000, CFFF] and [A000, AFFF]; for the second operation instruction, the corresponding storage areas are [0001, 1FFF] and [B000, BFFF]. The storage area of the second operation instruction does not overlap that of the first operation instruction, so the first operation instruction and the second operation instruction have no association relationship.
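The overlap test in the two examples above can be sketched in Python as follows. Intervals are modeled as inclusive `(start, end)` address pairs; the function names `overlaps` and `has_association` are hypothetical and serve only to illustrate the check.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) storage addresses


def overlaps(a: Interval, b: Interval) -> bool:
    """Two inclusive intervals overlap when each starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def has_association(first: List[Interval], second: List[Interval]) -> bool:
    """True if any area of the first instruction overlaps any area of the second."""
    return any(overlaps(x, y) for x in first for y in second)


# Storage areas taken from the two examples in the text.
second_instr = [(0x0001, 0x1FFF), (0xB000, 0xBFFF)]
first_with_a_and_d = [(0x0001, 0x0FFF), (0xA000, 0xAFFF)]
first_with_e_and_d = [(0xC000, 0xCFFF), (0xA000, 0xAFFF)]
```

With these values, `has_association(first_with_a_and_d, second_instr)` is true, so the first operation instruction would wait for the second one to finish, while `has_association(first_with_e_and_d, second_instr)` is false and the instruction could be issued directly.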
In this application, as shown in fig. 6A, the operation instruction includes an opcode and at least one operation field. The opcode indicates the function of the operation instruction, and the operation unit can perform different matrix operations by identifying the opcode. The operation field indicates the data information of the operation instruction, where the data information may be an immediate value or a register number. For example, when a matrix is to be obtained, the matrix start address and the matrix length may be read from the corresponding register according to the register number, and the matrix stored at the corresponding address is then obtained from the storage medium according to the matrix start address and the matrix length.
That is, the first operation instruction may include at least one operation field and at least one opcode. Taking a matrix operation instruction as an example, as shown in Table 1, register 0, register 1, register 2, register 3, and register 4 may be operation fields. Each of register 0 through register 4 identifies a register number and may denote one or more registers. It should be understood that the number of registers in the operation instruction is not limited, and each register is used to store data information associated with the operation instruction.
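The register-indirect operand lookup described above can be sketched as follows. The `Instruction` class, the dictionary-based register file, and the flat list used as the storage medium are all hypothetical simplifications introduced for illustration; they are not the encoding used by the device.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Instruction:
    opcode: str              # function of the instruction, e.g. "MINV"
    fields: Tuple[int, ...]  # operation fields, here holding register numbers


def fetch_matrix(register_no: int,
                 registers: Dict[int, Tuple[int, int]],
                 storage: List[float]) -> List[float]:
    """Look up (start address, length) in the register, then read the storage."""
    start, length = registers[register_no]
    return storage[start:start + length]


def fetch_operands(instr: Instruction,
                   registers: Dict[int, Tuple[int, int]],
                   storage: List[float]) -> List[List[float]]:
    """Fetch the matrix data named by every operation field of the instruction."""
    return [fetch_matrix(r, registers, storage) for r in instr.fields]
```

In this sketch each register holds a start address and a length, so the same instruction can address matrices of different sizes without changing its encoding.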
Figure BDA0001510700790000141
Fig. 6B is a schematic diagram of the format of another instruction provided in the present application (which may be the first operation instruction, also referred to simply as an operation instruction). As shown in fig. 6B, the instruction includes at least two opcodes and at least one operation field, where the at least two opcodes include a first opcode and a second opcode (shown as opcode 1 and opcode 2, respectively). Opcode 1 indicates the type of the instruction (i.e., a class of instructions), which may specifically be an IO instruction, a logic instruction, an operation instruction, or the like. Opcode 2 indicates the function of the instruction (i.e., the specific instruction within that class), such as a matrix operation instruction among the operation instructions (e.g., a matrix-multiply-vector instruction MMUL, a matrix inversion instruction MINV, etc.) or a vector operation instruction (e.g., a vector derivation instruction VDIER, etc.), which is not limited in this application.
It should be understood that the format of the instruction may be custom set on the user side or on the computing device side. The opcode of the instruction may be designed with a fixed length, such as 8 bits or 16 bits. The instruction format shown in fig. 6A has the following advantages: the opcode occupies few bits and the decoding system is simple to design. The instruction format shown in fig. 6B has the following advantages: the length is variable and the average decoding efficiency is higher; when a class contains few specific instructions that are called frequently, the second opcode (i.e., opcode 2) can be made short, which improves decoding efficiency. In addition, the readability and extensibility of the instructions are enhanced, and the encoding structure of the instructions can be optimized.
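A two-level decode of the fig. 6B format might look like the sketch below. The 16-bit word layout (4 bits for opcode 1, 4 bits for opcode 2, 8 bits for one operation field) and the numeric codes assigned to each class and mnemonic are assumptions made purely for illustration; the text only names the classes and some mnemonics and leaves the encoding open.

```python
from typing import Dict, Tuple

# Hypothetical encodings: opcode 1 selects the instruction class,
# opcode 2 selects the specific instruction within that class.
OPCODE1_CLASSES: Dict[int, str] = {0: "IO", 1: "LOGIC", 2: "COMPUTE"}
OPCODE2_FUNCTIONS: Dict[str, Dict[int, str]] = {
    "COMPUTE": {0: "MMUL", 1: "MINV", 2: "VDIER"},
}


def decode(word: int) -> Tuple[str, str, int]:
    """Split an assumed 16-bit instruction word into class, function and field."""
    opcode1 = (word >> 12) & 0xF
    opcode2 = (word >> 8) & 0xF
    field = word & 0xFF
    instr_class = OPCODE1_CLASSES[opcode1]
    function = OPCODE2_FUNCTIONS[instr_class][opcode2]
    return instr_class, function, field
```

Decoding the class first means the second lookup table only has to cover the instructions of that class, which is one way to read the claim that a short opcode 2 improves decoding efficiency.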
In the embodiment of the present application, the instruction set includes operation instructions with different functions, which may specifically be:
A matrix inversion instruction (MINV), according to which the apparatus fetches matrix data of a set length from a specified address of a memory (preferably a scratch pad memory or a scalar register file), performs matrix inversion in the operation unit, and writes back the result. Preferably, the result of the computation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file).
A matrix exponentiation instruction (MEXP), according to which the apparatus fetches matrix data of a set length from a specified address of a memory (preferably a scratch pad memory or a scalar register file), performs element-by-element exponentiation of the matrix in the operation unit, and writes back the result. Preferably, the result of the computation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file). For example, $A^m$, where A is a matrix, m is the exponent, and m is a positive integer.
A matrix logarithm instruction (MLOG), according to which the apparatus fetches matrix data of a set length from a specified address of a memory (preferably a scratch pad memory or a scalar register file), performs an element-by-element logarithm of the matrix in the operation unit, and writes back the result. Preferably, the result of the computation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file). For example, $\log_m A$, where A is a matrix, m is the base, and m is a positive integer.
A matrix root instruction (MSQRT), according to which the apparatus fetches matrix data of a set length from a specified address of a memory (preferably a scratch pad memory or a scalar register file), performs an element-by-element root extraction on the matrix in the operation unit, and writes back the result. Preferably, the result of the computation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file). For example, $\sqrt[m]{A}$, where A is a matrix, m is the degree of the root, and m is a positive integer.
A matrix up-down flip instruction (MUPDO), according to which the apparatus fetches matrix data of a set length from a specified address of a memory (preferably a scratch pad memory or a scalar register file), flips the matrix up-down in the operation unit, and writes back the result. Preferably, the result of the computation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file).
A matrix left-right flip instruction (MLERI), according to which the apparatus fetches matrix data of a set length from a specified address of a memory (preferably a scratch pad memory or a scalar register file), flips the matrix left-right in the operation unit, and writes back the result. Preferably, the result of the computation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file).
A matrix left ninety-degree flip instruction (MFORTU), according to which the apparatus fetches matrix data of a set length from a specified address of a memory (preferably a scratch pad memory or a scalar register file), flips the matrix 90° to the left in the operation unit, and writes back the result. Preferably, the result of the computation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file).
A matrix right ninety-degree flip instruction (MBACTU), according to which the apparatus fetches matrix data of a set length from a specified address of a memory (preferably a scratch pad memory or a scalar register file), flips the matrix 90° to the right in the operation unit, and writes back the result. Preferably, the result of the computation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file).
An adjugate (adjoint) matrix instruction, according to which the apparatus fetches matrix data of a set length from a specified address of a memory (preferably a scratch pad memory or a scalar register file), computes the adjugate matrix of the matrix in the operation unit, and writes back the result. Preferably, the result of the computation is written back to a specified address of the memory (preferably a scratch pad memory or a scalar register file).
It should be understood that the matrix operation instructions proposed in the present application are mainly used for element-by-element numerical operations on matrices, special matrix transformations (such as inversion), changes of matrix form (such as flipping), and the like. Accordingly, the operators designed in each pipeline stage include, but are not limited to, any one or a combination of: a matrix addition operator, a matrix multiplication operator, a matrix scalar multiplication operator, a nonlinear operator, and a matrix comparison operator.
The following exemplifies calculation of an operation instruction (i.e., a first operation instruction) according to the present application.
Taking the first operation instruction as the matrix inversion instruction MINV as an example, the inverse of a given full-rank square matrix is calculated. In a specific implementation, given a full-rank square matrix A, matrix inversion is carried out by elementary row transformations: A is first augmented with the identity matrix I to form $A_1 = [A \mid I]$; elementary row transformations then reduce the square matrix A to the identity matrix I, and the block that takes the place of I is the inverse matrix B, that is, $[A \mid I] \rightarrow [I \mid A^{-1}]$ with $B = A^{-1}$.
Correspondingly, the instruction format of the matrix inversion instruction is specifically:
Figure BDA0001510700790000161
Figure BDA0001510700790000171
In combination with the foregoing embodiments, the operation unit may obtain the matrix inversion instruction MINV and decode it, use the nonlinear operator selected by the first-pipeline-stage multiplexer to perform matrix augmentation on the matrix and obtain a first result, input the first result into the matrix addition operator selected by the second-pipeline-stage multiplexer to perform the elementary row transformations and obtain a second result, and finally input the second result into the nonlinear operator selected by the third-pipeline-stage multiplexer to perform matrix truncation and obtain a third result (i.e., the output result). Optionally, the third result is stored in the storage medium.
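The three steps of the MINV example (augmentation, elementary row transformations, truncation) correspond to Gauss-Jordan elimination, which can be sketched in Python as below. This is a software illustration under the assumption of exact rational arithmetic (`fractions.Fraction`); the function name `invert` is hypothetical, and the sketch says nothing about how the hardware operators are actually built.

```python
from fractions import Fraction
from typing import List, Sequence


def invert(a: Sequence[Sequence[int]]) -> List[List[Fraction]]:
    """Invert a full-rank square matrix by Gauss-Jordan elimination."""
    n = len(a)
    # Step 1 (augmentation): build [A | I].
    aug = [[Fraction(x) for x in row] +
           [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    # Step 2 (elementary row transformations): reduce the left block to I.
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is not full rank")
        aug[col], aug[pivot] = aug[pivot], aug[col]  # row swap
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]         # scale pivot row to 1
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    # Step 3 (truncation): keep the right block, which now holds the inverse.
    return [row[n:] for row in aug]
```

For example, `invert([[2, 1], [1, 1]])` yields a matrix equal to `[[1, -1], [-1, 2]]`, and a singular input raises `ValueError`.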
Taking the first operation instruction as the matrix exponent instruction MEXP as an example, given a matrix A, an element-by-element exponential with the natural constant e as the base and each element as the exponent is computed for all elements of the matrix. In a specific implementation, given a matrix A, the output matrix B is obtained according to the following formula.
$$B_{mn} = e^{A_{mn}}$$
where $A_{mn}$ is the element in the m-th row and n-th column of matrix A, $B_{mn}$ is the corresponding element of B, and m and n are positive integers.
Correspondingly, the instruction format of the matrix exponent calculating instruction is specifically as follows:
Figure BDA0001510700790000173
Figure BDA0001510700790000181
In combination with the foregoing embodiments, the operation unit may obtain the matrix exponent instruction MEXP and decode it, and then use the nonlinear operator selected by the first-pipeline-stage multiplexer to perform the element-by-element exponent calculation on the matrix and obtain a first result (i.e., the output result/output matrix). Optionally, the first result is stored in the storage medium.
Taking the first operation instruction as the matrix logarithm instruction MLOG as an example, given a matrix A, an element-by-element logarithm with the natural constant e as the base is computed for all elements of the matrix. In a specific implementation, given a matrix A, the output matrix B is obtained according to the following formula.
$$B_{mn} = \ln A_{mn}$$
where $A_{mn}$ is the element in the m-th row and n-th column of matrix A, $B_{mn}$ is the corresponding element of B, and m and n are positive integers. It should be understood that the present application uses the base e only as an example; optionally, other bases such as 2 or 3 may be used, which is not limited.
Correspondingly, the instruction format of the matrix logarithm instruction is specifically as follows:
Figure BDA0001510700790000191
In combination with the foregoing embodiments, the operation unit may obtain the matrix logarithm instruction MLOG and decode it, and then use the nonlinear operator selected by the first-pipeline-stage multiplexer to perform the element-by-element logarithm operation on the matrix and obtain a first result (i.e., the output result/output matrix). Optionally, the first result is stored in the storage medium.
Taking the first operation instruction as the matrix root instruction MSQRT as an example, given a matrix A, an element-by-element root extraction is performed on all elements of the matrix. In a specific implementation, given a matrix A, the output matrix B is obtained according to the following formula.
$$B_{mn} = \sqrt{A_{mn}}$$
where $A_{mn}$ is the element in the m-th row and n-th column of matrix A, $B_{mn}$ is the corresponding element of B, and m and n are positive integers. It should be understood that the present application uses a root of degree 2 (i.e., sqrt) only as an example; optionally, other degrees such as 3 or 4 may be used, which is not limited.
Correspondingly, the instruction format of the matrix root instruction is specifically as follows:
Figure BDA0001510700790000202
In combination with the foregoing embodiments, the operation unit may obtain the matrix root instruction MSQRT and decode it, and then use the nonlinear operator selected by the first-pipeline-stage multiplexer to perform the element-by-element root extraction on the matrix and obtain a first result (i.e., the output result/output matrix). Optionally, the first result is stored in the storage medium.
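The MEXP, MLOG, and MSQRT examples above all apply one scalar function to every element of the matrix, which the following Python sketch makes explicit. The function names `mexp`, `mlog`, and `msqrt` merely echo the instruction mnemonics and are hypothetical; the optional `base` and `degree` parameters reflect the remark that other bases and root degrees may be used.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]


def elementwise(fn: Callable[[float], float], m: Matrix) -> Matrix:
    """Apply a scalar function to each element, preserving the matrix shape."""
    return [[fn(x) for x in row] for row in m]


def mexp(m: Matrix) -> Matrix:
    """B[m][n] = e ** A[m][n]."""
    return elementwise(math.exp, m)


def mlog(m: Matrix, base: float = math.e) -> Matrix:
    """B[m][n] = log of A[m][n] in the given base (natural log by default)."""
    return elementwise(lambda x: math.log(x, base), m)


def msqrt(m: Matrix, degree: int = 2) -> Matrix:
    """B[m][n] = degree-th root of A[m][n] (square root by default)."""
    return elementwise(lambda x: x ** (1.0 / degree), m)
```

Since each output element depends on a single input element, a single pipeline stage with a nonlinear operator suffices, matching the one-stage description in the text.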
Taking the first operation instruction as a matrix up-down flip instruction MUPDO as an example, a matrix a is given, and the matrix is subjected to up-down flip operation. In concrete implementation, a matrix A is given, and an output matrix B is obtained by solving according to the following formula.
Figure BDA0001510700790000211
where $A_{mn}$ is the element in the m-th row and n-th column of matrix A, and m and n are positive integers.
Correspondingly, the instruction format of the matrix up-down flip instruction MUPDO is specifically:
Figure BDA0001510700790000212
In combination with the foregoing embodiments, the operation unit may obtain the matrix up-down flip instruction MUPDO and decode it, and then use the matrix transpose device selected by the first-pipeline-stage multiplexer to exchange the corresponding elements according to the transformation formula and obtain a first result (i.e., the output result/output matrix). Optionally, the first result is stored in the storage medium.
Taking the first operation instruction as a matrix left-right flip instruction MLERI as an example, a matrix a is given, and left-right flip operation is performed on the matrix. In concrete implementation, a matrix A is given, and an output matrix B is obtained by solving according to the following formula.
Figure BDA0001510700790000221
where $A_{mn}$ is the element in the m-th row and n-th column of matrix A, and m and n are positive integers.
Correspondingly, the instruction format of the matrix left-right flip instruction MLERI is specifically:
Figure BDA0001510700790000222
With reference to the foregoing embodiment, the operation unit may obtain the matrix left-right flip instruction MLERI and decode it, and then use the matrix transpose device selected by the first-pipeline-stage multiplexer to exchange the corresponding elements according to the transformation formula and obtain a first result (i.e., the output result/output matrix). Optionally, the first result is stored in the storage medium. The present application does not describe the matrix transpose device in further detail.
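The element exchanges performed for MUPDO and MLERI can be sketched in Python as follows. The function names are hypothetical, and the sketch shows only the resulting element positions, not the hardware exchange mechanism.

```python
from typing import List

Matrix = List[List[float]]


def flip_up_down(m: Matrix) -> Matrix:
    """Reverse the order of the rows: the first row becomes the last."""
    return [list(row) for row in reversed(m)]


def flip_left_right(m: Matrix) -> Matrix:
    """Reverse the order of the columns: the first column becomes the last."""
    return [list(reversed(row)) for row in m]
```

Applying either flip twice returns the original matrix, which is a quick way to check an implementation.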
Taking the first operation instruction as a matrix left 90 ° (ninety degrees) flip instruction MFORTU as an example, given a matrix a, a left 90 ° flip operation is performed on the matrix. In concrete implementation, a matrix A is given, and an output matrix B is obtained by solving according to the following formula.
Figure BDA0001510700790000231
where $A_{mn}$ is the element in the m-th row and n-th column of matrix A, and m and n are positive integers.
Correspondingly, the instruction format of the matrix left 90 ° flip instruction MFORTU is specifically:
Figure BDA0001510700790000232
Figure BDA0001510700790000241
With reference to the foregoing embodiment, the operation unit may obtain the matrix left 90° flip instruction MFORTU and decode it, and then use the matrix transpose device selected by the first-pipeline-stage multiplexer to exchange the corresponding elements according to the transformation formula and obtain a first result (i.e., the output result/output matrix). Optionally, the first result is stored in the storage medium.
Taking the first operation instruction as a matrix right 90 ° (ninety degrees) flip instruction MBACTU as an example, given a matrix a, the matrix is subjected to a right 90 ° flip operation. In concrete implementation, a matrix A is given, and an output matrix B is obtained by solving according to the following formula.
Figure BDA0001510700790000242
where $A_{mn}$ is the element in the m-th row and n-th column of matrix A, and m and n are positive integers.
Correspondingly, the instruction format of the matrix right 90 ° flip instruction MBACTU is specifically:
Figure BDA0001510700790000243
Figure BDA0001510700790000251
With reference to the foregoing embodiment, the operation unit may obtain the matrix right 90° flip instruction MBACTU and decode it, and then use the matrix transpose device selected by the first-pipeline-stage multiplexer to exchange the corresponding elements according to the transformation formula and obtain a first result (i.e., the output result/output matrix). Optionally, the first result is stored in the storage medium.
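The two 90° flips can be sketched in Python as below, under the assumption that "left" means a counter-clockwise rotation and "right" means a clockwise rotation; the transformation formulas themselves are given only in the figures, so this reading and the function names are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def rotate_left_90(m: Matrix) -> Matrix:
    """Assumed MFORTU behaviour: rotate counter-clockwise by 90 degrees.

    Transposing and then reversing the row order moves the last column
    of the input to the first row of the output.
    """
    return [list(row) for row in reversed(list(zip(*m)))]


def rotate_right_90(m: Matrix) -> Matrix:
    """Assumed MBACTU behaviour: rotate clockwise by 90 degrees.

    Reversing the row order and then transposing moves the first column
    of the input (read bottom to top) to the first row of the output.
    """
    return [list(row) for row in zip(*reversed(m))]
```

Under this reading the two operations are inverses of each other, so `rotate_right_90(rotate_left_90(m))` returns `m`.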
It should be noted that the fetching and decoding of the various operation instructions will be described in detail later. It should be understood that implementing the calculation of the operation instructions (such as the matrix inversion instruction MINV) with the structure of the above computing device provides the following beneficial effects: the matrix scale is variable, which reduces the number of instructions and simplifies their use; matrices in different storage formats (row-major and column-major order) can be processed, which avoids the overhead of converting the matrix; and matrices stored at fixed intervals are supported, which avoids the execution overhead of converting the matrix storage format and the space needed to store intermediate results.
The set length in the above operation instructions can be set by the user. In an alternative embodiment, the user sets the set length to a single value; in practical applications, the user may also set it to multiple values. The embodiments of the present invention do not limit the specific value or the number of set lengths. In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
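To illustrate how one read routine could cover row-major, column-major, and interval-stored matrices without a separate conversion step, consider the following Python sketch. The `read_matrix` function, its `stride` parameter (the distance between the starts of consecutive rows or columns), and the flat-list storage model are assumptions introduced for this example only.

```python
from typing import List, Optional


def read_matrix(storage: List[float], start: int, rows: int, cols: int,
                stride: Optional[int] = None,
                column_major: bool = False) -> List[List[float]]:
    """Read a rows x cols matrix from flat storage.

    stride is the assumed distance between consecutive rows (row-major) or
    consecutive columns (column-major); it defaults to a densely packed layout.
    """
    if column_major:
        step = rows if stride is None else stride
        return [[storage[start + j * step + i] for j in range(cols)]
                for i in range(rows)]
    step = cols if stride is None else stride
    return [[storage[start + i * step + j] for j in range(cols)]
            for i in range(rows)]
```

The caller always receives the matrix in the same logical row-and-column form, so later operators do not need to know how the data was laid out in storage.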
Referring to fig. 7, fig. 7 is a block diagram of another computing device 50 according to an embodiment of the present disclosure. As shown in fig. 7, the computing device 50 includes: a storage medium 501, a register unit 502 (preferably, a scalar data storage unit, a scalar register unit), an operation unit 503 (may also be referred to as a matrix operation unit 503), and a control unit 504;
a storage medium 501 for storing a matrix;
a scalar data storage unit 502 for storing scalar data including at least: a storage address of the matrix within the storage medium;
a control unit 504, configured to control the arithmetic unit to obtain a first arithmetic instruction, where the first arithmetic instruction is used to implement matrix-to-matrix arithmetic, and the first arithmetic instruction includes a matrix reading instruction required to execute the instruction;
an arithmetic unit 503, configured to send a read command to the storage medium according to the matrix reading instruction, read the matrix corresponding to the matrix reading instruction in a batch reading mode, and execute the first operation instruction on the matrix.
Optionally, the matrix reading instruction includes: the memory address of the matrix required by the instruction or the identification of the matrix required by the instruction.
Optionally, when the matrix reading instruction is the identifier of the matrix required by the instruction,
a control unit 504, configured to control the arithmetic unit to read, according to the identifier, the storage address corresponding to the identifier from the register unit in a unit reading manner, control the arithmetic unit to send a reading command for reading the storage address to the storage medium, and acquire the matrix in a batch reading manner.
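The two ways of locating a matrix (by storage address or by identifier) can be sketched as follows. This is a simplified software model under stated assumptions: the class `StorageMedium`, the function `fetch_matrix`, the tuple encoding of the matrix reading instruction, and the dense row-major layout are all hypothetical.

```python
class StorageMedium:
    """Toy model of the storage medium as a flat list of words."""

    def __init__(self, words):
        self.words = list(words)

    def batch_read(self, address, length):
        """Return `length` consecutive words in a single read."""
        return self.words[address:address + length]


def fetch_matrix(reading_instruction, register_unit, storage, rows, cols):
    """Resolve the matrix location, then batch-read the matrix.

    `reading_instruction` is ("address", value) or ("id", value).
    """
    kind, value = reading_instruction
    if kind == "id":
        # Unit read: a single scalar (the storage address) from the register unit.
        address = register_unit[value]
    else:
        address = value
    flat = storage.batch_read(address, rows * cols)
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


storage = StorageMedium(range(50))
register_unit = {3: 10}  # identifier 3 maps to storage address 10
by_id = fetch_matrix(("id", 3), register_unit, storage, 2, 2)
by_address = fetch_matrix(("address", 10), register_unit, storage, 2, 2)
```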
Optionally, the operation unit 503 is specifically configured to execute the first operation instruction on the matrix in a multi-level pipeline-level calculation manner.
Optionally, each of the multiple pipeline stages comprises at least one operator,
an operation unit 503, specifically configured to calculate the matrix by using a first selection operator in the first-stage pipeline stage to obtain a first result according to the selection of the multiplexer, input the first result to a second selection operator in the second-stage pipeline stage to perform calculation to obtain a second result, and so on until the i-1 th result is input to the i-th selection operator in the i-th pipeline stage to perform calculation to obtain an i-th result; inputting the ith result into the storage medium for storage; wherein the number i of the multiple pipeline stages is determined according to a computation topology of the first operation instruction, and i is a positive integer.
Optionally, each pipeline stage in the multiple pipeline stages includes a preset fixed operator, the fixed operators in each pipeline stage are different,
an operation unit 503, configured to calculate the matrix by using a fixed operator in the first-stage pipeline stage to obtain a first result, input the first result to a second fixed operator in the second-stage pipeline stage to perform calculation to obtain a second result, and so on until the (i-1) th result is input to the fixed operator in the i-stage pipeline stage to perform calculation to obtain an ith result; inputting the ith result into the storage medium for storage; wherein the number i of the multiple pipeline stages is determined according to a computation topology of the first operation instruction, and i is a positive integer.
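The multi-stage calculation described above can be sketched as a chain of stages in which a per-stage multiplexer picks one operator. The operator names and the use of `None` to model a stage that performs no computation are assumptions made for this illustration.

```python
def scale_by_two(m):
    return [[2 * x for x in row] for row in m]


def negate(m):
    return [[-x for x in row] for row in m]


def add_one(m):
    return [[x + 1 for x in row] for row in m]


def run_pipeline(matrix, stages, selections):
    """Pass `matrix` through the pipeline stages in order.

    stages: one dict per stage, mapping an operator name to a function.
    selections: per-stage operator name chosen by that stage's multiplexer,
                or None when the stage performs no computation.
    """
    result = matrix
    for operators, choice in zip(stages, selections):
        if choice is None:
            # No computation in this stage or in any later stage.
            break
        result = operators[choice](result)
    return result


STAGES = [
    {"scale": scale_by_two, "negate": negate},  # first-stage operators
    {"add_one": add_one},                       # second-stage operators
]
both_stages = run_pipeline([[1, 2]], STAGES, ["scale", "add_one"])
first_only = run_pipeline([[1, 2]], STAGES, ["negate", None])
```

A pipeline of fixed operators corresponds to the special case in which every stage dictionary holds exactly one entry, so no selection is needed.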
Optionally, the first operation instruction is a matrix inversion instruction MINV,
an operation unit 503, configured to: according to the selection of the multiplexer, perform matrix augmentation on the matrix by using a nonlinear operator in the first pipeline stage to obtain a first result; input the first result into the second pipeline stage, where a matrix addition operator performs elementary row transformations on it to obtain a second result; input the second result into the third pipeline stage, where a nonlinear operator performs matrix truncation on it to obtain a third result; and input the third result to the storage medium for storage.
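A software analogue of the three stages may help. The sketch below reads them as: append an identity matrix to the input, apply row transformations until the left half becomes the identity, and cut off the left half. Gauss-Jordan elimination is swapped in here as the concrete row-transformation procedure; the application does not name a specific algorithm, and all function names are hypothetical.

```python
from fractions import Fraction


def augment(a):
    """Stage 1: append an identity matrix to the right of `a`."""
    n = len(a)
    return [[Fraction(x) for x in row] +
            [Fraction(int(i == j)) for j in range(n)]
            for i, row in enumerate(a)]


def row_reduce(aug):
    """Stage 2: row transformations turning the left half into the identity."""
    n = len(aug)
    m = [row[:] for row in aug]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [x / pivot_value for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return m


def truncate(aug):
    """Stage 3: keep only the right half, which now holds the inverse."""
    n = len(aug)
    return [row[n:] for row in aug]


def matrix_inverse(a):
    return truncate(row_reduce(augment(a)))


inv = matrix_inverse([[2, 1], [5, 3]])
```

Exact rational arithmetic is used only to keep the example free of rounding noise; a hardware unit would operate on the data type specified by the instruction.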
Optionally, the first operation instruction is any one of the following: a matrix exponential instruction MEXP, a matrix logarithm instruction MLOG, and a matrix square-root instruction MSQRT,
an operation unit 503, configured to perform, according to the selection of the multiplexer, an element-by-element exponential, logarithm, or square-root operation on the matrix by using a nonlinear operator in the first-stage pipeline stage, so as to obtain a first result; and inputting the first result to the storage medium for storage.
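As a minimal sketch, the single-stage element-by-element behaviour can be modelled by a lookup from the instruction mnemonic to a scalar function. The dispatch table and the choice of the natural logarithm are assumptions of this example; the application does not state the logarithm base.

```python
import math

# Assumed mapping from instruction mnemonic to the scalar function applied.
NONLINEAR_OPS = {
    "MEXP": math.exp,
    "MLOG": math.log,    # natural logarithm assumed
    "MSQRT": math.sqrt,
}


def elementwise(opcode, matrix):
    """Apply the nonlinear operator selected by `opcode` to every element."""
    fn = NONLINEAR_OPS[opcode]
    return [[fn(x) for x in row] for row in matrix]


roots = elementwise("MSQRT", [[4.0, 9.0], [16.0, 25.0]])
logs = elementwise("MLOG", [[1.0]])
exps = elementwise("MEXP", [[0.0]])
```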
Optionally, the computing apparatus further includes:
a cache unit 505, configured to cache an operation instruction to be executed;
the control unit 504 is configured to cache an operation instruction to be executed in the cache unit 505.
Optionally, the control unit 504 is configured to determine whether an association relationship exists between the first operation instruction and a second operation instruction before the first operation instruction, if the association relationship exists between the first operation instruction and the second operation instruction, cache the first operation instruction in the cache unit, and after the second operation instruction is executed, extract the first operation instruction from the cache unit and transmit the first operation instruction to the operation unit;
the determining whether the first operation instruction and a second operation instruction before the first operation instruction have an association relationship includes:
extracting a first storage address interval of a matrix required in the first operation instruction according to the first operation instruction, extracting a second storage address interval of the matrix required in the second operation instruction according to the second operation instruction, if the first storage address interval and the second storage address interval have an overlapped area, determining that the first operation instruction and the second operation instruction have an association relation, and if the first storage address interval and the second storage address interval do not have an overlapped area, determining that the first operation instruction and the second operation instruction do not have an association relation.
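The association check reduces to an interval-overlap test. The sketch below models each storage address interval as a half-open range; the helper names and the "cache"/"issue" return values are hypothetical.

```python
def address_interval(start, length):
    """Half-open interval [start, start + length) of storage addresses."""
    return (start, start + length)


def overlaps(a, b):
    """True when the two half-open intervals share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


def dispatch(first_interval, second_interval, second_finished):
    """Decide what to do with the first (later) operation instruction."""
    if overlaps(first_interval, second_interval) and not second_finished:
        # Association exists: hold the instruction in the cache unit.
        return "cache"
    # No association, or the earlier instruction has finished executing.
    return "issue"


earlier = address_interval(100, 16)   # addresses 100..115
later_a = address_interval(112, 16)   # addresses 112..127, overlaps
later_b = address_interval(116, 16)   # addresses 116..131, disjoint
```

With half-open intervals, two matrices stored back to back (one ending at address 115, the next starting at 116) are correctly treated as independent.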
Optionally, the control unit 504 may be configured to obtain an operation instruction from the instruction cache unit, process the operation instruction, and provide the processed operation instruction to the operation unit. The control unit 504 may be divided into three modules: an instruction fetching module 5031, a decoding module 5032, and an instruction queue module 5033,
the instruction fetching module 5031 is configured to obtain an operation instruction from the instruction cache unit;
a decoding module 5032, configured to decode the obtained operation instruction;
an instruction queue module 5033, configured to store the decoded operation instructions in order; because different instructions may depend on the same registers, it caches the decoded instructions and issues each one once its dependencies are satisfied.
Referring to fig. 8, fig. 8 is a flowchart of a computing device according to an embodiment of the present disclosure executing an operation instruction. The hardware structure of the computing device is the structure shown in fig. 7. Taking the case where the storage medium shown in fig. 7 is a scratchpad memory as an example, the process of executing a matrix logarithm instruction MLOG includes:
Step S601, the computing device controls the instruction fetching module to fetch a matrix logarithm instruction, and sends the matrix logarithm instruction to the decoding module.
Step S602, the decoding module decodes the matrix logarithm instruction, and sends the matrix logarithm instruction to an instruction queue.
Step S603, in the instruction queue, the matrix logarithm instruction needs to obtain data in the scalar registers corresponding to the four operation domains in the instruction from the scalar register file, where the data includes the row number of the input matrix, the column number of the input matrix, the input matrix address, and the output matrix address.
Step S604, the control unit determines whether the matrix logarithm instruction and the operation instruction before the matrix logarithm instruction have an association relationship, if so, the matrix logarithm instruction is stored in the cache unit, and if not, the matrix logarithm instruction is transmitted to the operation unit.
Step S605, the arithmetic unit fetches the required matrix data from the scratchpad memory according to the data in the scalar registers corresponding to the operation domains obtained in step S603, and then completes the logarithm operation in the arithmetic unit.
Step S606, after the arithmetic unit completes the operation, the result is written to the designated address of the memory (preferably the scratchpad memory or the scalar register file), and the matrix logarithm instruction in the reorder buffer is committed.
Optionally, in the step S605, when the arithmetic unit performs the logarithm operation, the calculating device may use a non-linear operator to perform the matrix element-by-element logarithm operation.
In a specific implementation, after the decoding module decodes the matrix logarithm instruction, the fetched matrix is input, according to the control signal generated by decoding, to the nonlinear operator selected by the multiplexer of the first pipeline stage, which computes the element-by-element logarithm of the matrix to obtain a first result. The multiplexer of the second pipeline stage determines from the control signal that this stage is a null option, so the first result (i.e., the output result) is transmitted directly to the output terminal.
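The data path of this flow can be sketched in software as follows. This is a simplified model, not the device's behaviour: the instruction encoding, the register numbering, the dense row-major layout, and the use of the natural logarithm are all assumptions made for the example.

```python
import math


def execute_mlog(instruction, scalar_registers, scratchpad):
    """Model the data path of a matrix logarithm instruction."""
    # Read the four operation domains from the scalar register file:
    # row count, column count, input matrix address, output matrix address.
    rows, cols, src, dst = (scalar_registers[r]
                            for r in instruction["operands"])
    # Fetch the matrix from the scratchpad memory in one batch.
    flat = scratchpad[src:src + rows * cols]
    # First pipeline stage: nonlinear operator, element-by-element logarithm.
    first_result = [math.log(x) for x in flat]
    # Second pipeline stage: nothing selected, the first result passes through.
    # Write the result to the output address.
    scratchpad[dst:dst + rows * cols] = first_result
    return first_result


scalar_registers = {0: 2, 1: 2, 2: 0, 3: 4}
scratchpad = [1.0, 1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 9.0]
instruction = {"opcode": "MLOG", "operands": [0, 1, 2, 3]}
result = execute_mlog(instruction, scalar_registers, scratchpad)
```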
The operation instruction in fig. 8 takes the matrix logarithm instruction as an example. In practical applications, the matrix logarithm instruction in the embodiment shown in fig. 8 may be replaced by another matrix operation instruction, such as a matrix inversion instruction, a matrix exponential instruction, a matrix square-root instruction, a matrix up-down flip instruction, a matrix left 90° flip instruction, or a matrix right 90° flip instruction, which are not described again here.
Embodiments of the present application also provide a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program enables a computer to execute some or all of the steps of any implementation described in the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform any, some or all of the steps of any of the implementations described in the above method embodiments.
An embodiment of the present application further provides an acceleration apparatus, including: a memory storing executable instructions; and a processor configured to execute the executable instructions in the memory and, when executing the instructions, to operate according to the implementations described in the above method embodiments.
The processor may be a single processing unit or may comprise two or more processing units. In addition, the processor may include a general-purpose processor (CPU) or a graphics processor (GPU); it may also include a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC) to set up and operate the neural network. The processor may also include on-chip memory (i.e., memory within the processing device) for caching purposes.
In some embodiments, a chip is also disclosed, which includes the neural network processor for performing the above method embodiments.
In some embodiments, a chip packaging structure is disclosed, which includes the above chip.
In some embodiments, a board card is disclosed, which includes the above chip package structure.
In some embodiments, an electronic device is disclosed that includes the above board card.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an airplane, a ship, and/or a motor vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and/or a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. The above-described apparatus embodiments are merely illustrative: the division of the units is only a division by logical function, and other divisions are possible in an actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned memory comprises: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program code.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable memory, which may include: flash drives, Read-Only Memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (8)

1. A computing method applied in a computing apparatus including a storage medium, a register unit, and a matrix operation unit, the method comprising:
the computing device controls the matrix operation unit to obtain a first operation instruction, the first operation instruction is used for realizing operation from a matrix to a matrix, the first operation instruction comprises a matrix reading instruction required by the execution of the instruction, the required matrix is at least one matrix, and the at least one matrix is the same in length or different in length;
the computing device controls the matrix operation unit to send a reading command to the storage medium according to the matrix reading instruction;
the computing device controls the matrix operation unit to read the matrix corresponding to the matrix reading instruction from the storage medium in a batch reading mode and execute the first operation instruction on the matrix in a multi-level pipeline computing mode;
the first operation instruction comprises any one of: a matrix inversion instruction MINV and an adjoint-matrix instruction;
the MINV comprises: an opcode and an operation domain, the operation domain comprising: TYPE, N, A, LDA, B, LDB; wherein TYPE is the data type involved in the matrix operation, N is the number of rows of matrix A′, A is the start address of matrix A′, LDA is the start-address interval between two adjacent row vectors or two adjacent column vectors of matrix A′, B is the start address of matrix B′, and LDB is the start-address interval between two adjacent row vectors or two adjacent column vectors of matrix B′.
2. The method of claim 1, wherein each of the multiple pipeline stages includes at least one operator,
the executing the first operation instruction on the matrix by adopting a multi-level pipeline-level computing mode comprises:
the computing device controls the matrix operation unit to calculate the matrix by using a first selection operator in a first-stage pipeline stage according to the selection of the multiplexer to obtain a first result, inputs the first result into a second selection operator in a second-stage pipeline stage to perform calculation to obtain a second result, and so on until the (i-1)th result is input into an ith selection operator in an ith-stage pipeline stage to perform calculation to obtain an ith result;
inputting the ith result into the storage medium for storage;
the ith result is an output matrix, the number i of the multiple stages of pipeline stages is determined according to a calculation topological structure of the first operation instruction, and i is a positive integer.
3. The method of claim 1, wherein each of the multiple pipeline stages comprises a pre-configured fixed operator, the fixed operators in each pipeline stage being different,
the executing the first operation instruction on the matrix by adopting a multi-level pipeline-level computing mode comprises:
the computing device controls the matrix operation unit to calculate the matrix by using a fixed operator in a first-stage pipeline stage to obtain a first result, inputs the first result into a fixed operator in a second-stage pipeline stage to perform calculation to obtain a second result, and so on until the (i-1)th result is input into a fixed operator in an ith-stage pipeline stage to perform calculation to obtain an ith result;
inputting the ith result into the storage medium for storage;
wherein the number i of the multiple pipeline stages is determined according to a computation topology of the first operation instruction, and i is a positive integer.
4. The method according to any one of claims 1-3, wherein each of the multiple pipeline stages is configured with a corresponding multiplexer, the multiplexer is provided with a null option, the null option is used for indicating that no calculation operation is performed on a k-th pipeline stage connected with the multiplexer and subsequent (k + 1) -th to i-th pipeline stages, wherein k is a positive integer less than or equal to i;
the arithmetic unit in each of the multiple pipeline stages comprises any one or a combination of more than one of: a matrix addition operator, a matrix multiplication operator, a matrix scalar multiplication operator, a nonlinear operator, and a matrix comparison operator.
5. A computing device, comprising a storage medium, a register unit, a matrix operation unit, and a controller unit;
the storage medium is used for storing the matrix;
the register unit is configured to store scalar data, where the scalar data at least includes: a storage address of the matrix within the storage medium;
the controller unit is configured to control the matrix operation unit to obtain a first operation instruction, where the first operation instruction is used to implement an operation from a matrix to a matrix, the first operation instruction includes a matrix reading instruction required by executing the instruction, the required matrix is at least one matrix, and the at least one matrix is a matrix with the same length or a matrix with different lengths;
the matrix operation unit is used for sending a reading command to the storage medium according to the matrix reading instruction; reading a matrix corresponding to the matrix reading instruction by adopting a batch reading mode, and executing the first operation instruction on the matrix by adopting a multi-level pipeline calculation mode;
the first operation instruction comprises any one of: a matrix inversion instruction MINV and an adjoint-matrix instruction;
the MINV comprises: an opcode and an operation domain, the operation domain comprising: TYPE, N, A, LDA, B, LDB; wherein TYPE is the data type involved in the matrix operation, N is the number of rows of matrix A′, A is the start address of matrix A′, LDA is the start-address interval between two adjacent row vectors or two adjacent column vectors of matrix A′, B is the start address of matrix B′, and LDB is the start-address interval between two adjacent row vectors or two adjacent column vectors of matrix B′.
6. A chip comprising the computing device of claim 5.
7. An electronic device, characterized in that it comprises a chip according to claim 6.
8. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any of claims 1-4.
CN201711362568.8A 2017-12-15 2017-12-15 Calculation method and related product Active CN108037908B (en)

Publications (2)

Publication Number Publication Date
CN108037908A CN108037908A (en) 2018-05-15
CN108037908B true CN108037908B (en) 2021-02-09

Family

ID=62099684

Country Status (1)

Country Link
CN (1) CN108037908B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

GR01 Patent grant