CN116737239A

CN116737239A - Instruction processing method, device, equipment and medium

Info

Publication number: CN116737239A
Application number: CN202210203953.2A
Authority: CN
Inventors: 班志华
Original assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd; Guangzhou Shiyuan Artificial Intelligence Innovation Research Institute Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd; Guangzhou Shiyuan Artificial Intelligence Innovation Research Institute Co Ltd
Priority date: 2022-03-03
Filing date: 2022-03-03
Publication date: 2023-09-12

Abstract

The application provides an instruction processing method, comprising: receiving a target instruction, and decoding the target instruction to obtain a first register code and a second register code; acquiring a source operation array from a first display register set corresponding to the first register code, performing matrix multiplication on the source operation array to obtain a first parameter set, and storing the first parameter set into a first hidden register set; and acquiring a destination operation array from a second display register set corresponding to the second register code, acquiring the first parameter set from the first hidden register set, adding the destination operation array and the first parameter set, and writing the addition result back into the second display register set to obtain a target processing result. The instruction performs 16 fixed-point multiplications and 16 fixed-point additions simultaneously; it is a single-instruction multiple-data (SIMD) instruction with high calculation efficiency. The result is stored in registers whose channel bit width is four times that of the source operands, which helps avoid result overflow.

Description

Instruction processing method, device, equipment and medium

Technical Field

The present application relates to the field of computer technologies, and in particular to an instruction processing method, apparatus, device, and medium.

Background

Matrix multiplication is a fundamental component of many algorithms and, especially in the field of artificial intelligence, often accounts for most of an algorithm's computation time. To improve the calculation efficiency of these algorithms, the prior art mostly uses single-instruction multiple-data (SIMD) instructions to accelerate matrix multiplication. However, the operands of most existing instructions are 16 bits wide or wider, so a matrix multiplication with 8-bit source operands must first convert the source operands to 16 bits. This adds type conversion time and increases the demand on memory bandwidth, which ultimately lowers the calculation speed.

Disclosure of Invention

The application aims to provide an instruction processing method, apparatus, device, and medium that can improve instruction processing efficiency.

In order to achieve the above purpose, the application adopts the following technical scheme:

The application provides an instruction processing method, which comprises the following steps:

receiving a target instruction, and decoding the target instruction to obtain a first register code and a second register code, wherein the target instruction has a single-instruction multiple-data structure;

acquiring a source operation array from a first display register group corresponding to the first register code, performing matrix multiplication on the source operation array to obtain a first parameter group, and storing the first parameter group into a first hidden register group, wherein the channel bit width of the first display register group is 8 bits;

acquiring a destination operation array from a second display register set corresponding to the second register code, acquiring the first parameter set from the first hidden register set, adding the destination operation array and the first parameter set, and writing the addition result back into the second display register set to obtain a target processing result, wherein the channel bit width of the second display register set is 32 bits, and the bit width of the target processing result is 32 bits.

The application also provides an instruction processing device, which comprises:

a first display register set for storing a source operation array of a target instruction;

a first set of hidden registers for storing a first set of parameters;

and a second display register group for storing a destination operation array of the target instruction.

The application also provides an instruction processing device, which comprises:

a first storage unit for receiving a target instruction and decoding the target instruction to obtain a first register code and a second register code, wherein the target instruction has a single-instruction multiple-data structure;

the first computing unit is used for acquiring a source operation array from a first display register set corresponding to the first register code, performing matrix multiplication computation on the source operation array to obtain a first parameter set, and storing the first parameter set into a first hidden register set, wherein the channel bit width of the first display register set is 8 bits;

and a second computing unit for acquiring a destination operation array from a second display register set corresponding to the second register code, acquiring the first parameter set from the first hidden register set, adding the destination operation array and the first parameter set, and writing the addition result back into the second display register set to obtain a target processing result, wherein the channel bit width of the second display register set is 32 bits, and the bit width of the target processing result is 32 bits.

The application also provides a computer device comprising a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the instruction processing method described above when executing the computer program.

In the instruction processing method of the application, the source operation array and the destination operation array of a target instruction are stored in a first display register group with an 8-bit channel width and a second display register group with a 32-bit channel width, respectively, and the matrix multiplication and the addition are performed separately, so that the complete instruction calculation is decomposed. The addition result is written into the second display register group to obtain a 32-bit target processing result, so the output bit width is four times that of the source operands, which makes data overflow easier to avoid and helps preserve the accuracy of the algorithm. Because the first hidden register set is used implicitly, it occupies no general vector registers and needs no encoding, which improves the processing efficiency of the instruction.

Drawings

FIG. 1 is a flow chart of an instruction processing method according to an embodiment;

FIG. 2 is a schematic diagram of an instruction processing apparatus according to an embodiment;

FIG. 3 is a schematic diagram of an instruction processing apparatus according to an embodiment;

FIG. 4 is a schematic block diagram of the structure of a computer device according to an embodiment.

The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, modules, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, modules, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Referring to fig. 1, a flow chart of an instruction processing method disclosed in the present embodiment includes:

S1: receiving a target instruction, and decoding the target instruction to obtain a first register code and a second register code, wherein the target instruction has a single-instruction multiple-data structure;

S2: acquiring a source operation array from a first display register group corresponding to the first register code, performing matrix multiplication on the source operation array to obtain a first parameter group, and storing the first parameter group into a first hidden register group, wherein the channel bit width of the first display register group is 8 bits;

S3: acquiring a destination operation array from a second display register set corresponding to the second register code, acquiring the first parameter set from the first hidden register set, adding the destination operation array and the first parameter set, and writing the addition result back into the second display register set to obtain a target processing result, wherein the channel bit width of the second display register set is 32 bits, and the bit width of the target processing result is 32 bits.

As described in step S1, in the algorithm processing of a computer, especially in the instruction processing of neural network algorithms, a large number of target instructions often require matrix multiplication. To simplify the matrix processing, before the target instruction is received, the source operation array and the destination operation array are stored into different display register groups according to previously received storage instructions. The different display register groups are identified by different codes, where a code is usually the address of a display register group and a storage instruction is usually a 32-bit binary instruction. When the target instruction is received, it is decoded to obtain a first register code and a second register code, each of which corresponds to a register group; that is, the target instruction specifies the registers required for the calculation, from which the required operation arrays are extracted. The source operation array comprises a plurality of source operands, which are operands whose contents do not change during execution of the target instruction. The destination operation array comprises a plurality of destination operands, which are operands whose contents change during execution of the target instruction. In a specific embodiment, the source operand is 32-bit data stored in a first display register group formed by four 8-bit registers, and the destination operand is 128-bit data stored in a second display register group formed by four 32-bit registers.

As described in step S2, the source operation array is obtained from the first display register set, matrix multiplication is performed on it, and the resulting first parameter set is stored in the first hidden register set. Specifically, the registers in the first display register set may be 32-bit four-channel vector registers, and the registers in the first hidden register set may be 64-bit four-channel vector registers. The source operands are stored in the four channels of the registers in the first display register set, each channel being 8 bits wide, and the source operands of the different channels are obtained from the first display register set and multiplied separately. The matrix multiplication of the instruction is thereby decomposed into multiplications of several small matrices, which reduces the number of instruction cycles and improves calculation efficiency.

As described in step S3, the destination operation array is obtained from the second display register set, the first parameter set is obtained from the first hidden register set, the two are added, and the result is written back into the second display register set, whose channels are 32 bits wide, so that the target processing result is a 32-bit output.

In summary, the source operation array and the destination operation array of the target instruction are stored in a first display register group with an 8-bit channel width and a second display register group with a 32-bit channel width, respectively, and the matrix multiplication and the addition are performed separately, so that the complete instruction calculation is decomposed. The addition result is written into the second display register group to obtain a 32-bit target processing result, so the output bit width is four times that of the source operands, which makes data overflow easier to avoid and helps preserve the accuracy of the algorithm. Because the first hidden register set is used implicitly, it occupies no general vector registers and needs no encoding, which improves the processing efficiency of the instruction.
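The overflow argument can be made concrete with a little arithmetic. The following is a minimal sketch that assumes the 8-bit source channels and the 32-bit destination channels hold signed two's-complement integers; the text only says "fixed point", so the signedness and all names here are illustrative assumptions.

```python
# Assumption (not stated in the text): channels hold signed two's-complement values.
INT8_MIN = -128
INT16_MAX = 2**15 - 1
INT32_MAX = 2**31 - 1

# Largest-magnitude product of two signed 8-bit values: (-128) * (-128) = 2**14.
max_product = INT8_MIN * INT8_MIN

# How many worst-case products can be accumulated before a signed channel overflows.
safe_accumulations_32 = INT32_MAX // max_product  # 32-bit destination channel
safe_accumulations_16 = INT16_MAX // max_product  # 16-bit channel, for comparison

print(max_product, safe_accumulations_32, safe_accumulations_16)
```

Under these assumptions a 32-bit channel tolerates on the order of 10^5 worst-case accumulations, whereas a 16-bit channel would already overflow on the second one, which is the sense in which the wider output makes overflow easier to avoid.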

In one embodiment, before receiving the target instruction, the method further includes:

selecting a preset number of vector registers to form a first display register group according to a first preset code;

selecting a vector register as a first destination register according to a second preset code;

and selecting a plurality of vector registers which are adjacent to the first destination register and are sequentially arranged, and adding the vector registers into the second display register group.

As described above, before the target instruction is received, an encoding instruction may be sent to the central processing unit, so that the central processing unit selects the corresponding vector registers according to the first preset code and the second preset code in the encoding instruction to form the first display register set and the second display register set. In addition, in a specific application scenario, the user can switch the central processing unit into an automatic coding mode, in which a first preset code and a second preset code are generated automatically and randomly whenever a target instruction is to be received, and the corresponding vector registers are selected according to the addresses corresponding to the codes.

In a specific embodiment, display registers are accessed by code, whereas hidden registers do not occupy general vector registers and therefore need no code, which saves bits in the instruction encoding. For the first display register set, which must be encoded, the number of source operands is small and the matrix multiplication is carried out only with the first hidden register set, so the first display register set contains few vector registers, and these are encoded one by one according to the number actually required.

Specifically, for the second display register set to be encoded, since the number of destination operands is large and matrix multiplication is required by the second display register set, the number of vector registers in the second display register set is large.

Specifically, the three consecutive vector registers after the first destination register may be selected as registers of the destination operand, that is, four vector registers in total are selected to form the second display register set, denoted Vd, Vd+1, Vd+2, and Vd+3, respectively.
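The selection of consecutive destination registers can be sketched as follows. This is an illustrative model only: the function name, the register-file size of 32, and the error raised when the group would run past the register file are assumptions that do not come from the text.

```python
def second_display_register_group(vd_index, group_size=4, num_vector_registers=32):
    """Return the indices of Vd, Vd+1, ..., Vd+group_size-1.

    num_vector_registers=32 is a hypothetical register-file size.
    """
    last = vd_index + group_size - 1
    if vd_index < 0 or last >= num_vector_registers:
        # Hypothetical policy: reject a group that does not fit in the register file.
        raise ValueError("destination group does not fit in the register file")
    return list(range(vd_index, last + 1))


# Only the index of the first destination register has to be encoded;
# the other three follow from it.
group = second_display_register_group(8)
print(group)
```

Encoding a single index for the whole group is one way to read the remark that the encoding saves instruction bits, although the text does not spell out the bit layout.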

In one embodiment, the source operation array includes a first source operand and a second source operand, and the storing the source operation array of the target instruction in the first display register set includes:

storing the first source operand into a first source register of the first display register set, and storing the second source operand into a second source register of the first display register set, wherein the number of channels of the first source register is the same as that of channels of the second source register.

As described above, since in actual matrix computation a source operation array is typically composed of two source operands, two vector registers may be selected by code in the central processing unit as a first source register and a second source register, which store the source operands required for the matrix multiplication and facilitate the subsequent matrix decomposition. The first source register and the second source register are denoted Vm and Vn, respectively.

In one embodiment, the obtaining a source operation array from the first display register set corresponding to the first register code, performing matrix multiplication on the source operation array to obtain a first parameter set, and storing the first parameter set in a first hidden register set includes:

performing matrix multiplication construction on the first source operands stored in each channel in the first source register and the second source operands stored in each channel in the second source register to obtain a first matrix formula, and performing multiplication calculation to obtain the first parameter set;

selecting vector registers, the number of which is the same as that of channels of the first source register, as hidden registers, and forming a first hidden register group according to the hidden registers, wherein the number of channels of the hidden registers is the same as that of channels of the first source register;

and respectively storing parameters in the first parameter set in different channels of different hidden registers.

As described above, using Vm[i] and Vn[i] to represent the source operands in the i-th channel of the first source register and the second source register, the first matrix formula is as follows:

Vi = (Vm[0]*Vn[0], Vm[1]*Vn[0], Vm[2]*Vn[0], Vm[3]*Vn[0])

Vi+1 = (Vm[0]*Vn[1], Vm[1]*Vn[1], Vm[2]*Vn[1], Vm[3]*Vn[1])

Vi+2 = (Vm[0]*Vn[2], Vm[1]*Vn[2], Vm[2]*Vn[2], Vm[3]*Vn[2])

Vi+3 = (Vm[0]*Vn[3], Vm[1]*Vn[3], Vm[2]*Vn[3], Vm[3]*Vn[3])

The multiplication results Vi, Vi+1, Vi+2, and Vi+3 form the first parameter set.
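The first matrix formula is the outer product of the four channels of Vm with the four channels of Vn, giving 16 products. A minimal software sketch, modelling each register as a Python list of four integers (the function name and the sample values are illustrative assumptions):

```python
def first_matrix_formula(vm, vn):
    """Return [Vi, Vi+1, Vi+2, Vi+3]; channel j of hidden register r is Vm[j] * Vn[r]."""
    assert len(vm) == 4 and len(vn) == 4
    return [[vm[j] * vn[r] for j in range(4)] for r in range(4)]


vm = [1, -2, 3, 4]   # sample 8-bit source operands in Vm
vn = [5, 6, -7, 8]   # sample 8-bit source operands in Vn
hidden = first_matrix_formula(vm, vn)

for r, row in enumerate(hidden):
    print(f"Vi+{r} =", row)
```

Each of the 16 products depends only on one channel of Vm and one channel of Vn, so in hardware they can all be computed in parallel; the nested loops here only model the result.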

In one embodiment, the adding the destination operation array and the first parameter set, and writing the addition result back to the second display register set includes:

constructing a first addition formula for the first parameter set and the destination operation set;

and controlling vector registers in the second display register set to acquire the first addition formula, and respectively distributing addition items in the first addition formula to different channels of different vector registers to respectively perform addition calculation to obtain the target processing result, wherein the number of the vector registers is the same as that of the hidden registers, and the number of the channels of the vector registers is the same as that of the hidden registers.

As described above, with Vd[i], Vd+1[i], Vd+2[i], and Vd+3[i] representing the destination operands in the i-th channel of the vector registers in the second display register set, the first addition formula is as follows:

Vd = (Vd[0]+Vi[0], Vd[1]+Vi[1], Vd[2]+Vi[2], Vd[3]+Vi[3])

Vd+1 = (Vd+1[0]+Vi+1[0], Vd+1[1]+Vi+1[1], Vd+1[2]+Vi+1[2], Vd+1[3]+Vi+1[3])

Vd+2 = (Vd+2[0]+Vi+2[0], Vd+2[1]+Vi+2[1], Vd+2[2]+Vi+2[2], Vd+2[3]+Vi+2[3])

Vd+3 = (Vd+3[0]+Vi+3[0], Vd+3[1]+Vi+3[1], Vd+3[2]+Vi+3[2], Vd+3[3]+Vi+3[3])

Therefore, in this embodiment, four four-channel vector registers need to be selected to form the second display register set. Each of the four vector registers receives one line of the first addition formula, that is, the lines corresponding to Vd, Vd+1, Vd+2, and Vd+3 are sent to the four vector registers respectively, so that the addition results can be allocated to different channels for storage.

For example, the first destination register in the second display register set receives the first addition formula Vd = (Vd[0]+Vi[0], Vd[1]+Vi[1], Vd[2]+Vi[2], Vd[3]+Vi[3]), in which each of the four addition terms Vd[0]+Vi[0], Vd[1]+Vi[1], Vd[2]+Vi[2], and Vd[3]+Vi[3] is calculated by one channel. Since all the additions in the first addition formula are independent of each other, they can be completed simultaneously, which improves calculation efficiency.
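The first addition formula can be sketched in the same list-of-channels model. The wrap-around to a signed 32-bit range on overflow is an assumption made for illustration; the text does not say whether the hardware wraps or saturates, and the function names are hypothetical.

```python
def wrap_int32(x):
    """Assumed overflow behaviour: wrap to the signed 32-bit range."""
    return ((x + 2**31) % 2**32) - 2**31


def first_addition_formula(dest, hidden):
    """Add hidden register r to destination register Vd+r, channel by channel."""
    return [[wrap_int32(dest[r][j] + hidden[r][j]) for j in range(4)]
            for r in range(4)]


# Sample contents of Vd..Vd+3 and of the hidden registers Vi..Vi+3.
dest = [[10, 20, 30, 40] for _ in range(4)]
hidden = [[5, -10, 15, 20],
          [6, -12, 18, 24],
          [-7, 14, -21, -28],
          [8, -16, 24, 32]]

result = first_addition_formula(dest, hidden)
print(result[0])
```

No addition reads the result of another, which is why the 16 channel-wise sums can be completed in the same step.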

In a specific embodiment, for a 4 by 4 output matrix C with 32-bit elements, the two inputs are matrices A and B with 8-bit elements, of sizes 4xk and kx4, respectively. Each invocation of the single-instruction multiple-data multiply-accumulate instruction requires loading 32 bits of data from each of A and B, so the calculation requires a total of 2k 32-bit data loads, k invocations of the target instruction of this embodiment, and 16 32-bit loads and stores.
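The following sketch illustrates why k invocations suffice: each invocation adds one outer product to the 4 by 4 accumulator, and the sum of k outer products is the matrix product. It assumes that Vm is loaded with column p of A and Vn with row p of B on the p-th invocation; the text does not specify this mapping, and the function names are hypothetical.

```python
def target_instruction(vd, vm, vn):
    """Model of one invocation: Vd+r[j] += Vm[j] * Vn[r] for all 16 channels."""
    for r in range(4):
        for j in range(4):
            vd[r][j] += vm[j] * vn[r]


def matmul_4xk_kx4(a, b):
    """Compute C = A x B (A is 4 x k, B is k x 4) with k modelled invocations."""
    k = len(b)
    vd = [[0] * 4 for _ in range(4)]          # Vd..Vd+3, cleared accumulators
    for p in range(k):
        vm = [a[j][p] for j in range(4)]      # assumed: column p of A (4 x 8 bits)
        vn = list(b[p])                       # assumed: row p of B (4 x 8 bits)
        target_instruction(vd, vm, vn)
    # Under this mapping Vd+r holds column r of C, so transpose to row-major order.
    return [[vd[r][j] for r in range(4)] for j in range(4)]


a = [[1, 2], [3, 4], [5, 6], [7, 8]]          # 4 x 2
b = [[1, 0, 2, 0], [0, 1, 0, 2]]              # 2 x 4
c = matmul_4xk_kx4(a, b)
print(c)
```

With this mapping each invocation consumes exactly one 32-bit load from A and one from B, which matches the 2k loads counted in the text.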

Assume that operands can be loaded consecutively from the cache at a rate of a instruction cycles per 4 bytes and that the latency of accessing the cache is 2a cycles per 4 bytes. If each invocation of the target instruction of this embodiment consumes b instruction cycles, the total number of cycles required to complete the above calculation is about:

k(2a+b)+16*2a*2

By contrast, a prior-art single-instruction multiple-data instruction with 16-bit operands, such as the VMLAL instruction of ARM, usually performs at most 4 multiply-add operations per invocation. To calculate the matrix multiplication of this embodiment with such an instruction, each 32-bit load must be expanded into four 16-bit values, so two instructions are consumed to convert the operands before each calculation. Assuming that the number of cycles of a conversion instruction is c, the number of instruction cycles required to compute the output matrix C is approximately:

k(2a+4b+2c)+16*2a*2

In summary, compared with the VMLAL instruction of ARM, the instruction processing method of this embodiment saves at least k(3b+2c) instruction cycles. If a, b, and c are all 1, the speedup approaches 8/3 as k increases, which greatly improves the calculation performance of matrix multiplication.

Referring to fig. 2, a block diagram of an instruction processing apparatus according to the present disclosure includes:

a first set of display registers 110 for storing a source operation array of target instructions;

a first set of hidden registers 120 for storing a first set of parameters;

a second set of display registers 130 for storing a destination operation array of the target instruction.

In one embodiment, the first display register set includes a first source register for storing a first source operand in the source operation array and a second source register for storing a second source operand in the source operation array;

the first hidden register group comprises a plurality of hidden registers, the second display register group comprises a plurality of vector registers, the number of the hidden registers is the same as that of the vector registers, and the channel numbers of the hidden registers, the vector registers, the first source registers and the second source registers are the same.

Referring to fig. 3, a block diagram of an instruction processing apparatus according to the present disclosure includes:

a first storage unit 210, configured to receive a target instruction, and decode the target instruction to obtain a first register code and a second register code, where the target instruction is a single instruction multiple data structure;

a first calculating unit 220, configured to obtain a source operation array from a first display register set corresponding to the first register code, perform matrix multiplication calculation on the source operation array to obtain a first parameter set, and store the first parameter set into a first hidden register set, where a channel bit width of the first display register set is 8 bits;

and a second calculating unit 230, configured to obtain a destination operation array from the second display register set corresponding to the second register code, obtain the first parameter set from the first hidden register set, perform addition calculation on the destination operation array and the first parameter set, and write the addition calculation result back to the second display register set to obtain a target processing result, where a channel bit width of the second display register set is 32 bits, and a bit width of the target processing result is 32 bits.

In one embodiment, the method further comprises an encoding unit for:

In one embodiment, the first storage unit 210 is specifically configured to:

In one embodiment, the first computing unit 220 is specifically configured to:

In one embodiment, the second computing unit 230 is specifically configured to:

Referring to fig. 4, an embodiment of the present application further provides a computer device, whose internal structure may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing text detection data and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements an instruction processing method.

It will be appreciated by those skilled in the art that the architecture shown in fig. 4 is merely a block diagram of a portion of the architecture in connection with the present inventive arrangements and is not intended to limit the computer devices to which the present inventive arrangements are applicable.

An embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an instruction processing method. It is understood that the computer readable storage medium in this embodiment may be a volatile readable storage medium or a nonvolatile readable storage medium.

In the instruction processing method, device, equipment, and medium of the application, the source operation array and the destination operation array of a target instruction are stored in a first display register group with an 8-bit channel width and a second display register group with a 32-bit channel width, respectively, and the matrix multiplication and the addition are performed separately, so that the complete instruction calculation is decomposed. The addition result is written into the second display register group to obtain a 32-bit target processing result, so the output bit width is four times that of the source operands, which makes data overflow easier to avoid and helps preserve the accuracy of the algorithm. Because the first hidden register set is used implicitly, it occupies no general vector registers and needs no encoding, which improves the processing efficiency of the instruction.

Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium provided by the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.

The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application or directly or indirectly applied to other related technical fields are included in the scope of the application.

Claims

1. A method of processing instructions, comprising:

acquiring a source operation array from a first display register group corresponding to the first register code, performing matrix multiplication on the source operation array to obtain a first parameter group, and storing the first parameter group into a first hidden register group, wherein the channel bit width of the first display register group is 8 bits;

obtaining a destination operation array from a second display register set corresponding to the second register code, obtaining the first parameter set from the first hidden register set, performing addition calculation on the destination operation array and the first parameter set, and writing the addition calculation result back into the second display register set to obtain a target processing result, wherein the channel bit width of the second display register set is 32 bits, and the bit width of the target processing result is 32 bits.

2. The method of claim 1, further comprising, prior to receiving the target instruction:

3. The method of claim 2, wherein the source operation array comprises a first source operand and a second source operand, and wherein storing the source operation array of the target instruction into the first set of display registers comprises:

4. A method for processing an instruction according to claim 3, wherein said obtaining a source operation array from a first display register set corresponding to said first register code, performing matrix multiplication on said source operation array to obtain a first parameter set, and storing said first parameter set in a first hidden register set, comprises:

5. The instruction processing method according to claim 4, wherein the adding the destination operation array and the first parameter group and writing the addition result back to the second display register group includes:

6. An instruction processing apparatus, comprising:

a first set of hidden registers for storing a first set of parameters;

7. The instruction processing apparatus of claim 6, wherein the first set of display registers includes a first source register to store a first source operand in the source operation array and a second source register to store a second source operand in the source operation array;

8. An instruction processing apparatus, comprising:

9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the instruction processing method of any of claims 1 to 5.

10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the instruction processing method of any one of claims 1 to 5.