Disclosure of Invention
In view of the above, an object of the present application is to provide a data processing method, a decoding circuit and a processor, so as to solve the problem that a conventional instruction block can accommodate only 8 three-operand instructions, which means that many instructions must be issued during task execution and is unfavorable for power optimization.
The embodiment of the application is realized as follows:
In a first aspect, an embodiment of the present application provides a data processing method, including: determining whether an obtained instruction is a compressed instruction; if so, obtaining key information in the compressed instruction, the key information including an instruction repetition type and an instruction repetition number, where the instruction repetition type indicates the type of instruction to be repeated and the instruction repetition number is a positive integer greater than or equal to 2; and decompressing the compressed instruction according to the instruction repetition type and the instruction repetition number into a plurality of instructions that correspond to the instruction repetition type and whose number equals the instruction repetition number. In the embodiment of the application, when the obtained instruction is a compressed instruction, the key information in it is obtained, and the compressed instruction is then decompressed according to the instruction repetition type and the instruction repetition number in the key information into a plurality of instructions that correspond to the instruction repetition type and whose number equals the instruction repetition number. Because instructions are compressed, one instruction block can accommodate more three-operand instructions, which effectively reduces the probability of an instruction cache miss and improves efficiency.
With reference to one possible implementation manner of the embodiment of the first aspect, decompressing the compressed instruction according to the instruction repetition type and the instruction repetition number includes: generating an instruction according to the address ID corresponding to the operand in the instruction repetition type, and updating the instruction repetition times; updating the address ID corresponding to the operand when the updated instruction repetition times are determined to be larger than a preset threshold; generating an instruction according to the address ID corresponding to the updated operand, and updating the instruction repetition times again; judging whether the instruction repetition times after being updated again is equal to the preset threshold value or not; if yes, determining that decompression of the compressed instruction is finished, and obtaining a plurality of instructions which correspond to the instruction repetition types and are the same as the instruction repetition times. 
In the embodiment of the application, when the compressed instruction is decompressed according to the instruction repetition type and the instruction repetition number, the instruction repetition number is updated after each instruction is generated, and it is then judged whether the updated instruction repetition number is equal to the preset threshold. If not, the address ID corresponding to the operand is updated, an instruction is generated based on the updated address ID, the instruction repetition number is updated again, and the comparison is repeated. Decompression of the compressed instruction is complete once the updated instruction repetition number equals the preset threshold. Throughout this process no additional element (such as a counter) is needed to judge whether decompression is complete: updating the instruction repetition number directly after each generated instruction is sufficient, which simplifies the processing flow while preserving accuracy and saves cost.
With reference to one possible implementation manner of the embodiment of the first aspect, decompressing the compressed instruction according to the instruction repetition type and the instruction repetition number includes: generating an instruction according to the address ID corresponding to the operand in the instruction repetition type, and recording the generation times of the generated instruction; when the generation times are determined to be less than the instruction repetition times, updating the address ID corresponding to the operand; generating an instruction according to the address ID corresponding to the updated operand, and updating the generation times; judging whether the updated generation times are equal to the instruction repetition times or not; if yes, determining that decompression of the compressed instruction is finished, and obtaining a plurality of instructions which correspond to the instruction repetition types and are the same as the instruction repetition times. 
In the embodiment of the application, when the compressed instruction is decompressed according to the instruction repetition type and the instruction repetition number, the number of generated instructions is recorded after an instruction is generated, and it is judged whether the recorded generation count is equal to the instruction repetition number. If not, the address ID corresponding to the operand is updated, an instruction is generated based on the updated address ID, the generation count is updated, and the comparison is repeated. Decompression of the compressed instruction is complete once the updated generation count equals the instruction repetition number. In this process a counter records the number of generated instructions and is updated after each instruction is generated. This provides another feasible approach and broadens the applicability of the scheme.
With reference to a possible implementation manner of the embodiment of the first aspect, the updating the address ID corresponding to the operand includes: and updating the address ID corresponding to the operand according to the operand source type pointed by the address ID corresponding to the operand. In the embodiment of the application, the address ID corresponding to the operand is updated according to the operand source type pointed by the address ID corresponding to the operand, so that rules for updating the address ID corresponding to the operand according to different operand source types can be different when the address is updated.
With reference to a possible implementation manner of the embodiment of the first aspect, the updating the address ID corresponding to the operand includes: updating the address ID corresponding to the operand according to the data type of the data stored in the operand source pointed to by the address ID corresponding to the operand. In the embodiment of the application, the address ID corresponding to the operand is updated according to the data type of the data stored in the operand source pointed to by that address ID, so that different data types can correspond to different update rules when the address is updated.
With reference to a possible implementation manner of the embodiment of the first aspect, the operand in the instruction repetition type is a destination operand, and the key information further includes: a destination cut-through DF field, wherein before updating the address ID corresponding to the operand, the method further comprises: determining that the value in the destination cut-through DF field is not a set threshold. In this embodiment, when the operand is the destination operand, it must be determined that the value in the destination cut-through DF field is not the set threshold before the address ID corresponding to the operand is updated, so as to avoid affecting data cut-through.
With reference to a possible implementation manner of the embodiment of the first aspect, before obtaining the key information in the compressed instruction, the method further includes: determining that the compressed instruction is valid. In the embodiment of the application, the compressed instruction must be determined to be valid before its key information is obtained, which improves efficiency and avoids wasting resources on decompressing an erroneous compressed instruction.
In a second aspect, an embodiment of the present application further provides a decoding circuit, including: a decoder and an instruction decompression module. The decoder is used for judging whether an obtained instruction is a compressed instruction and, if so, obtaining key information in the compressed instruction, the key information including an instruction repetition type and an instruction repetition number, where the instruction repetition type indicates the type of instruction to be repeated and the instruction repetition number is a positive integer greater than or equal to 2. The instruction decompression module is used for decompressing the compressed instruction according to the instruction repetition type and the instruction repetition number into a plurality of instructions that correspond to the instruction repetition type and whose number equals the instruction repetition number.
With reference to one possible implementation manner of the embodiment of the second aspect, the instruction decompressing module includes: the controller is used for acquiring an address ID corresponding to an operand in the instruction repetition type; the instruction generator is used for generating an instruction according to the address ID corresponding to the operand in the instruction repetition type; the controller is further configured to update the instruction repetition times after the instruction generator generates an instruction according to an address ID corresponding to an operand in the instruction repetition type, update the address ID corresponding to the operand when it is determined that the updated instruction repetition times is greater than a preset threshold, and send the updated address ID corresponding to the operand to the instruction generator; the instruction generator is further used for generating an instruction according to the updated address ID corresponding to the operand; the controller is further configured to update the instruction repetition number again after the instruction generator generates an instruction according to the updated address ID corresponding to the operand, and determine whether the instruction repetition number updated again is equal to the preset threshold; if yes, determining that decompression of the compressed instruction is finished, and obtaining a plurality of instructions which correspond to the instruction repetition types and are the same as the instruction repetition times.
With reference to one possible implementation manner of the embodiment of the second aspect, the instruction decompressing module includes: the controller is used for acquiring an address ID corresponding to an operand in the instruction repetition type; the instruction generator is used for generating an instruction according to the address ID corresponding to the operand in the instruction repetition type; the controller is further configured to record the generation times of the generated instruction after the instruction generator generates the instruction according to the address ID corresponding to the operand in the instruction repetition type, update the address ID corresponding to the operand when it is determined that the generation times is smaller than the instruction repetition times, and send the updated address ID corresponding to the operand to the instruction generator; the instruction generator is further used for generating an instruction according to the updated address ID corresponding to the operand; the controller is further configured to update the generation times and determine whether the updated generation times is equal to the instruction repetition times after the instruction generator generates an instruction according to the updated address ID corresponding to the operand; if yes, determining that decompression of the compressed instruction is finished, and obtaining a plurality of instructions which correspond to the instruction repetition types and are the same as the instruction repetition times.
In combination with a possible implementation manner of the embodiment of the second aspect, the controller is configured to update the address ID corresponding to the operand according to an operand source type pointed to by the address ID corresponding to the operand.
In combination with a possible implementation manner of the embodiment of the second aspect, the controller is configured to update the address ID corresponding to the operand according to a data type of data stored in an operand source pointed to by the address ID corresponding to the operand.
With reference to a possible implementation manner of the embodiment of the second aspect, the operand in the instruction repeat type is a destination operand, and the key information further includes: a destination cut-through DF field, wherein the controller is further configured to determine that a value in the destination cut-through DF field is not a set threshold value before updating the address ID corresponding to the operand.
With reference to a possible implementation manner of the embodiment of the second aspect, the operand source type pointed to by the address ID corresponding to the operand in the instruction repeat type is LDS, and the instruction decompression module further includes: the configuration register is used for storing and acquiring the address of a source operand in the LDS, and automatically updating the address of the configuration register to the address corresponding to the next source operand after the corresponding source operand is read from the LDS according to the current address; correspondingly, the controller is configured to update the address ID corresponding to the operand according to the address currently indicated by the configuration register, where the address ID corresponding to the operand is the same as the address currently indicated by the configuration register.
With reference to a possible implementation manner of the embodiment of the second aspect, the decoder is further configured to determine that the compressed instruction is valid before obtaining the key information in the compressed instruction.
With reference to a possible implementation manner of the embodiment of the second aspect, the instruction decompressing module is further configured to send, to the decoder, an instruction to prevent the decoder from acquiring the instruction from the instruction distribution unit when the key information sent by the decoder is received, and send, to the decoder, an instruction to allow the decoder to acquire the instruction from the instruction distribution unit when it is determined that decompression of the compressed instruction is finished.
In a third aspect, an embodiment of the present application further provides a processor, including: an instruction dispatch unit, an instruction execution unit, and the decoding circuit provided in the embodiment of the second aspect and/or in combination with any one of the possible implementations of the embodiment of the second aspect, wherein the instruction dispatch unit and the instruction execution unit are each connected to the decoding circuit.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, relational terms such as "first," "second," and the like may be used solely in the description herein to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Further, the term "and/or" in the present application merely describes an association relationship between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.
Currently, a cache line can store only 8 arithmetic instructions with 3 operands. To avoid instruction cache misses, an instruction block can therefore accommodate only 8 three-operand instructions, which is far from sufficient for power optimization. The embodiment of the present application therefore provides an efficient instruction compression method by which 64 three-operand instructions can be compressed into 64 bits, so that each cache line can store at most 512 three-operand instructions. This improves operation performance and can significantly reduce instruction cache misses.
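The capacity figure above can be checked with a short calculation. The sketch below assumes, as the text implies, that a cache line holds 8 instruction slots and that one compressed instruction occupies a single slot while standing for up to 64 three-operand instructions; the constant and function names are illustrative only.

```python
# Figures taken from the text; the names are hypothetical.
SLOTS_PER_CACHE_LINE = 8        # three-operand instructions per cache line
MAX_REPEAT_PER_COMPRESSED = 64  # instructions represented by one compressed slot


def max_instructions_per_line(slots: int = SLOTS_PER_CACHE_LINE,
                              repeat: int = MAX_REPEAT_PER_COMPRESSED) -> int:
    """Upper bound on the decompressed instructions one cache line can represent."""
    return slots * repeat
```

With the default values this gives 8 x 64 = 512, matching the figure stated in the text.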
To support compressing 64 three-operand instructions into 64 bits, the application introduces a VOP3R (Vector Operation with 3 operands and Repeat) instruction whose type field is set to "110010"; that is, 110010 indicates that the instruction is a VOP3R instruction, as shown in fig. 1. The VOP3R instruction defines the following special fields; see table 1.
TABLE 1
It should be noted that the bit width of each field in table 1 is fixed, whereas its position may change. For example, Repeat_Enable need not occupy bits [62:59]; it may instead occupy bits [3:0], which has the same width, and the other fields are similar.
Repeat_Enable is the repeat enable field, 4 bits wide. Each bit indicates whether one of the source operands (Operand0, Operand1, Operand2) or the destination operand (also referred to as the Result) is repeated, for example: B[59:59] (or B[0:0]): repeat Operand0; B[60:60] (or B[1:1]): repeat Operand1; B[61:61] (or B[2:2]): repeat Operand2; B[62:62] (or B[3:3]): repeat destination. It should be noted that repetition applies only to source operands that come from a Vector General Purpose Register (VGPR), a Scalar General Purpose Register (SGPR) or the Local Data Store (LDS_DIRECT), and to destination operands that come from a VGPR or SGPR; all other cases are ignored.
To support such instruction repetition in hardware, the embodiment of the present application provides a decoding circuit, as shown in fig. 2. After the decoding circuit obtains an instruction from the instruction distribution unit (Instruction Dispatch), it judges whether the instruction is a compressed instruction. If not, the decoding circuit sends the instruction directly to the instruction execution unit (Instruction Execution), which executes it. If so, the decoding circuit obtains the key information in the compressed instruction and decompresses the compressed instruction according to the instruction repetition type and the instruction repetition number in the key information into a plurality of instructions that correspond to the instruction repetition type and whose number equals the instruction repetition number.
The key information includes an instruction repetition type and an instruction repetition number, where the instruction repetition type indicates the type of instruction to be repeated and the instruction repetition number is a positive integer greater than or equal to 2. The instruction repetition type is obtained from the repeat enable field (Repeat_Enable) of the compressed instruction, and the instruction repetition number is obtained from the repeat count field (Repeat_Counter). Whether the current instruction is a compressed instruction can be judged from the Repeat_Counter field: if Repeat_Counter != 0x0 (hexadecimal 0), the instruction is a compressed instruction; if Repeat_Counter == 0x0, the instruction is an uncompressed instruction. The detailed parameters of the key information are shown in table 2.
TABLE 2
Field | Number of bits
Operation_code | 10
Repeat_Counter | 6
Result_ID | 8
Repeat_Enable | 4
Operand2_ID | 9
Operand1_ID | 9
Operand0_ID | 9
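As an illustration of how the key information might be held once the decoder has extracted it, the following minimal sketch models the Table 2 fields with their bit widths and applies the Repeat_Counter test described above. The class name, method names and the width check are assumptions made for this sketch; the bit positions of the fields are not modeled, since they are defined by table 1.

```python
from dataclasses import dataclass

# Field widths in bits, taken from Table 2.
FIELD_BITS = {
    "operation_code": 10,
    "repeat_counter": 6,
    "result_id": 8,
    "repeat_enable": 4,
    "operand2_id": 9,
    "operand1_id": 9,
    "operand0_id": 9,
}


@dataclass
class KeyInfo:
    """Hypothetical container for the key information of one instruction."""
    operation_code: int
    repeat_counter: int
    result_id: int
    repeat_enable: int
    operand2_id: int
    operand1_id: int
    operand0_id: int

    def validate(self) -> None:
        """Raise ValueError if any field does not fit its Table 2 width."""
        for name, bits in FIELD_BITS.items():
            value = getattr(self, name)
            if not 0 <= value < (1 << bits):
                raise ValueError(f"{name}={value} does not fit in {bits} bits")

    def is_compressed(self) -> bool:
        """A non-zero Repeat_Counter marks a compressed instruction."""
        return self.repeat_counter != 0x0
```

A decoder model could call `validate()` first and then use `is_compressed()` to decide whether to forward the instruction unchanged or hand it to the decompression step.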
For ease of understanding, the specific description is made, for example, with the compression instruction:
Repeat Enable(0x3),Repeat Counter(62)::
Forwarding=LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding;
Decompressing the compressed instruction according to the instruction repetition type (repeat Operand0 and Operand1) and the instruction repetition number (62) yields the following 62 instructions:
Forwarding=LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(2,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(3,ALU_Index)+Forwarding;
……
Forwarding=LDS_Direct(M0_register)*B(61,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(62,ALU_Index)+Forwarding;
Here, Repeat Enable represents the instruction repetition type, and 0x3 indicates that two operands, Operand0 and Operand1, are repeated; Repeat Counter represents the instruction repetition number, and 62 is the number of repetitions, so 62 instructions are obtained after the compressed instruction is decompressed. It should be noted that repeating Operand0 and Operand1 is only an example. The operands to be repeated may be at least one of the four operands Result (destination operand), Operand0, Operand1 and Operand2, giving 15 combinations. Different repetition types are represented by different values; for example, Repeat Enable (0x1) repeats Operand0, Repeat Enable (0x2) repeats Operand1, and Repeat Enable (0x3) repeats both Operand0 and Operand1.
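The mapping from a Repeat Enable value to the repeated operands can be sketched as a simple bit test. The sketch assumes the relative bit numbering given earlier (bit 0 for Operand0 through bit 3 for the destination); the function and constant names are illustrative.

```python
from typing import List

# Bit positions within the 4-bit Repeat_Enable value, following the text:
# bit 0 -> Operand0, bit 1 -> Operand1, bit 2 -> Operand2, bit 3 -> Result.
REPEAT_BITS = ("Operand0", "Operand1", "Operand2", "Result")


def repeated_operands(repeat_enable: int) -> List[str]:
    """Return the operands whose repeat bit is set in a 4-bit Repeat_Enable."""
    return [name for bit, name in enumerate(REPEAT_BITS)
            if (repeat_enable >> bit) & 1]
```

Each of the 15 non-zero 4-bit values selects a distinct, non-empty subset of the four operands, which corresponds to the 15 combinations mentioned above.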
Because instructions are divided into normal instructions (single instructions) and compressed instructions, the corresponding hardware logic has a normal mode and a repeat mode. Repeat_Count == 0 indicates the normal mode, in which the execution logic obtains instructions from the instruction distribution unit and executes them. Repeat_Count != 0 indicates the repeat mode. When the decoding circuit finishes decompressing the compressed instruction, that is, when Repeat_Count is 0, the decoding circuit switches back to the normal mode.
As an embodiment, the process by which the decoding circuit decompresses the compressed instruction according to the instruction repetition type and the instruction repetition number may be: generating an instruction according to the address ID corresponding to the operand in the instruction repetition type, and updating the instruction repetition number; when the updated instruction repetition number is determined to be greater than a preset threshold, updating the address ID corresponding to the operand; generating an instruction according to the updated address ID corresponding to the operand, and updating the instruction repetition number again; and judging whether the instruction repetition number after this update is equal to the preset threshold. If so, decompression of the compressed instruction is complete, and a plurality of instructions that correspond to the instruction repetition type and whose number equals the instruction repetition number are obtained. Otherwise, the operation is repeated (updating the address ID corresponding to the operand, generating an instruction according to the updated address ID, updating the instruction repetition number again, and judging whether it is equal to the preset threshold) until the updated instruction repetition number is equal to the preset threshold.
The code for this process is represented as follows:
That is, in this embodiment, the first instruction is generated according to the address ID carried in the compressed instruction. In the above example, the instruction "Forwarding=LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding" is generated according to the default address ID (address1) in the compressed instruction, and the instruction repetition number is then updated (to 61). Because the updated instruction repetition number (61) is greater than the preset threshold (e.g., 0), the address ID corresponding to the operand is updated (address2), an instruction is generated according to the updated address ID, and the instruction repetition number is updated again (to 60). It is then judged whether the updated instruction repetition number (60) is equal to the preset threshold. Since it is still greater than the preset threshold, the operation is repeated (updating the address ID corresponding to the operand, generating an instruction according to the updated address ID, updating the instruction repetition number again, and judging whether it is equal to the preset threshold) until the updated instruction repetition number (0) is equal to the preset threshold (e.g., 0). At that point 62 instructions have been obtained, and decompression of the compressed instruction is complete.
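The countdown procedure described above can be sketched in software as follows. This is a behavioral model under stated assumptions, not the hardware listing: an instruction is represented simply by the address ID it was generated from, `update_address` is a hypothetical callback standing in for the address update rule, and the threshold defaults to 0 as in the example.

```python
from typing import Callable, List


def decompress_countdown(repeat_count: int,
                         address_id: int,
                         update_address: Callable[[int], int],
                         threshold: int = 0) -> List[int]:
    """Expand one compressed instruction by counting the repetition number down.

    Returns the address IDs of the generated instructions, in order.
    """
    if repeat_count <= threshold:
        raise ValueError("repetition number must exceed the threshold")
    generated: List[int] = []
    while True:
        generated.append(address_id)      # generate an instruction
        repeat_count -= 1                 # update the repetition number
        if repeat_count == threshold:     # decompression finished
            return generated
        address_id = update_address(address_id)  # still above the threshold
```

For example, `decompress_countdown(62, 1, lambda a: a + 1)` produces 62 entries whose address IDs run from 1 to 62, without any separate counter.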
As another embodiment, the process of the decoding circuit decompressing the compressed instruction according to the instruction repetition type and the instruction repetition number may be: generating an instruction according to an address ID corresponding to an operand in the instruction repetition type, and recording the generation times of the generated instruction; when the generation times are determined to be less than the instruction repetition times, updating the address ID corresponding to the operand; generating an instruction according to the address ID corresponding to the updated operand, and updating the generation times; judging whether the updated generation times are equal to the instruction repetition times or not; if so, completing decompression of the compressed instruction to obtain a plurality of instructions which correspond to the instruction repetition type and have the same number as the instruction repetition times; if not, repeating the operation (updating the address ID corresponding to the operand, generating an instruction according to the address ID corresponding to the updated operand, updating the generation times, judging whether the updated generation times are equal to the instruction repetition times or not), and ending the operation until the updated generation times are equal to the instruction repetition times.
The principle of this embodiment is the same as that of the foregoing embodiment. The difference is that in the first embodiment the instruction repetition number itself is updated after each instruction is generated, and whether decompression is complete is determined by judging whether the updated instruction repetition number is equal to a preset threshold (for example, 0); in this embodiment, a counter records the number of generated instructions, the count is incremented each time an instruction is generated, and whether generation needs to continue is determined by judging whether the recorded count is equal to the instruction repetition number (62).
When a compressed instruction is decompressed, the instruction is issued to the instruction execution unit every time an instruction is generated.
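The counter-based variant, together with the point that each instruction is issued as soon as it is generated, can be sketched as a generator. As before, this is an illustrative model: an instruction is represented by a tuple holding its address ID, and `update_address` is a hypothetical callback for the address update rule.

```python
from typing import Callable, Iterator, Tuple


def decompress_with_counter(repeat_count: int,
                            address_id: int,
                            update_address: Callable[[int], int]
                            ) -> Iterator[Tuple[str, int]]:
    """Yield generated instructions one at a time, tracking a generation count."""
    if repeat_count < 2:
        raise ValueError("repetition number must be at least 2")
    generated = 0
    while True:
        yield ("instr", address_id)   # issued to execution as soon as generated
        generated += 1                # record the generation count
        if generated == repeat_count:
            return                    # decompression finished
        address_id = update_address(address_id)
```

A consumer that iterates over the generator receives each instruction immediately, which mirrors issuing to the instruction execution unit per generated instruction; the repetition number itself is left unchanged.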
The Operand in the instruction repeat type may be at least one of four operands, namely Result, operandd 0, operandd 1 and operandd 2. In updating the address ID corresponding to the operand, in one embodiment, the operand source type (e.g., VGPR/SGPR/LDS _ DIRECT) pointed by the address ID corresponding to the operand may be determined, for example, the rule for updating the address ID corresponding to different operand source types may be different, for example, the rule for updating the address ID corresponding to VGPR for the operand source is different from the rule for updating the address ID corresponding to SGPR for the operand source.
For example, when the Operand source pointed by the ID corresponding to the Operand is VGPR/SGPR, the address ID may be updated based on a rule (Operand _ ID + +, or Result _ ID + +) when the address ID is updated, that is, the updated address is equal to the address before updating plus one. For the sake of easy understanding, taking Operand1 as an example, if Operand1_ ID points to VGPR/SGPR, repeat the following:
that is, if Operand1_ ID points to the Operand source is VGPR/SGPR, and Repeat _ Enable [60] is 1, then the address of Operand1 (Operand1_ ID) is incremented by 1; if 0, the address of operand1 remains unchanged. It should be noted that, in the above example, only address self-increment and increment of 1 are taken as examples, and the rule of address update may also be address self-decrement, in this case, the amplitude may also not be 1, which mainly depends on whether to store data in an incremental manner or a decremental manner, whether to store data continuously, or the like.
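The VGPR/SGPR update rule can be sketched as a single gated step. The sketch uses the relative bit numbering (bit 1 of the 4-bit Repeat_Enable value for Operand1, equivalent to bit 60 in the absolute numbering) and a configurable step to reflect the note that the step may be negative or larger than 1; the function name and parameters are illustrative.

```python
def next_operand_id(operand_id: int,
                    repeat_enable: int,
                    bit_index: int,
                    step: int = 1) -> int:
    """Advance an operand's address ID only if its repeat bit is set.

    bit_index is the operand's position within the 4-bit Repeat_Enable
    value (0 = Operand0, 1 = Operand1, 2 = Operand2, 3 = Result).
    """
    if (repeat_enable >> bit_index) & 1:
        return operand_id + step
    return operand_id
```

With `repeat_enable = 0b0010`, Operand1 (bit 1) advances while Operand0 (bit 0) keeps its address ID.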
When the operand source pointed to by the address ID corresponding to the operand is LDS_DIRECT, the rule for updating the address ID differs from the rule used when the operand source is VGPR/SGPR. In the LDS_DIRECT mode, the hardware reads data from the LDS as an operand, and the access address and data type are determined by a configuration register, such as the M0 register (a 32-bit special hardware-internal register whose lower 16 bits are used as the address by LDS_DIRECT). The 32-bit definition of the M0 register is shown in Table 3.
TABLE 3
Thus, when the source operand comes from LDS_DIRECT, the address field of the M0 register needs to be automatically updated when the address ID is updated. Correspondingly, the address pointed to by the address ID is the address stored in the M0 register, which is used to read the source operand stored in the LDS. That is, the M0 register stores the address at which the source operand (e.g., an element of each row of the matrix) is read from the LDS, and after the corresponding element has been read from the LDS according to the current address, the address in the M0 register needs to be updated to the address of the next element.
In yet another embodiment, in addition to updating the address ID corresponding to the operand according to the operand source type pointed to by the address ID, the address ID may also be updated according to the data type of the data stored in the operand source pointed to by the address ID. Different data types have different address update rules, for example:
Address[i+1] = Address[i] + 0x1; // data type is unsigned byte;
Address[i+1] = Address[i] + 0x2; // data type is unsigned short;
Address[i+1] = Address[i] + 0x4; // data type is DWord;
Address[i+1] = Address[i] + 0x0; // data type is default (reserved);
Address[i+1] = Address[i] + 0x1; // data type is signed byte;
Address[i+1] = Address[i] + 0x2; // data type is signed short;
Address[i+1] = Address[i] + 0x8; // data type is QWord;
Take the case in which the operand source pointed to by the address ID corresponding to the operand is LDS_DIRECT as an example. In this case, the address field of the M0 register is automatically updated, and the data type of the data stored in the LDS is also taken into account; if the data type is unsigned byte, the address is updated according to the rule Address[i+1] = Address[i] + 0x1.
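A minimal sketch of the data-type-dependent update of the M0 address field is given below. It relies only on the statement that the lower 16 bits of M0 hold the LDS_DIRECT address. The stride table mirrors the rules listed above, with the 0x2 stride among the unsigned types taken here to be the unsigned short type; the string keys, the function name, and the 16-bit wrap-around on overflow are assumptions, since the field encoding of Table 3 is not reproduced here.

```python
# Address stride per data type (the key names are illustrative labels only).
STRIDE_BY_DATA_TYPE = {
    "unsigned byte": 0x1,
    "unsigned short": 0x2,
    "DWord": 0x4,
    "default": 0x0,  # reserved: the address does not move
    "signed byte": 0x1,
    "signed short": 0x2,
    "QWord": 0x8,
}


def advance_m0(m0, data_type):
    """Advance the 16-bit address field of a 32-bit M0 value by one element."""
    upper = m0 & 0xFFFF0000                 # upper half is left untouched
    address = m0 & 0x0000FFFF               # lower 16 bits: LDS_DIRECT address
    address = (address + STRIDE_BY_DATA_TYPE[data_type]) & 0xFFFF  # wrap assumed
    return upper | address
```

Calling `advance_m0` once per generated instruction models the automatic update that takes place after each read from the LDS.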
When the operand is a destination operand (Result), it must further be ensured, before the address update is performed, that the operand source type pointed to by the address ID corresponding to the destination operand is not a temporary register used for data pass-through. This can be determined from the destination pass-through DF field. When DF is equal to 1, Result_ID is Forwarding; in this case the address does not need to be updated and Forwarding is maintained. Otherwise, that is, when DF is not 1, the operand source type pointed to by the address ID corresponding to the destination operand is not a temporary register used for data pass-through, and if it is VGPR/SGPR, the address is updated in the foregoing manner.
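The additional check for the destination operand can be sketched as a small guard placed in front of the ordinary increment rule. The function and parameter names are hypothetical; only the behavior (DF equal to 1 keeps the destination unchanged, otherwise a VGPR/SGPR destination advances) is taken from the description.

```python
def next_result_id(result_id, df, source_type, repeat_enabled, step=1):
    """Return the destination address ID for the next generated instruction."""
    if df == 1:
        # The destination is the temporary pass-through (Forwarding) register:
        # the ID is never updated, so the destination stays Forwarding.
        return result_id
    if repeat_enabled and source_type in ("VGPR", "SGPR"):
        return result_id + step
    return result_id
```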
In order to improve efficiency, before obtaining the key information in the compressed instruction, the decoding circuit may further determine whether the compressed instruction is valid, obtain the key information in the compressed instruction only after determining that the compressed instruction is valid, and decompress the compressed instruction according to the instruction repetition type and the instruction repetition number in the key information.
As an embodiment, whether the compressed instruction is valid may be determined according to a repeat enable field characterizing a source operand in the compressed instruction or a repeat enable field characterizing the destination operand in the compressed instruction. The compressed instruction is characterized as valid when the repeat enable field of a source operand is not zero and the address ID corresponding to that source operand points to a specified operand source type (such as VGPR/SGPR/LDS_DIRECT), or when the repeat enable field of the destination operand is not zero. In other words, the compressed instruction is valid if at least one of the following conditions holds:
If (Repeat_Enable[59:59] != 0x0) and Operand0_ID is VGPR/SGPR/LDS_DIRECT;
If (Repeat_Enable[60:60] != 0x0) and Operand1_ID is VGPR/SGPR/LDS_DIRECT;
If (Repeat_Enable[61:61] != 0x0) and Operand2_ID is VGPR/SGPR/LDS_DIRECT;
If (Repeat_Enable[62:62] != 0x0);
That is, if the repeat enable field of at least one source operand is not zero and the corresponding address ID points to a specified operand source type, or if the repeat enable field of the destination operand is not zero, the compressed instruction is valid.
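The validity conditions can be expressed as a single predicate. The sketch below assumes the instruction is available as an integer word and that the source type of each source operand has already been resolved to a string; both the function name and this representation are hypothetical.

```python
SPECIFIED_SOURCE_TYPES = {"VGPR", "SGPR", "LDS_DIRECT"}


def is_valid_compressed(instr_word, source_types):
    """Check validity of a compressed instruction.

    `source_types` lists the operand source types of Operand0, Operand1 and
    Operand2, in that order. Bits 59, 60 and 61 are their repeat-enable
    fields; bit 62 is the repeat-enable field of the destination operand.
    """
    for bit, source_type in zip((59, 60, 61), source_types):
        if (instr_word >> bit) & 1 and source_type in SPECIFIED_SOURCE_TYPES:
            return True
    # The destination repeat-enable field alone is sufficient.
    return bool((instr_word >> 62) & 1)
```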
The above description is given from the perspective of the decoding circuit as a whole. To facilitate understanding of the information exchange between the elements of the decoding circuit, the steps performed by each element are described below. As shown in fig. 1, the decoding circuit includes a decoder (Repeat Decoder) and an instruction decompressing module, which are connected to each other.
The decoder is used for judging whether the obtained instruction is a compression instruction or not, if not, the instruction is sent to the instruction execution unit to execute the instruction, and if so, key information in the compression instruction is obtained.
To improve efficiency, the decoder is optionally further configured to determine that the compressed instruction is valid before obtaining the key information in the compressed instruction. As an embodiment, the decoder is configured to determine whether the compressed instruction is valid according to a repeat enable field characterizing a source operand in the compressed instruction or a repeat enable field characterizing the destination operand in the compressed instruction. The compressed instruction is characterized as valid when the repeat enable field of a source operand is not zero and the address ID corresponding to that source operand points to a specified operand source type, or when the repeat enable field of the destination operand is not zero.
And the instruction decompressing module is used for decompressing the compressed instruction according to the instruction repetition type and the instruction repetition number so as to decompress the compressed instruction into a plurality of instructions which correspond to the instruction repetition type and have the same number as the instruction repetition number.
Optionally, the instruction decompressing module is further configured to send, upon receiving the key information from the decoder, an indication to the decoder that prevents the decoder from obtaining instructions from the instruction distribution unit, and to send, when decompression of the compressed instruction is complete, an indication to the decoder that allows the decoder to obtain instructions from the instruction distribution unit.
Under one implementation, as shown in fig. 3, the instruction decompressing module includes: a controller and an instruction generator. The controller is connected with the instruction generator and the decoder respectively.
In one embodiment, optionally, the controller is configured to obtain an address ID corresponding to an operand in the instruction repeat type; the instruction generator is used for generating an instruction according to the address ID corresponding to the operand in the instruction repetition type; the controller is further used for updating the instruction repetition times after the instruction generator generates the instruction according to the address ID corresponding to the operand in the instruction repetition type, updating the address ID corresponding to the operand when the updated instruction repetition times is determined to be larger than a preset threshold value, and sending the address ID corresponding to the updated operand to the instruction generator; the instruction generator is also used for generating an instruction according to the address ID corresponding to the updated operand; the controller is further used for updating the instruction repetition times again after the instruction generator generates the instruction according to the address ID corresponding to the updated operand, and judging whether the instruction repetition times after updating again is equal to a preset threshold value or not; if yes, decompression of the compressed instruction is completed, and a plurality of instructions which correspond to the instruction repetition type and are the same as the instruction repetition times are obtained.
In another embodiment, the controller is configured to obtain an address ID corresponding to an operand in an instruction repeat type; the instruction generator is used for generating an instruction according to the address ID corresponding to the operand in the instruction repetition type; the controller is further used for recording the generation times of the generated instruction after the instruction generator generates the instruction according to the address ID corresponding to the operand in the instruction repetition type, updating the address ID corresponding to the operand when the generation times are determined to be smaller than the instruction repetition times, and sending the address ID corresponding to the updated operand to the instruction generator; the instruction generator is also used for generating an instruction according to the address ID corresponding to the updated operand; the controller is also used for updating the generation times after the instruction generator generates the instruction according to the address ID corresponding to the updated operand, and judging whether the updated generation times are equal to the instruction repetition times or not; if yes, decompression of the compressed instruction is completed, and a plurality of instructions which correspond to the instruction repetition type and are the same as the instruction repetition times are obtained.
In one embodiment, when updating the address ID corresponding to the operand, the controller is further configured to update the address ID corresponding to the operand according to the operand source type pointed to by the address ID corresponding to the operand.
In one embodiment, when the controller updates the address ID corresponding to the operand, the controller is further configured to update the address ID corresponding to the operand according to the data type of the data stored in the operand source pointed by the address ID corresponding to the operand.
Optionally, the operand in the instruction repetition type is a destination operand, and the key information further includes a destination pass-through DF field. In this case, the controller is further configured to determine, before updating the address ID corresponding to the operand, that the value in the destination pass-through DF field is not a set threshold (such as 1).
Optionally, the controller is further configured to send, during decompression of the compressed instruction, an indication to the decoder that prevents the decoder from obtaining instructions from the instruction distribution unit; during this time the decoder does not obtain instructions from the instruction distribution unit. When decompression is complete, the controller sends an indication to the decoder that allows the decoder to obtain instructions from the instruction distribution unit again. That is, the controller has a normal mode and a Repeat mode. Repeat_Count == 0 indicates the normal mode, in which the decoder is allowed to obtain instructions from the instruction distribution unit for execution. Repeat_Count != 0 indicates the Repeat mode, in which the controller prevents the decoder from obtaining instructions from the instruction distribution unit. When the controller completes decompression of the compressed instruction, that is, when Repeat_Count == 0 again, it switches back to the normal mode.
When the operand source type pointed to by the address ID corresponding to the operand in the instruction repetition type is LDS, the instruction decompressing module further includes a configuration register. The configuration register is used for storing the address at which the source operand is read from the LDS, and the address in the configuration register is automatically updated to the address corresponding to the next source operand after the corresponding source operand has been read from the LDS according to the current address. In this case, when updating the address ID corresponding to the operand, the controller is further configured to update the address ID according to the address currently indicated by the configuration register, so that the address ID is the same as that address. As shown in fig. 4, the instruction decompressing module then includes: a controller, a configuration register (M0 register), and an instruction generator. The controller is connected with the decoder, the instruction generator and the configuration register respectively.
For the functions of the respective components, reference may be made to the corresponding parts of the foregoing description of the decoding circuit as a whole. These parts have been described in detail above and, for brevity, are not repeated here.
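The mode switching of the controller can be modelled as a small state holder whose only state is Repeat_Count. The class and method names below are hypothetical; the sketch assumes the count-down variant, in which the repetition number is decremented after each generated instruction.

```python
class RepeatController:
    """Toy model of the controller's normal mode and Repeat mode."""

    def __init__(self):
        self.repeat_count = 0  # 0 means normal mode

    @property
    def decoder_may_fetch(self):
        # The decoder may obtain instructions only in the normal mode.
        return self.repeat_count == 0

    def start_decompression(self, repetition_number):
        # Receiving key information switches the controller to Repeat mode.
        self.repeat_count = repetition_number

    def on_instruction_generated(self):
        # Each generated instruction consumes one repetition; reaching zero
        # switches the controller back to the normal mode.
        if self.repeat_count > 0:
            self.repeat_count -= 1
```

A decoder model would check `decoder_may_fetch` before each fetch, so no new instruction enters the decoder until the current compressed instruction has been fully expanded.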
In the embodiment of the application, instructions are compressed through VOP3R, so that each cache line (512 bits) can accommodate 512 three-operand instructions, which effectively reduces the probability of instruction cache misses and improves efficiency. For ease of understanding, the method provided by the embodiments of the present application is described below as applied to matrix multiplication. Take 64x64 matrices as an example: C_64x64 = A_64x64 * B_64x64. The matrix size of 64x64 is merely an example and is not limiting. Assume that there are 64 arithmetic logic units (ALUs), each having a VGPR space of 200x64 bit.
The calculation process is roughly as follows:
1) Matrix A is loaded into the LDS in linear mode:
A(0,0) → LDS(Address0); // A(0,0) is stored at Address0 of the LDS;
A(0,1) → LDS(Address1); // A(0,1) is stored at Address1 of the LDS;
A(0,2) → LDS(Address2); // A(0,2) is stored at Address2 of the LDS;
……
2) Matrix B is loaded into the VGPR space as shown in Table 4.
TABLE 4
ALU0 | ALU1 | ALU2 | …… | ALU62 | ALU63
B0,0 | B0,1 | B0,2 | …… | B0,62 | B0,63
B1,0 | B1,1 | B1,2 | …… | B1,62 | B1,63
…… | …… | …… | …… | …… | ……
B63,0 | B63,1 | B63,2 | …… | B63,62 | B63,63
During calculation, the elements of matrix A are loaded one by one into the 64 ALUs in parallel, and each ALU multiplies the loaded element by the corresponding element of the column of matrix B stored in its vector general-purpose registers. The 64 ALUs accumulate, in parallel and in sequence, the products of the elements in the same row of matrix A with the corresponding elements of matrix B, thereby obtaining all elements in the same row of matrix C and completing the multiplication of matrix A and matrix B.
3) Calculate matrix C:
The instructions for calculating matrix C in the conventional mode are as follows:
M0_register = start_address; // initial address of the M0 register. The M0 register stores the address used to read each element of matrix A and is automatically updated to the address of the next element after the 64 ALUs have read, in parallel, the corresponding element of matrix A from the LDS according to the current address in the M0 register.
//-----------------------------------------
// Calculate the first row of matrix C:
// C(0,0) is calculated on ALU_Index0: ALU_Index = 0 (ALU0 calculates C(0,0)).
// C(0,1) is calculated on ALU_Index1: ALU_Index = 1 (ALU1 calculates C(0,1)).
//......
The execution instruction for each ALU to compute a corresponding element in the first row of the matrix C is as follows:
Block_Start::Forwarding=LDS_Direct(M0_register)*B(0,ALU_Index);
Forwarding=LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(2,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(3,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(4,ALU_Index)+Forwarding;
……
Block_End::C(0,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding;
//-----------------------------------------
// Calculate the second row of matrix C:
// C(1,0) is calculated on ALU_Index0 (ALU0 calculates C(1,0)).
// C(1,1) is calculated on ALU_Index1 (ALU1 calculates C(1,1)).
//......
The execution instruction for each ALU to compute a corresponding element in the second row of the matrix C is as follows:
Block_Start::Forwarding=LDS_Direct(M0_register)*B(0,ALU_Index);
Forwarding=LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(2,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(3,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(4,ALU_Index)+Forwarding;
……
Block_End::C(1,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding;
……
//-----------------------------------------
// Calculate the last row of matrix C:
// C(63,0) is calculated on ALU_Index0 (ALU0 calculates C(63,0)).
// C(63,1) is calculated on ALU_Index1 (ALU1 calculates C(63,1)).
//......
The execution instruction for each ALU to compute a corresponding element in the last row of the matrix C is as follows:
Block_Start::Forwarding=LDS_Direct(M0_register)*B(0,ALU_Index);
Forwarding=LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(2,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(3,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(4,ALU_Index)+Forwarding;
……
Block_End::C(63,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding;
The above is the conventional mode, which does not use instruction compression. With the instruction compression method provided by the present application, the above conventional instruction list can be compressed as follows:
M0_register=start_address;
//-----------------------------------------
// Calculate the first row of matrix C:
// C(0,0) is calculated on ALU_Index0: ALU_Index = 0.
// C(0,1) is calculated on ALU_Index1: ALU_Index = 1.
//......
//-----------------------------------------
Block_Start::Forwarding=LDS_Direct(M0_register)*B(0,ALU_Index);
RepeatEnable(0x3),RepeatCounter(62)::
Forwarding=LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding;//Repeat Operand0 and Operand1;
Block_End::C(0,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding;
//-----------------------------------------
// Calculate the second row of matrix C:
// C(1,0) is calculated on ALU_Index0.
// C(1,1) is calculated on ALU_Index1.
//...........
//-----------------------------------------
Block_Start::Forwarding=LDS_Direct(M0_register)*B(0,ALU_Index);
RepeatEnable(0x3),RepeatCounter(62)::
Forwarding=LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding;//Repeat Operand0 and Operand1;
Block_End::C(1,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding;
……
//-----------------------------------------
// Calculate the last row of matrix C:
// C(63,0) is calculated on ALU_Index0.
// C(63,1) is calculated on ALU_Index1.
//...........
//-----------------------------------------
Block_Start::Forwarding=LDS_Direct(M0_register)*B(0,ALU_Index);
RepeatEnable(0x3),RepeatCounter(62)::
Forwarding=LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding;//Repeat Operand0 and Operand1;
Block_End::C(63,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding;
As can be seen from the above, completing C_64x64 = A_64x64 * B_64x64 in the conventional instruction mode requires 64 x 64 = 4096 instructions, whereas only 3 x 64 = 192 instructions are needed after the instruction compression of the present application is applied, so that the efficiency is remarkably improved.
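The instruction counts and the equivalence of the compressed and conventional listings can be checked with a small software model. This is a sketch under simplifying assumptions: every ALU is modelled as one entry of a Python list, the LDS as a flat list holding matrix A row by row, and M0 as an index that advances by one element per read. All function names are hypothetical.

```python
def compressed_row(n):
    """Stored instructions for one row of C: Block_Start, one repeated
    multiply-accumulate (repetition number n - 2), and Block_End."""
    return [("start", 1), ("body", n - 2), ("end", 1)]


def expand(program):
    """Decompress a row program into one entry per issued instruction.
    The row index into B models the VGPR address ID, which advances by one
    for every issued instruction."""
    issued = []
    b_row = 0
    for kind, repeat in program:
        for _ in range(repeat):
            issued.append((kind, b_row))
            b_row += 1
    return issued


def matmul_on_alus(a, b):
    """Compute C = A * B for n x n matrices the way the listing does and
    return (C, stored instruction count, issued instruction count)."""
    n = len(a)
    lds = [x for row in a for x in row]  # matrix A stored linearly in the LDS
    m0 = 0                               # models the M0 address field
    c = []
    stored = 0
    issued_total = 0
    for _ in range(n):
        program = compressed_row(n)
        stored += len(program)
        steps = expand(program)
        issued_total += len(steps)
        forwarding = [0] * n             # one Forwarding register per ALU
        for _kind, b_row in steps:
            a_elem = lds[m0]             # LDS_DIRECT read shared by all ALUs
            m0 += 1                      # M0 is updated automatically
            forwarding = [a_elem * b[b_row][alu] + forwarding[alu]
                          for alu in range(n)]
        c.append(forwarding)             # Block_End writes the row of C
    return c, stored, issued_total
```

With n = 64 this model stores 3 x 64 = 192 instructions and issues 64 x 64 = 4096, matching the counts stated above; the model requires n of at least 4 so that the repetition number n - 2 is at least 2.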
A data processing method according to an embodiment of the present application is described below with reference to fig. 5.
Step S101: Determine whether the obtained instruction is a compressed instruction.
If yes, step S102 is executed, and if no, the acquired instruction is sent to the instruction execution unit.
Step S102: obtaining key information in the compression instruction, wherein the key information comprises: an instruction repeat type and an instruction repeat number.
The instruction repetition type is used for indicating the type of the instruction to be repeated, and the instruction repetition number is a positive integer greater than or equal to 2.
Step S103: decompressing the compressed instruction according to the instruction repetition type and the instruction repetition number so as to decompress the compressed instruction into a plurality of instructions which correspond to the instruction repetition type and have the same number as the instruction repetition number.
Optionally, before obtaining the key information in the compression instruction, the method further includes: determining that the compress instruction is valid.
In an embodiment, the process of decompressing the compressed instruction according to the instruction repetition type and the instruction repetition number may be: generating an instruction according to the address ID corresponding to the operand in the instruction repetition type, and updating the instruction repetition times; updating the address ID corresponding to the operand when the updated instruction repetition times are determined to be larger than a preset threshold; generating an instruction according to the address ID corresponding to the updated operand, and updating the instruction repetition times again; judging whether the instruction repetition times after being updated again is equal to the preset threshold value or not; if yes, determining that decompression of the compressed instruction is finished, and obtaining a plurality of instructions which correspond to the instruction repetition types and are the same as the instruction repetition times.
In another embodiment, the process of decompressing the compressed instruction according to the instruction repetition type and the instruction repetition number may be: generating an instruction according to the address ID corresponding to the operand in the instruction repetition type, and recording the number of instructions generated; when the number of instructions generated is determined to be less than the instruction repetition number, updating the address ID corresponding to the operand; generating an instruction according to the updated address ID corresponding to the operand, and updating the number of instructions generated; judging whether the updated number of instructions generated is equal to the instruction repetition number; if yes, determining that decompression of the compressed instruction is finished, and obtaining a plurality of instructions which correspond to the instruction repetition type and whose number is the same as the instruction repetition number.
Optionally, the process of updating the address ID corresponding to the operand may be: updating the address ID corresponding to the operand according to the operand source type pointed to by the address ID corresponding to the operand.
Optionally, the process of updating the address ID corresponding to the operand may also be: updating the address ID corresponding to the operand according to the data type of the data stored in the operand source pointed to by the address ID corresponding to the operand.
Optionally, the operand in the instruction repetition type is a destination operand, and the key information further includes a destination pass-through DF field. Before updating the address ID corresponding to the operand, the method further includes: determining that the value in the destination pass-through DF field is not a set threshold.
The method provided by the embodiment of the present application has the same implementation principle and technical effect as the foregoing device embodiment. For brevity, where the method embodiment does not mention a detail, reference may be made to the corresponding content in the foregoing device embodiment.
The embodiment of the application also provides a processor, as shown in fig. 6. The processor includes the decoding circuit of any one of the above embodiments, an instruction execution unit, and an instruction distribution unit. The instruction distribution unit and the instruction execution unit are both connected with the decoding circuit. The instruction distribution unit is used for storing instructions so that the decoding circuit can obtain them from the instruction distribution unit. The instruction execution unit is used for executing the instructions issued by the decoding circuit.
The processor may be an integrated circuit chip having signal processing capabilities. The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), a Graphics Processing Unit (GPU), and the like; a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.