CN118626145A - Instruction conversion method and device and related equipment - Google Patents

Instruction conversion method and device and related equipment

Info

Publication number
CN118626145A
CN118626145A
Authority
CN
China
Prior art keywords
target
basic block
scalar instructions
expanded
scalar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410851860.XA
Other languages
Chinese (zh)
Inventor
包赵泠
赖庆宽
康梦博
潘治
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd filed Critical Haiguang Information Technology Co Ltd
Priority to CN202410851860.XA priority Critical patent/CN118626145A/en
Publication of CN118626145A publication Critical patent/CN118626145A/en
Pending legal-status Critical Current

Landscapes

  • Devices For Executing Special Programs (AREA)
  • Complex Calculations (AREA)

Abstract

Embodiments of the invention provide an instruction conversion method, an instruction conversion apparatus, and related equipment. The method includes: determining a target basic block, where the target basic block is determined based on a source program and includes a plurality of target scalar instructions with consecutive addresses; determining, at least according to the number of target scalar instructions, a target loop expansion count corresponding to the target basic block; expanding the target basic block based on the target loop expansion count to obtain an expanded basic block, where the expanded basic block includes a plurality of expanded target scalar instructions whose number is an integer multiple of the number of vectorizable scalar instructions supported by a processor; and vectorizing the expanded basic block to obtain converted vector instructions. The technical solution provided by the embodiments of the invention reduces the restrictions on vectorizing scalar instructions and thereby increases the number of scalar instructions that are converted into vector instructions.

Description

Instruction conversion method and device and related equipment
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to an instruction conversion method, an instruction conversion device and related equipment.
Background
Single Instruction Multiple Data (SIMD) is a form of parallel computing in which one instruction operates on multiple data elements simultaneously; it is intended to exploit the data-level parallelism of programs in the field of data processing. Automatic vectorization (auto-vectorization) by a compiler is one SIMD-based vector-mining method: it automatically converts scalar instructions into vector instructions, which can effectively improve the parallel data-processing capability of data-processing devices such as processors.
In this context, how to provide a technical solution that reduces the restrictions on vectorizing scalar instructions, and thereby increases the number of scalar instructions converted into vector instructions, is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the embodiments of the present invention provide an instruction conversion method, apparatus and related device, which reduce the restriction on vectorization of scalar instructions, thereby increasing the number of conversion of scalar instructions into vector instructions.
In order to achieve the above purpose, the embodiment of the present invention provides the following technical solutions.
In a first aspect, an embodiment of the present invention provides an instruction conversion method, including:
Determining a target basic block, the target basic block determined based on a source program, and the target basic block comprising a plurality of target scalar instructions;
Determining a target loop expansion count corresponding to the target basic block at least according to the number of the target scalar instructions;
Expanding the target basic block based on the target loop expansion count to obtain an expanded basic block, wherein the expanded basic block comprises a plurality of expanded target scalar instructions, and the number of the plurality of expanded target scalar instructions is an integer multiple of the number of vectorizable scalar instructions supported by a processor;
and vectorizing the expanded basic block to obtain converted vector instructions.
In a second aspect, an embodiment of the present invention provides an instruction converting apparatus, including:
A target basic block determination module for determining a target basic block, the target basic block being determined based on a source program and the target basic block comprising a plurality of target scalar instructions;
A loop expansion count determination module, configured to determine a target loop expansion count corresponding to the target basic block at least according to the number of the target scalar instructions;
A loop expansion module, configured to expand the target basic block based on the target loop expansion count to obtain an expanded basic block, wherein the expanded basic block comprises a plurality of expanded target scalar instructions, and the number of the plurality of expanded target scalar instructions is an integer multiple of the number of vectorizable scalar instructions supported by a processor;
And a vector instruction generation module, configured to vectorize the expanded basic block to obtain converted vector instructions.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory storing a program and a processor calling the program stored in the memory to execute the instruction converting method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a storage medium storing a program that when executed implements the instruction conversion method according to the first aspect.
In a fifth aspect, an embodiment of the present invention provides a computer program product comprising a computer program which, when executed by a processor, implements the instruction conversion method according to the first aspect.
An embodiment of the invention provides an instruction conversion method. First, a target basic block is determined, where the target basic block is determined based on a source program and includes a plurality of target scalar instructions. Then, a target loop expansion count corresponding to the target basic block is determined at least according to the number of target scalar instructions. The target basic block is expanded based on the target loop expansion count to obtain an expanded basic block, where the expanded basic block includes a plurality of expanded target scalar instructions whose number is an integer multiple of the number of vectorizable scalar instructions supported by a processor. Finally, the expanded basic block is vectorized to obtain converted vector instructions.
When automatic vectorization of scalar instructions is implemented with the instruction conversion method provided by the embodiment of the invention, the target loop expansion count corresponding to the target basic block is calculated at least according to the number of target scalar instructions included in the target basic block, and each target scalar instruction of the target basic block is loop-expanded according to the target loop expansion count, so that the number of expanded target scalar instructions in the expanded basic block is an integer multiple of the number of vectorizable scalar instructions supported by the processor. Each expanded target scalar instruction in the expanded basic block can therefore be grouped according to the number of vectorizable scalar instructions and combined into a plurality of vector instructions, so that vectorization of the expanded target scalar instructions is achieved. Since the number of expanded target scalar instructions in the expanded basic block is an integer multiple of the number of vectorizable scalar instructions, the target basic block can be vectorized after expansion, the restrictions on vectorizing scalar instructions are reduced, and the number of scalar instructions converted into vector instructions is increased.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an implementation of automatic vectorization according to an embodiment of the present invention;
FIG. 2 is a flow chart of an instruction converting method according to an embodiment of the invention;
FIG. 3 is a schematic flow chart of an instruction converting method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a structure of the target basic block;
FIG. 5 is a schematic diagram showing a result of processing the target basic block shown in FIG. 4 according to the instruction converting method of the present invention;
FIG. 6 is a flow chart of automatic vectorization corresponding to the instruction conversion method according to the embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an instruction converting apparatus according to an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
SIMD instruction sets are widely adopted because modern processors need to improve data-processing performance and power efficiency. However, writing code that makes efficient use of a SIMD instruction set is difficult, and the SIMD instruction sets used or supported on different platforms also differ. Where a processor supports the execution of vector instructions, compiler-based automatic vectorization is one solution for producing code that efficiently utilizes the SIMD instruction set.
Compiler automatic vectorization is largely divided into loop vectorization and superword-level parallelism (SLP, Superword Level Parallelism) vectorization.
The loop vectorization scheme is mainly applied to loop structures formed by loop statements in a source program, and in particular to loops with a fixed iteration count. Loop vectorization focuses on vectorization opportunities across loop iterations, reducing the number of iterations of the loop while increasing the amount of data computed in a single iteration.
The SLP vectorization scheme is instead primarily directed at a single basic block (Basic Block), where a basic block is a code segment formed by a group of scalar instructions with potential vectorization opportunities and containing no control-flow entry or exit inside the block. SLP vectorization focuses on vectorization opportunities within one iteration (within a basic block), combining similar scalar computation/memory operations in a single iteration into one vector instruction, i.e., reducing the number of instructions generated within a single iteration.
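As a hedged illustration (not taken from the patent text, with hypothetical array names), the following C fragment sketches the kind of basic block that SLP vectorization targets: several independent scalar operations of the same type on consecutive addresses within one iteration, which a compiler could merge into a single vector instruction.

    /* Illustrative sketch only: a straight-line basic block containing four
     * independent, same-typed scalar additions on consecutive addresses.
     * An SLP pass could merge these four scalar instructions into one
     * 4-wide vector add. */
    void slp_candidate(float *dst, const float *a, const float *b) {
        dst[0] = a[0] + b[0];
        dst[1] = a[1] + b[1];
        dst[2] = a[2] + b[2];
        dst[3] = a[3] + b[3];
    }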
The process of automatic vectorization of a source program by a compiler may refer to fig. 1, and fig. 1 is a schematic flow chart of an implementation of automatic vectorization according to an embodiment of the present invention.
Compiler automatic vectorization includes two vectorization analysis and processing procedures (PASS):
The first PASS is the loop vectorization analysis and processing (the loop vectorization PASS).
After the loop vectorization PASS is completed, the second PASS is performed.
The second PASS is the SLP vectorization analysis and processing (the SLP vectorization PASS).
The implementation flow of automatic vectorization can specifically refer to fig. 1, and mainly includes the following steps.
First, the loop vectorization PASS:
Step 1: the basic blocks divided based on the source program are traversed.
Step 2: and carrying out data dependency analysis on each traversed basic block.
When performing data dependency analysis, the execution order of the scalar instructions of each basic block is determined, and it is determined whether the scalar instructions of each basic block depend on one another. For basic blocks whose scalar instructions do not depend on one another, it is then determined whether the operation types of the scalar instructions are the same and whether the addresses of the operands (including source operands and destination operands) are consecutive. In this way, a group of scalar instructions with potential vectorization opportunities is identified, and the basic block formed by this group of scalar instructions is determined to be a basic block that can be automatically vectorized.
Step 3: and carrying out SLP vectorization feasibility analysis on the basic block for completing the data dependence analysis.
After the basic blocks of the source program are divided, the basic blocks are traversed, and the feasibility analysis of SLP vectorization is carried out on the basic blocks which successfully pass the data dependency analysis.
In the SLP vectorization feasibility analysis of a basic block, the main question is whether the number of scalar instructions in the basic block satisfies the number of vectorizable scalar instructions supported by the processor (the number of vectorizable scalar instructions is, for example, 2^n, where n may take values such as 0, 1, 2, and so on).
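A minimal sketch of the two checks implied above, assuming the group size S is a power of two; the function names are assumptions, not the patent's:

    #include <stdbool.h>

    /* A block is SLP-feasible when it holds at least one full vector group of
     * size S (S is typically a power of two, e.g. 2, 4, 8); it can be fully
     * vectorized only when its instruction count is an integer multiple of S. */
    bool slp_feasible(unsigned num_scalar, unsigned s) {
        return num_scalar >= s;
    }

    bool fully_vectorizable(unsigned num_scalar, unsigned s) {
        return num_scalar % s == 0;
    }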
After it is determined that the basic block can undergo SLP vectorization, the SLP vectorization PASS may be performed directly, i.e., step 6 is performed: the SLP vectorization PASS. Here, PASS denotes a vectorization analysis and processing procedure.
After it is determined that the basic block is not feasible for SLP vectorization, step 4 is performed: loop vectorization feasibility analysis of the basic block.
In step 4, the loop vectorization feasibility analysis analyzes whether the scalar instructions in the basic block can be loop-vectorized based on the source program. In loop vectorization, it is mainly determined whether a vectorizable loop exists in the source program, and vectorization is then performed mainly on that loop.
Step 5: the basic block is circularly vectorized.
Loop vectorization PASS execution ends, followed by SLP vectorization PASS:
step 6: SLP vectorizes PASS.
After the SLP vectorization PASS is completed, the whole process of automatic vectorization processing of the compiler is completed.
The SLP vectorization feasibility analysis in the cyclic vectorization PASS is not an SLP vectorization PASS, and is a step in the vectorization analysis and processing procedure of the cyclic vectorization PASS.
However, in the implementation process of the automatic vectorization, the following problems exist:
On the one hand, in the implementation of loop vectorization, when there is a strong dependency in the longitudinal operation logic, that is, when part of the operations in the next loop iteration must wait for the results of the previous loop iteration, the loop cannot be loop-vectorized. Likewise, if some of the operations in a basic block are not of the same type, the basic block cannot undergo SLP vectorization. Because the automatic vectorization technique optimizes the source program in units of an entire basic block or an entire loop basic block (a basic block on which SLP cannot be performed), if part of a basic block is not vectorizable, the correctness analysis cannot pass, and the scalar instructions in the basic block therefore cannot be vectorized.
On the other hand, even when the SLP vectorization conditions are satisfied, if the number of scalar instructions (elements) in the basic block is not a power of two, it does not match the number of vectorizable scalar instructions, and vectorization cannot be performed directly. In this case, the compiler vectorizes the block in a split manner.
For example, if a basic block includes 3 scalar instructions, the compiler splits them into a 2+1 combination of scalar instructions: the group of 2 scalar instructions is vectorized, while the remaining scalar instruction continues to be processed as a scalar. This also prevents the SLP vectorization approach from fully exploiting the computational performance of vector instructions.
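A hedged sketch (hypothetical arrays) of the 2+1 split described above, for a block of 3 scalar additions when the processor supports 2-wide vector groups:

    /* Two statements are merged into what would become one 2-wide vector add,
     * while the third statement remains a scalar instruction. */
    void split_2_plus_1(double t[3], const double a[3]) {
        /* vectorized pair: conceptually a single 2-wide vector instruction */
        t[0] += a[0];
        t[1] += a[1];
        /* leftover element: still processed as a scalar instruction */
        t[2] += a[2];
    }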
Based on this, an embodiment of the invention provides an instruction conversion method for achieving full and complete vectorization of the scalar instructions in a basic block, increasing the number of scalar instructions converted into vector instructions, thereby maximizing the computational performance obtained from vector instructions and improving the execution efficiency of the whole program.
Referring to fig. 2, fig. 2 is a flow chart of an instruction converting method according to an embodiment of the invention.
As shown in fig. 2, the process may include the steps of:
In step S100, a target basic block is determined, the target basic block being determined based on a source program, and the target basic block comprising a plurality of target scalar instructions.
The target basic block may be obtained by partitioning the source program, and serves as the basis for the subsequent analysis of whether SLP vectorization or loop vectorization is feasible.
When dividing a source program into basic blocks, the division may be performed according to the program instructions of the source program. First, the entries of basic blocks are found among the program instructions of the source program; the entry of a basic block can be one of three types: (1) the first instruction of the code segment; (2) the target statement of a conditional or unconditional jump; (3) the statement following a conditional jump statement. The source program is then divided into a plurality of basic blocks according to these entries, where the range of one basic block starts at its entry and ends where the entry of the next basic block is encountered, as sketched below.
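The following illustrative C fragment (not taken from the patent) is annotated with the three kinds of basic-block entries listed above:

    int classify(int x) {
        int r = 0;          /* entry type 1: first instruction of the code segment */
        if (x > 0)
            goto positive;  /* conditional jump */
        r = -1;             /* entry type 3: statement following the conditional jump */
        return r;
    positive:               /* entry type 2: target statement of the jump */
        r = 1;
        return r;
    }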
When automatic vectorization is performed, a basic block whose scalar instructions have consecutive addresses and execute independently of one another is selected from the plurality of basic blocks and used as the basis for determining the target basic block.
Step S101, determining a target loop expansion number corresponding to the target basic block at least according to the number of the target scalar instructions.
As can be seen from the automatic vectorization flow shown in fig. 1, when a basic block is automatically vectorized, the loop vectorization PASS first analyzes whether it is feasible for SLP vectorization, and if so, the SLP vectorization PASS is performed directly. However, the basic blocks that are feasible for SLP vectorization also include basic blocks whose number of scalar instructions is not an integer multiple of the number of vectorizable scalar instructions supported by the processor.
For example, when the number of vectorizable scalar instructions is 2 and a basic block contains 3 scalar instructions, the 3 scalar instructions are split before vectorization, and one of the resulting instructions remains a scalar instruction, so the scalar instructions in the basic block cannot be fully vectorized.
For a basic block that is not feasible for SLP vectorization, a loop vectorization feasibility analysis is performed. However, the loop vectorization PASS operates on the entire source program. Consequently, if the execution of the scalar instructions of a basic block in one loop Q depends on the execution results of basic blocks in another loop P, the basic block in loop Q cannot be loop-vectorized. That is, among the basic blocks that are not feasible for SLP vectorization, there are also cases where part of a basic block cannot be vectorized.
Therefore, the target basic block may suffer from insufficient vectorization of its scalar instructions in both SLP vectorization and loop vectorization. To address this problem, in the embodiment of the invention the determined target basic block (whether it satisfies SLP vectorization or loop vectorization) is directly loop-expanded according to the target loop expansion count, so that the number of expanded target scalar instructions in the expanded basic block is an integer multiple of the number of vectorizable scalar instructions. After expansion of the target basic block, all of the expanded target scalar instructions in the resulting expanded basic block can be vectorized, which increases the number of scalar instructions converted into vector instructions.
The target loop unroll times are used to replicate each target scalar instruction of the target basic block a corresponding number of times.
For example, suppose the target loop expansion count is 2 and the target basic block includes a first target scalar instruction, a second target scalar instruction, and a third target scalar instruction. Each target scalar instruction in the target basic block is then copied according to the target loop expansion count of 2, i.e., each target scalar instruction is loop-expanded. The resulting expanded target scalar instructions include 2 copies of the first target scalar instruction, 2 copies of the second target scalar instruction, and 2 copies of the third target scalar instruction, for a total of 6 expanded target scalar instructions.
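A generic sketch of this replication, with hypothetical arrays x, y, z standing in for the three target scalar instructions; in practice each copy operates on the next loop iteration's index:

    /* Expanding a 3-instruction basic block by a factor of 2: every target
     * scalar instruction is copied once, giving 2 * 3 = 6 expanded scalar
     * instructions. */
    void expanded_block(int x[], int y[], int z[], int k) {
        /* original iteration k */
        x[k] += 1;      /* first target scalar instruction  */
        y[k] += 2;      /* second target scalar instruction */
        z[k] += 3;      /* third target scalar instruction  */
        /* copied iteration k + 1 (expansion count 2) */
        x[k + 1] += 1;
        y[k + 1] += 2;
        z[k + 1] += 3;
    }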
Step S102: expanding the target basic block based on the target loop expansion count to obtain an expanded basic block.
The expanded basic block includes a plurality of expanded target scalar instructions, the number of which is an integer multiple of the number of vectorizable scalar instructions supported by the processor.
Vectorization may be regarded as the process of dividing the scalar instructions in a basic block into groups of the corresponding size, according to the number of vectorizable scalar instructions, and combining each group. Therefore, when the number of scalar instructions in the target basic block is not an integer multiple of the number of vectorizable scalar instructions, full vectorization is not possible.
Note that a target basic block whose instruction count is not an integer multiple of the number of vectorizable scalar instructions may be a basic block in which the number of target scalar instructions is less than the number of vectorizable scalar instructions. For example, the number of vectorizable scalar instructions is 4 while the number of scalar instructions in the basic block is 3.
Alternatively, the target basic block may be a basic block in which the number of target scalar instructions is not less than the number of vectorizable scalar instructions but is not an integer multiple of it. For example, the number of vectorizable scalar instructions is 4 and the number of scalar instructions in the basic block is 6 (in this case the target basic block is a basic block whose number of target scalar instructions is greater than the number of vectorizable scalar instructions).
Step S103: vectorizing the expanded basic block to obtain converted vector instructions.
Since, in the expanded basic block, the number of expanded target scalar instructions is an integer multiple K of the number S of vectorizable scalar instructions supported by the processor, the expanded target scalar instructions can be vectorized according to the number of vectorizable scalar instructions and divided into K vector instructions, each vector instruction comprising S expanded target scalar instructions.
Therefore, when automatic vectorization of scalar instructions is implemented with the instruction conversion method provided by the embodiment of the invention, the target loop expansion count corresponding to the target basic block is calculated at least according to the number of target scalar instructions included in the target basic block, and each target scalar instruction of the target basic block is loop-expanded according to the target loop expansion count, so that the number of expanded target scalar instructions in the expanded basic block is an integer multiple of the number of vectorizable scalar instructions supported by the processor. Each expanded target scalar instruction in the expanded basic block can therefore be grouped according to the number of vectorizable scalar instructions and combined into a plurality of vector instructions, achieving vectorization of the expanded target scalar instructions. In this way, the target basic block can be vectorized after expansion, the restrictions on vectorizing scalar instructions are reduced, and the number of scalar instructions converted into vector instructions is increased.
To ensure that the number of target scalar instructions after loop expansion is an integer multiple of the number of vectorizable scalar instructions, so that the target scalar instructions in the target basic block are fully vectorized, in one embodiment the target loop expansion count can be calculated precisely according to the actual configuration of the processor in combination with the number of target scalar instructions in the target basic block.
Referring to fig. 3, fig. 3 is another flow chart of the instruction converting method according to the embodiment of the invention.
As shown in fig. 3, the method comprises the steps of:
step S200, obtaining configuration parameters of the processor.
The configuration parameters of the processor are determined according to the specific functional implementation of the processor, and different processors have different configuration parameters.
Step S201, determining the number of vectorizable scalar instructions supported by a processor according to the configuration parameters of the processor.
The number of vectorizable scalar instructions supported by the processor is determined based on the processor specific functional implementation, and the configuration parameters of the processor correspond to the processor specific functional implementation. Thus, the number of vectorizable scalar instructions supported by the processor may be derived based on the configuration parameters of the processor to facilitate subsequent determination of the loop unrolling times.
For example, if the configuration parameters of the processor specify that the bit width of the vector register is 128 bits, the number of vectorizable scalar instructions supported by the processor may be 2, i.e., two 64-bit scalar instructions are combined into one vector instruction. Of course, it is also possible to combine four 32-bit scalar instructions into one vector instruction, in which case the number of vectorizable scalar instructions supported by the processor is 4.
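A hedged sketch of this relationship (the parameter and function names are assumptions, not the patent's):

    /* The number of vectorizable scalar instructions S follows from the vector
     * register width and the scalar element width. */
    unsigned vectorizable_count(unsigned vector_reg_bits, unsigned element_bits) {
        return vector_reg_bits / element_bits;  /* e.g. 128 / 64 = 2, 128 / 32 = 4 */
    }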
In step S202, a target basic block is determined, the target basic block being determined based on a source program, and the target basic block comprising a plurality of target scalar instructions.
The target basic block includes a plurality of target scalar instructions, and each target scalar instruction has the same operation type. For example, the target basic block includes 3 target scalar instructions, which can be expressed as t[j] += a[i][j], with j = 1, 2, 3 and i being any value greater than or equal to 1. Here, t[j] is a destination operand belonging to a one-dimensional array, a[i][j] is a source operand belonging to a two-dimensional array, and t[j] += a[i][j] denotes t[j] = t[j] + a[i][j]. The addresses of the target scalar instructions are consecutive, e.g., a[i][1], a[i][2] and a[i][3] are consecutive, and t[1], t[2] and t[3] are consecutive, and the target scalar instructions execute independently of one another.
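A hedged C reconstruction of this target basic block, using 1-based indexing to mirror the patent's notation (array bounds are illustrative only):

    /* Three scalar instructions t[j] += a[i][j], j = 1..3, with consecutive
     * destination and source addresses, inside a loop over i. */
    void target_block(double t[4], double a[][4], int rows) {
        for (int i = 1; i <= rows; ++i) {
            t[1] += a[i][1];    /* first target scalar instruction  */
            t[2] += a[i][2];    /* second target scalar instruction */
            t[3] += a[i][3];    /* third target scalar instruction  */
        }
    }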
In the embodiment of the invention, after traversal and data dependency analysis of the basic blocks, both the basic blocks that can undergo the SLP vectorization PASS but cannot achieve complete vectorization of their scalar instructions, and the basic blocks that cannot undergo the SLP vectorization PASS, are taken as target basic blocks. The target basic blocks are then loop-expanded, so that each target basic block is effectively turned into a basic block that can be completely vectorized (namely, the expanded basic block).
To accurately determine the target basic block and achieve full vectorization of the scalar instructions of the basic block, in one embodiment step S202 may include:
Acquiring a source program, wherein the source program comprises a plurality of basic blocks, one basic block comprises a plurality of scalar instructions, and the basic blocks are positioned in a loop of the source program; determining a candidate basic block from the basic blocks, wherein the candidate basic block comprises a plurality of continuous scalar instructions, and the operation types of the scalar instructions are the same; the target basic block is determined from the candidate basic blocks based on the number of scalar instructions for the candidate basic block.
Based on the foregoing, it can be known that the basic block is obtained by dividing the source program according to different kinds of entries. There may be basic blocks without vectorization potential, that is, the operation types of scalar instructions in the basic blocks are different, and the addresses of the scalar instructions are also discontinuous, so that the basic blocks have no vectorization potential and cannot be vectorized automatically.
By performing traversal and data dependency analysis on each basic block, a basic block including a plurality of scalar instructions having consecutive addresses and the same operation type can be determined from among a plurality of basic blocks as a candidate basic block. The candidate basic block is a basic block that can be automatically vectorized.
According to the number of scalar instructions of the candidate basic blocks, the target basic block according to the embodiment of the invention is determined.
In one embodiment, the determining the target basic block from the candidate basic blocks based on the number of scalar instructions of the candidate basic block includes:
a candidate basic block whose number of scalar instructions satisfies the number of vectorizable scalar instructions and is not an integer multiple of the number of vectorizable scalar instructions is determined as a target basic block.
These are candidate basic blocks whose number of scalar instructions satisfies the number of vectorizable scalar instructions but is not an integer multiple of it; here, satisfying the number of vectorizable scalar instructions means that the number of scalar instructions of the candidate basic block is not less than the number of vectorizable scalar instructions. For example, when the number of vectorizable scalar instructions is S = 2 and the number of scalar instructions of the candidate basic block is N = 3, the candidate basic block can be vectorized. In this case, however, 2 of the scalar instructions are typically combined into one vector instruction while the remaining scalar instruction is retained (i.e., the scalar instructions cannot be fully vectorized). Such split processing may leave the scalar instructions of the candidate basic block insufficiently vectorized, so that the hardware computing power is not effectively exploited.
Therefore, in the embodiment of the invention, candidate basic blocks of this type are determined as target basic blocks, so that the scalar instructions of the target basic block are completely vectorized through processing with the target loop expansion count.
In other embodiments, the determining the target basic block from the candidate basic blocks based on the number of scalar instructions for the candidate basic block includes:
And determining candidate basic blocks, the number of which does not meet the number of the vectorizable scalar instructions, as target basic blocks.
A candidate basic block whose number of scalar instructions does not satisfy the number of vectorizable scalar instructions is, for example, one with 3 scalar instructions when the number of vectorizable scalar instructions is S = 4: the 3 scalar instructions cannot be merged into 1 vector instruction of 4. Such a candidate basic block therefore cannot be vectorized, none of its scalar instructions can be vectorized, program performance is reduced, the program running speed is affected, and the hardware computing power cannot be exploited well.
Therefore, in the embodiment of the invention, candidate basic blocks of this type are determined as target basic blocks. After expansion with the target loop expansion count, the number of expanded target scalar instructions becomes an integer multiple of the number of vectorizable scalar instructions, so that complete vectorization of candidate basic blocks of this type can be achieved, program performance is improved, the program running speed is increased, and the hardware computing power is exploited better.
Step S203, determining a target loop expansion number corresponding to the target basic block according to the number of the target scalar instructions and the number of the vectorizable scalar instructions.
The SLP vectorization feasibility analysis is carried out based on the number of vectorizable scalar instructions. Consequently, there are target basic blocks that can be vectorized but not fully vectorized (e.g., a target basic block including 3 scalar instructions when the number of vectorizable scalar instructions is 2), and target basic blocks that cannot undergo SLP vectorization at all (e.g., a target basic block including 3 scalar instructions when the number of vectorizable scalar instructions is 4).
Therefore, in order to enable the target scalar instruction in the target basic block to fully implement vectorization, in the embodiment of the present invention, the target loop expansion number is calculated in combination with the number of vectorizable scalar instructions and the number of target scalar instructions included in the target basic block, so that in the basic block after loop expansion, the number of target scalar instructions after expansion is an integer multiple of the number of vectorizable scalar instructions.
For example, if the number S of vectorizable scalar instructions is 2 and the number N of target scalar instructions of the target basic block is 3, then after loop expansion by the target loop expansion count X (for example, X = 2), the number of expanded target scalar instructions obtained is N × X = 3 × 2 = 6, which is 3 times the number S of vectorizable scalar instructions. The expanded target scalar instructions can therefore all be vectorized, yielding 3 converted vector instructions, each formed from 2 expanded target scalar instructions.
In one embodiment, step S203 may include the steps of:
Determining the least common multiple of the number of the plurality of target scalar instructions and the number of vectorizable scalar instructions; dividing the least common multiple by the number of the target scalar instructions, and taking the result as the target loop expansion count corresponding to the target basic block.
For example, if the number N of target scalar instructions of the target basic block is 3 and the number S of vectorizable scalar instructions is 4, the least common multiple is 12; dividing the least common multiple 12 by the number of target scalar instructions N then yields the target loop expansion count X = 4.
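A minimal sketch of this computation (function names are assumptions): the target loop expansion count X = lcm(N, S) / N, where N is the number of target scalar instructions and S is the number of vectorizable scalar instructions.

    unsigned gcd(unsigned a, unsigned b) {
        while (b != 0) {
            unsigned t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    unsigned target_expansion_count(unsigned n, unsigned s) {
        unsigned lcm = n / gcd(n, s) * s;   /* least common multiple of N and S */
        return lcm / n;                     /* e.g. N = 3, S = 4 -> lcm = 12, X = 4 */
    }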
With continued reference to fig. 3, the method further includes:
Step S204: expanding the target basic block based on the target loop expansion count to obtain an expanded basic block.
The description continues with the calculated target loop expansion count X = 4 as an example.
The 3 target scalar instructions in the target basic block are loop-expanded X = 4 times, so the expanded basic block comprises 3 × 4 = 12 expanded target scalar instructions, which is 3 times S = 4. Therefore, according to the number S = 4 of vectorizable scalar instructions, the expanded target scalar instructions can be divided and combined into 3 vector instructions, with every 4 expanded target scalar instructions grouped into one vector instruction. This completes the vectorization of all expanded target scalar instructions in the expanded basic block, i.e., all target scalar instructions of the target basic block are vectorized.
In one embodiment, step S204 may include:
Loop-expanding the destination operands of each target scalar instruction based on the target loop expansion count to obtain expanded destination operands, wherein the number of expanded destination operands is an integer multiple of the number of vectorizable scalar instructions; loop-expanding the source operands of each target scalar instruction based on the target loop expansion count to obtain expanded source operands, wherein the number of expanded source operands is an integer multiple of the number of vectorizable scalar instructions; and forming the expanded target scalar instructions from the expanded destination operands and the expanded source operands to obtain the expanded basic block.
For convenience of description, assume that the number of vectorizable scalar instructions is S = 2, the number of target scalar instructions is N = 3, and the target basic block is BB1, where the target scalar instructions of BB1 are: total[1] += a[i][1], total[2] += a[i][2], total[3] += a[i][3].
According to the manner of calculating the target loop expansion count X described above, the target loop expansion count in this case is X = 2.
In the target scalar instructions of BB1, total[1], total[2], total[3] are elements of a one-dimensional array with consecutive addresses, and a[i][1], a[i][2], a[i][3] are source operands; a[i][1] to a[i][3] belong to a two-dimensional array with consecutive addresses (that is, the addresses are consecutive within row i across the 3 columns, while a[i+1][1] and a[i][1] are not consecutive, i being an arbitrary value).
The loop expansion of BB1 by X = 2 times can be expressed as follows:
Loop-expanding the destination operands X = 2 times gives the expanded destination operands: total[1], total[2], total[3], total[1], total[2], total[3]. Since total[1], total[2], and total[3] belong to a one-dimensional array with consecutive addresses, the addresses of the expanded destination operands remain consecutive after the destination operands are loop-expanded.
Loop-expanding the source operands X = 2 times gives the expanded source operands: a[1][1], a[1][2], a[1][3], a[2][1], a[2][2], a[2][3]. Since a[i][1], a[i][2], a[i][3] have consecutive addresses within row i across the 3 columns, a[1][1], a[1][2], a[1][3] are consecutive and a[2][1], a[2][2], a[2][3] are consecutive, while a[1][3] and a[2][1] are not consecutive.
It can be seen that the number of expanded destination operands and the number of expanded source operands are both 6, i.e., 3 times the number of vectorizable scalar instructions, 2. After the expanded target scalar instructions are formed from the expanded destination operands and the expanded source operands, the number of expanded target scalar instructions is 6, which is likewise 3 times the number of vectorizable scalar instructions, 2. The expanded target scalar instructions can therefore be fully vectorized with respect to the number of vectorizable scalar instructions.
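A hedged sketch of the expanded form of BB1 (N = 3, S = 2, X = 2): the loop step becomes the expansion count, and each target scalar instruction is copied for iterations i and i + 1, so the expanded block holds 6 instructions, a multiple of S = 2. A remainder loop would be needed if the iteration count is not a multiple of X; the bounds are illustrative.

    void bb1_expanded(double total[4], double a[][4], int rows) {
        for (int i = 1; i + 1 <= rows; i += 2) {    /* step = expansion count X */
            /* original iteration i */
            total[1] += a[i][1];
            total[2] += a[i][2];
            total[3] += a[i][3];
            /* copied iteration i + 1 */
            total[1] += a[i + 1][1];
            total[2] += a[i + 1][2];
            total[3] += a[i + 1][3];
        }
    }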
In the embodiment of the invention, the target basic block is loop-expanded by the target loop expansion count, but this differs from loop vectorization. Loop vectorization expands the loop structure formed by the loop statements of the source program, i.e., it is based on unrolling along the loop iteration direction of the source program. For example, in loop vectorization, if x += y[i] occurs in one loop iteration and x += y[i+1] occurs in the next, and the addresses of y[i] and y[i+1] are consecutive, then x += y[i] and x += y[i+1], being consecutive accesses along the loop iteration direction of the source program, can be vectorized directly.
In the technical solution provided by the embodiment of the invention, vectorization is instead based on consecutive accesses within a single loop iteration: within one iteration, the scalar instructions with consecutive addresses (the target scalar instructions) are loop-expanded so that the number of expanded elements (expanded target scalar instructions) is padded to an integer multiple of the number of vectorizable scalar instructions supported by the processor. For example, when vectorizing with the technical solution provided by the embodiment of the invention, if x1 += y[i] and x2 += y[i+1] occur in one loop iteration and the addresses of y[i] and y[i+1] are consecutive, then x1 += y[i] and x2 += y[i+1] can be taken as the target scalar instructions; within that iteration, the target scalar instructions are loop-expanded by the determined target loop expansion count so that the number of expanded target scalar instructions is an integer multiple of the number of vectorizable scalar instructions supported by the processor, and the expanded target scalar instructions are then vectorized.
It can be seen that loop vectorization essentially requires that the addresses of the scalar instructions be consecutive along the loop iteration direction, and during vectorization the expanded scalar instructions are vectorized along that direction. In the technical solution provided by the embodiment of the invention, vectorization is preferentially performed on the consecutive memory accesses within the expanded target basic block (the expanded target scalar instructions obtained by loop-expanding the target scalar instructions with consecutive addresses), with loop expansion used to round the instruction count up to a multiple of the number of vectorizable scalar instructions supported by the processor.
With continued reference to fig. 3, the method further includes:
Step S205, performing vectorization processing on the expanded basic block to obtain a converted vector instruction.
The expanded basic block includes an integer multiple of the number of vectorizable scalar instructions of the expanded target scalar instruction, and thus, complete vectorization of the expanded basic block can be achieved.
In one embodiment, the implementation of complete vectorization of the expanded target scalar instructions, that is, the implementation of step S205, may include:
Grouping the expanded destination operands according to the number of the vectorizable scalar instructions to obtain an expanded destination operation array; reading source operands according to the number of the vectorizable scalar instructions and according to the addresses of the expanded source operands, and processing the read source operands so that the read source operands are consistent with the expanded source operands to obtain groups of expanded source operation arrays, wherein each group of expanded source operation arrays comprises the number of the vectorizable scalar instructions; and combining each set of expanded destination operation arrays with each set of expanded source operation arrays to obtain a converted vector instruction.
The description continues with the example where the number of vectorizable scalar instructions is S = 2, the number of target scalar instructions is N = 3, and the target loop expansion count is X = 2.
The expanded destination operands are: total[1], total[2], total[3], total[1], total[2], total[3]. During vectorization, these 6 expanded destination operands may be divided into 3 expanded destination operation arrays of S = 2 operands each: {total[1], total[2]}, {total[3], total[1]}, {total[2], total[3]}.
The expanded source operands are: a[1][1], a[1][2], a[1][3], a[2][1], a[2][2], a[2][3]. The corresponding operands are then accessed, i.e., read, in groups of S = 2 according to the addresses of the expanded source operands.
For example, in a specific implementation, the addresses of the expanded source operands may be loaded into the vector registers in groups of S = 2. Since addresses are loaded into a vector register contiguously, for the operands a[1][1], a[1][2], a[1][3], a[2][1], a[2][2], a[2][3], the loaded address groups are {a[1][1], a[1][2]}, {a[1][3], a[1][4]} and {a[2][1], a[2][2]}, {a[2][3], a[2][4]}, where a[1][4] denotes the address immediately following a[1][3] and a[2][4] denotes the address immediately following a[2][3]. The data read from the loaded addresses therefore includes unneeded source operands: the data read from address a[1][4] and the data read from address a[2][4] (i.e., data that does not correspond to the expanded source operands).
Thus, in the vector registers, the read source operands are further processed according to the expanded source operands: for example, the data a[2][1] read from address a[2][1] is spliced after the data a[1][3] read from address a[1][3] using a splice (concatenation) instruction, so as to replace the unneeded data read from address a[1][4].
Finally, groups of expanded source operation arrays are obtained, each containing S = 2 read source operands of the vectorizable scalar instructions: {a[1][1], a[1][2]}, {a[1][3], a[2][1]}, {a[2][2], a[2][3]}.
Each group of expanded source operation arrays is then combined with the corresponding group of expanded destination operation arrays to obtain the converted vector instructions.
The converted vector instructions may be expressed as:
For the first group {total[1], total[2]} and the first group of expanded source operation arrays {a[1][1], a[1][2]}, the resulting converted vector instruction 1 is: vect_total_1_2 += vect_a_1_2[i], i = 1, 2.
For the second group {total[3], total[1]} and the second group of expanded source operation arrays {a[1][3], a[2][1]}, the resulting converted vector instruction 2 is: vect_total_3_1 += <a[i][3], a[i+1][1]>, i = 1, 2.
For the third group {total[2], total[3]} and the third group of expanded source operation arrays {a[2][2], a[2][3]}, the resulting converted vector instruction 3 is: vect_total_2_3 += vect_a_2_3[i+1], i = 1, 2.
The addresses within each group of expanded destination operands are consecutive, and the groups of expanded source operation arrays include expanded source operands with consecutive addresses (vect_a_1_2[i], vect_a_2_3[i+1]) as well as with non-consecutive addresses (a[i][3], a[i+1][1]).
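A hedged scalar simulation of the three converted 2-wide vector instructions described above (no real SIMD intrinsics are used; each pair of additions stands in for one vector add):

    void bb1_vectorized_once(double total[4], double a[][4], int i) {
        /* vect_total_1_2 += vect_a_1_2[i]         (contiguous source)      */
        total[1] += a[i][1];
        total[2] += a[i][2];
        /* vect_total_3_1 += <a[i][3], a[i+1][1]>  (non-contiguous source,
           assembled e.g. by splicing two loads)                            */
        total[3] += a[i][3];
        total[1] += a[i + 1][1];
        /* vect_total_2_3 += vect_a_2_3[i+1]       (contiguous source)      */
        total[2] += a[i + 1][2];
        total[3] += a[i + 1][3];
    }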
The above embodiment corresponds to the case where a candidate basic block whose number of scalar instructions satisfies the number of vectorizable scalar instructions but is not an integer multiple of it is determined as the target basic block. The corresponding converted vector instructions include:
A vector instruction having consecutive addresses of destination operands and consecutive addresses of source operands;
And vector instructions in which addresses of destination operands are consecutive, but addresses of source operands are not consecutive.
The vector instructions with consecutive destination-operand addresses and consecutive source-operand addresses are vect_total_1_2 += vect_a_1_2[i] and vect_total_2_3 += vect_a_2_3[i+1], where i = 1, 2.
The vector instruction with consecutive destination-operand addresses but non-consecutive source-operand addresses is vect_total_3_1 += <a[i][3], a[i+1][1]>, i = 1, 2.
Take loop execution of the target basic block BB1 in the source program with i < N (N = 3) as an example, i.e., the case where the number of target scalar instructions of the target basic block is N = 3 and the number of vectorizable scalar instructions is S = 2; the target basic block BB1 then needs to execute 2 loop iterations. In the embodiment of the invention, expansion is performed with the target loop expansion count X = 2, and the loop step becomes the target loop expansion count, so executing one iteration is equivalent to completing 2 iterations. After the instruction conversion method provided by the embodiment of the invention is executed once, execution of the target basic block BB1 is complete, which is equivalent to executing 2 loop iterations in the original method.
In the case of N = 3 and S = 2, the original method splits the target scalar instructions of the target basic block BB1: in each loop iteration, 2 target scalar instructions are vectorized into one converted vector instruction while the remaining one stays a scalar instruction, yielding the converted vector instruction vect_total_1_2 += vect_a_1_2[i] and the scalar instruction total[3] += a[i][3]. That is, in the conventional method, each loop iteration requires 2 memory accesses (an access for the source operands of the vector instruction and an access for the source operand of the scalar instruction) and 2 addition operations, so completing both loop iterations requires 4 memory accesses and 4 addition operations.
In the embodiment of the invention, since the loop step equals the target loop expansion count, loop execution of the target basic block BB1 is completed after a single pass, and 3 converted vector instructions are obtained, one of which is a vector instruction whose expanded source operands have non-consecutive addresses. Only 2 memory accesses and 3 addition operations therefore need to be executed; compared with the original method, the embodiment of the invention saves 2 memory accesses and 1 addition operation and effectively improves the running speed of the program.
When a candidate basic block whose number of scalar instructions does not satisfy the number of vectorizable scalar instructions is determined as the target basic block, for example a target basic block BB2 with N = 3 target scalar instructions while the number of vectorizable scalar instructions is S = 4, the corresponding vectorization process may refer to fig. 4 and fig. 5: fig. 4 is a schematic structural diagram of the target basic block, and fig. 5 is a schematic diagram of the result of processing the target basic block with the instruction conversion method provided by the embodiment of the invention.
As shown in fig. 4, the target scalar instructions of the target basic block BB2 are: t1 += a11, t2 += a12, t3 += a13, where t1-t3 belong to a one-dimensional array with consecutive addresses and a11-a13 belong to a two-dimensional array with consecutive addresses.
First, the target loop expansion count X = 4 of BB2 is determined from N = 3 and S = 4, and the target scalar instructions of BB2 are then loop-expanded 4 times, giving the expanded destination operands: t1, t2, t3, t1, t2, t3, t1, t2, t3, t1, t2, t3. During vectorization, these 12 expanded destination operands may be divided into 3 groups according to S = 4, giving 3 expanded destination operation arrays: {t1, t2, t3, t1}, {t2, t3, t1, t2}, {t3, t1, t2, t3}.
The expanded source operands are: a11, a12, a13, a21, a22, a23, a31, a32, a33, a41, a42, a43. The corresponding operands are then accessed, i.e., read, according to the addresses of the expanded source operands.
For example, the addresses of the expanded source operands may be loaded into vector registers in groups of S = 4: {a11, a12, a13, a14}, {a21, a22, a23, a24}, {a31, a32, a33, a34}, {a41, a42, a43, a44}, where a14 is assumed to be the address following a13, a24 the address following a23, a34 the address following a33, and a44 the address following a43. Each source operand is then read according to the loaded addresses, and the read source operands are spliced to obtain the groups of expanded source operation arrays, each containing S = 4 expanded source operands: {a11, a12, a13, a21}, {a22, a23, a31, a32}, {a33, a41, a42, a43}, so that the read source operands in each group of the expanded source operation array are consistent with the expanded source operands.
Finally, each group of expanded source operation arrays is combined with the corresponding group of expanded destination operation arrays to obtain the converted vector instructions.
As shown in fig. 5, the converted vector instruction may be expressed as:
For the first group of expanded destination operation arrays {t1, t2, t3, t1} and the first group of expanded source operation arrays {a11, a12, a13, a21}, a converted vector instruction is obtained in which the addresses of a11, a12 and a13 are consecutive and the addresses of a13 and a21 are not consecutive.
For the second group of expanded destination operation arrays {t2, t3, t1, t2} and the second group of expanded source operation arrays {a22, a23, a31, a32}, a converted vector instruction is obtained in which the addresses of a22 and a23 are consecutive, the addresses of a31 and a32 are consecutive, and the addresses of a23 and a31 are not consecutive.
For the third group of expanded destination operation arrays {t3, t1, t2, t3} and the third group of expanded source operation arrays {a33, a41, a42, a43}, a converted vector instruction is obtained in which the addresses of a41, a42 and a43 are consecutive and the addresses of a33 and a41 are not consecutive.
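To make the combination concrete, the following illustrative C sketch models each converted vector instruction as a loop over its S=4 lanes (a scalar stand-in for real SIMD execution, since a destination group may reference the same element twice) and checks that the result matches what the original target scalar instructions of BB2 produce over the 4 loop iterations; the data values are invented for the check and are not from the embodiment.

#include <stdio.h>

int main(void) {
    double a[4][3] = {{11, 12, 13}, {21, 22, 23}, {31, 32, 33}, {41, 42, 43}};
    double t[3] = {0, 0, 0};        /* t1, t2, t3 */
    double t_ref[3] = {0, 0, 0};

    /* Reference: the original target scalar instructions of BB2,
     * t1 += a[i][0]; t2 += a[i][1]; t3 += a[i][2]; over 4 iterations. */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            t_ref[j] += a[i][j];

    /* The three converted vector instructions; each row pairs one group of
     * expanded destination operands with one group of expanded source operands. */
    int dst[3][4] = {{0, 1, 2, 0}, {1, 2, 0, 1}, {2, 0, 1, 2}};   /* t-indices  */
    const double *src = &a[0][0];                                  /* contiguous */
    for (int g = 0; g < 3; g++)
        for (int lane = 0; lane < 4; lane++)       /* one "vector add" per g */
            t[dst[g][lane]] += src[4 * g + lane];

    printf("vectorized: %.0f %.0f %.0f\n", t[0], t[1], t[2]);
    printf("reference : %.0f %.0f %.0f\n", t_ref[0], t_ref[1], t_ref[2]);
    return 0;
}

Both lines print the same three sums, which indicates that the grouping preserves the reduction semantics of the original scalar instructions.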
When the target basic block is a candidate basic block whose number of scalar instructions does not satisfy the number of vectorizable scalar instructions, the converted vector instructions are vector instructions in which the addresses of the destination operands are consecutive but the addresses of the source operands are not consecutive.
A target basic block that does not satisfy the number of vectorizable scalar instructions is typically one whose number of target scalar instructions N is smaller than the number of vectorizable scalar instructions S. Therefore, in the expanded target scalar instructions obtained after expansion by the target loop expansion count X, the destination operands and the source operands are typically obtained by splicing; and because the source operands belong to a two-dimensional array, address continuity can only be guaranteed among source operands that share the same index i. The converted vector instruction is therefore a vector instruction in which the addresses of the destination operands are consecutive but the addresses of the source operands are not consecutive.
Therefore, in the embodiment of the present invention, basic blocks that cannot be vectorized at all and basic blocks that can be vectorized only partially are both determined as target basic blocks, and the target loop expansion count used to loop-expand a target basic block is obtained from the least common multiple of the number of target scalar instructions of the target basic block and the number of vectorizable scalar instructions. As a result, the number of expanded target scalar instructions in the expanded basic block is an integer multiple of the number of vectorizable scalar instructions, the expanded target scalar instructions can be fully vectorized, program performance is improved, the program runs faster, and the computing power of the hardware is better exploited.
Compared with the automatic vectorization process shown in fig. 1, the method provided by the embodiment of the present invention can, on the basis of the process shown in fig. 1, loop-expand the target basic blocks that pass data dependence analysis but cannot be fully vectorized (S=2, N=3) to obtain expanded basic blocks that can be fully vectorized, and can likewise loop-expand the target basic blocks that merely have vectorization potential (S=4, N=3), i.e. basic blocks that pass data dependence analysis but whose number of scalar instructions is smaller than the number of vectorizable scalar instructions, to obtain expanded basic blocks that can be fully vectorized. Full vectorization of the target basic blocks is thus achieved without affecting the compiler's automatic vectorization.
Referring to fig. 6, fig. 6 is a flowchart of automatic vectorization corresponding to the instruction conversion method according to the embodiment of the present invention.
On the basis of the automatic vectorization process shown in fig. 1, step 31 is added: the target basic block is expanded by the target loop expansion count, and vectorization processing is performed on the expanded basic block. That is, for a target basic block that passes data dependence analysis, whether it succeeds or fails in the SLP feasibility analysis, the embodiment of the present invention can expand it by the target loop expansion count, so that target basic blocks that could not otherwise be fully vectorized (including the target basic block with S=2 and N=3 and the target basic block with S=4 and N=3) can be fully vectorized by the instruction conversion method provided in the embodiment of the present invention.
Therefore, the instruction conversion method provided by the embodiment of the present invention is compatible with the original automatic vectorization flow, and loop-expands the basic blocks that could not be fully vectorized in the original automatic vectorization process, thereby achieving full vectorization of the target scalar instructions.
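The placement of step 31 within the flow of fig. 6 can be summarized by the following illustrative C sketch; the names candidate_kind and target_expansion_count are hypothetical and do not correspond to any particular compiler API, and the classification merely restates the two kinds of target basic blocks described above.

#include <stdio.h>

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

/* Classify a candidate basic block with N scalar instructions against the
 * number S of vectorizable scalar instructions supported by the processor. */
static const char *candidate_kind(int n, int s) {
    if (n % s == 0) return "already fully vectorizable, no expansion needed";
    if (n > s)      return "partially vectorizable target basic block (e.g. S=2, N=3)";
    return "target basic block below the vector width (e.g. S=4, N=3)";
}

/* Target loop expansion count X = lcm(N, S) / N; X = 1 means no expansion. */
static int target_expansion_count(int n, int s) {
    return (n / gcd(n, s) * s) / n;
}

int main(void) {
    int cases[3][2] = {{3, 2}, {3, 4}, {4, 4}};
    for (int c = 0; c < 3; c++) {
        int n = cases[c][0], s = cases[c][1];
        printf("N=%d, S=%d: X=%d, %s\n",
               n, s, target_expansion_count(n, s), candidate_kind(n, s));
    }
    return 0;
}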
The embodiment of the present invention further provides an instruction conversion apparatus, which is used to implement the instruction conversion method described in the foregoing embodiments.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an instruction conversion apparatus according to an embodiment of the present invention.
As shown in fig. 7, the apparatus includes:
a target basic block determination module 71, configured to determine a target basic block, where the target basic block is determined based on a source program and comprises a plurality of target scalar instructions;
a loop expansion count determination module 72, configured to determine a target loop expansion count corresponding to the target basic block at least according to the number of the plurality of target scalar instructions;
a loop expansion module 73, configured to expand the target basic block based on the target loop expansion count to obtain an expanded basic block, where the expanded basic block comprises a plurality of expanded target scalar instructions, and the number of the plurality of expanded target scalar instructions is an integer multiple of the number of vectorizable scalar instructions supported by the processor;
and a vector instruction generation module 74, configured to perform vectorization processing on the expanded basic block to obtain a converted vector instruction.
With continued reference to fig. 7, the apparatus may further include:
a processor configuration parameter acquisition module 700, configured to acquire configuration parameters of the processor;
and a vectorizable scalar instruction number determination module 701, configured to determine the number of vectorizable scalar instructions supported by the processor according to the configuration parameters of the processor.
In this case, the loop expansion count determination module 72 being configured to determine the target loop expansion count corresponding to the target basic block at least according to the number of the plurality of target scalar instructions includes:
determining the target loop expansion count corresponding to the target basic block according to the number of the plurality of target scalar instructions and the number of vectorizable scalar instructions.
Optionally, the loop expansion count determination module 72 being configured to determine the target loop expansion count corresponding to the target basic block according to the number of the plurality of target scalar instructions and the number of vectorizable scalar instructions includes:
determining a least common multiple of the number of the plurality of target scalar instructions and the number of vectorizable scalar instructions;
dividing the least common multiple by the number of the plurality of target scalar instructions, and determining the obtained result as the target loop expansion count corresponding to the target basic block.
Optionally, the loop expansion module 73 being configured to expand the target basic block based on the target loop expansion count to obtain an expanded basic block includes:
loop-expanding the destination operands of the target scalar instructions based on the target loop expansion count to obtain expanded destination operands, where the number of the expanded destination operands is an integer multiple of the number of vectorizable scalar instructions;
loop-expanding the source operands of the target scalar instructions based on the target loop expansion count to obtain expanded source operands, where the number of the expanded source operands is an integer multiple of the number of vectorizable scalar instructions;
and forming expanded target scalar instructions based on the expanded destination operands and the expanded source operands, to obtain the expanded basic block.
Optionally, the vector instruction generation module 74 being configured to perform vectorization processing on the expanded basic block to obtain a converted vector instruction includes:
grouping the expanded destination operands according to the number of vectorizable scalar instructions to obtain groups of expanded destination operation arrays;
reading source operands by the number of vectorizable scalar instructions according to the addresses of the expanded source operands, and processing the read source operands so that the read source operands are consistent with the expanded source operands, to obtain groups of expanded source operation arrays, where each group of expanded source operation arrays contains a number of expanded source operands equal to the number of vectorizable scalar instructions;
and combining each group of expanded destination operation arrays with the corresponding group of expanded source operation arrays to obtain the converted vector instructions.
Optionally, the target basic block determination module 71 being configured to determine a target basic block includes:
acquiring a source program, where the source program comprises a plurality of basic blocks, one basic block comprises a plurality of scalar instructions, and the basic blocks are located in a loop of the source program;
determining candidate basic blocks from the basic blocks, where a candidate basic block comprises a plurality of consecutive scalar instructions having the same operation type;
and determining the target basic block from the candidate basic blocks based on the number of scalar instructions of the candidate basic blocks.
Optionally, the target basic block determination module 71 being configured to determine the target basic block from the candidate basic blocks based on the number of scalar instructions of the candidate basic blocks includes:
determining, as the target basic block, a candidate basic block whose number of scalar instructions satisfies the number of vectorizable scalar instructions but is not an integer multiple of the number of vectorizable scalar instructions.
Optionally, the target basic block determination module 71 being configured to determine the target basic block from the candidate basic blocks based on the number of scalar instructions of the candidate basic blocks includes:
determining, as the target basic block, a candidate basic block whose number of scalar instructions does not satisfy the number of vectorizable scalar instructions.
Therefore, when the instruction conversion apparatus provided by the embodiment of the present invention performs automatic vectorization of scalar instructions, the target loop expansion count corresponding to the target basic block is calculated at least according to the number of target scalar instructions included in the target basic block, and each target scalar instruction of the target basic block is loop-expanded by the target loop expansion count, so that the number of expanded target scalar instructions in the expanded basic block is an integer multiple of the number of vectorizable scalar instructions supported by the processor. The expanded target scalar instructions in the expanded basic block can therefore be divided according to the number of vectorizable scalar instructions and combined into a plurality of vector instructions, realizing vectorization of the expanded target scalar instructions. In this way, the embodiment of the present invention makes the number of expanded target scalar instructions in the expanded basic block an integer multiple of the number of vectorizable scalar instructions, enables the target basic block to be vectorized after expansion, reduces the limitation on vectorization of scalar instructions, and increases the number of scalar instructions converted into vector instructions.
An embodiment of the present invention provides an electronic device, for example a computer device such as a terminal device or a server device, comprising a memory storing a program and a processor that calls the program stored in the memory to execute the instruction conversion method according to any one of the foregoing embodiments.
An embodiment of the present invention provides a storage medium storing a program that, when executed, implements the instruction conversion method according to any one of the foregoing embodiments.
An embodiment of the present invention provides a computer program product comprising a computer program which, when executed by a processor, implements the instruction conversion method according to any of the preceding embodiments.
Several embodiments of the present invention are described above. Where no conflict arises, the alternatives presented by the various embodiments may be combined and cross-referenced with one another to extend to further possible embodiments, all of which may be regarded as embodiments disclosed by the present invention.
Although the embodiments of the present invention are disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by any person skilled in the art without departing from the spirit and scope of the invention, and the scope of protection of the present invention shall therefore be subject to the scope defined by the appended claims.

Claims (21)

1. An instruction conversion method, comprising:
determining a target basic block, wherein the target basic block is determined based on a source program and comprises a plurality of target scalar instructions;
determining a target loop expansion count corresponding to the target basic block at least according to the number of the plurality of target scalar instructions;
expanding the target basic block based on the target loop expansion count to obtain an expanded basic block, wherein the expanded basic block comprises a plurality of expanded target scalar instructions, and the number of the plurality of expanded target scalar instructions is an integer multiple of the number of vectorizable scalar instructions supported by a processor;
and performing vectorization processing on the expanded basic block to obtain a converted vector instruction.
2. The instruction conversion method according to claim 1, further comprising:
acquiring configuration parameters of the processor;
determining the number of vectorizable scalar instructions supported by the processor according to the configuration parameters of the processor;
wherein the determining a target loop expansion count corresponding to the target basic block at least according to the number of the plurality of target scalar instructions comprises:
determining the target loop expansion count corresponding to the target basic block according to the number of the plurality of target scalar instructions and the number of vectorizable scalar instructions.
3. The instruction conversion method according to claim 2, wherein the determining the target loop expansion count corresponding to the target basic block according to the number of the plurality of target scalar instructions and the number of vectorizable scalar instructions comprises:
determining a least common multiple of the number of the plurality of target scalar instructions and the number of vectorizable scalar instructions;
dividing the least common multiple by the number of the plurality of target scalar instructions, and determining the obtained result as the target loop expansion count corresponding to the target basic block.
4. The instruction conversion method according to claim 3, wherein the expanding the target basic block based on the target loop expansion count to obtain an expanded basic block comprises:
loop-expanding destination operands of the target scalar instructions based on the target loop expansion count to obtain expanded destination operands, wherein the number of the expanded destination operands is an integer multiple of the number of vectorizable scalar instructions;
loop-expanding source operands of the target scalar instructions based on the target loop expansion count to obtain expanded source operands, wherein the number of the expanded source operands is an integer multiple of the number of vectorizable scalar instructions;
and forming expanded target scalar instructions based on the expanded destination operands and the expanded source operands, to obtain the expanded basic block.
5. The instruction conversion method according to claim 4, wherein the performing vectorization processing on the expanded basic block to obtain a converted vector instruction comprises:
grouping the expanded destination operands according to the number of vectorizable scalar instructions to obtain groups of expanded destination operation arrays;
reading source operands by the number of vectorizable scalar instructions according to the addresses of the expanded source operands, and processing the read source operands so that the read source operands are consistent with the expanded source operands, to obtain groups of expanded source operation arrays, wherein each group of expanded source operation arrays contains a number of expanded source operands equal to the number of vectorizable scalar instructions;
and combining each group of expanded destination operation arrays with the corresponding group of expanded source operation arrays to obtain the converted vector instruction.
6. The instruction conversion method according to any one of claims 1 to 5, wherein the determining a target basic block comprises:
acquiring the source program, wherein the source program comprises a plurality of basic blocks, one basic block comprises a plurality of scalar instructions, and the basic blocks are located in a loop of the source program;
determining candidate basic blocks from the basic blocks, wherein a candidate basic block comprises a plurality of consecutive scalar instructions having the same operation type;
and determining the target basic block from the candidate basic blocks based on the number of scalar instructions of the candidate basic blocks.
7. The instruction conversion method according to claim 6, wherein the determining the target basic block from the candidate basic blocks based on the number of scalar instructions of the candidate basic blocks comprises:
determining, as the target basic block, a candidate basic block whose number of scalar instructions satisfies the number of vectorizable scalar instructions but is not an integer multiple of the number of vectorizable scalar instructions.
8. The instruction conversion method according to claim 7, wherein the converted vector instruction comprises:
a vector instruction in which the addresses of the destination operands are consecutive and the addresses of the source operands are consecutive;
and a vector instruction in which the addresses of the destination operands are consecutive but the addresses of the source operands are not consecutive.
9. The instruction conversion method according to claim 6, wherein the determining the target basic block from the candidate basic blocks based on the number of scalar instructions of the candidate basic blocks comprises:
determining, as the target basic block, a candidate basic block whose number of scalar instructions does not satisfy the number of vectorizable scalar instructions.
10. The instruction conversion method according to claim 9, wherein the converted vector instruction is: a vector instruction in which the addresses of the destination operands are consecutive but the addresses of the source operands are not consecutive.
11. An instruction conversion apparatus, comprising:
a target basic block determination module, configured to determine a target basic block, wherein the target basic block is determined based on a source program and comprises a plurality of target scalar instructions;
a loop expansion count determination module, configured to determine a target loop expansion count corresponding to the target basic block at least according to the number of the plurality of target scalar instructions;
a loop expansion module, configured to expand the target basic block based on the target loop expansion count to obtain an expanded basic block, wherein the expanded basic block comprises a plurality of expanded target scalar instructions, and the number of the plurality of expanded target scalar instructions is an integer multiple of the number of vectorizable scalar instructions supported by a processor;
and a vector instruction generation module, configured to perform vectorization processing on the expanded basic block to obtain a converted vector instruction.
12. The instruction conversion apparatus according to claim 11, further comprising:
a processor configuration parameter acquisition module, configured to acquire configuration parameters of the processor;
and a vectorizable scalar instruction number determination module, configured to determine the number of vectorizable scalar instructions supported by the processor according to the configuration parameters of the processor;
wherein the loop expansion count determination module being configured to determine the target loop expansion count corresponding to the target basic block at least according to the number of the plurality of target scalar instructions comprises:
determining the target loop expansion count corresponding to the target basic block according to the number of the plurality of target scalar instructions and the number of vectorizable scalar instructions.
13. The instruction conversion apparatus according to claim 12, wherein the loop expansion count determination module being configured to determine the target loop expansion count corresponding to the target basic block according to the number of the plurality of target scalar instructions and the number of vectorizable scalar instructions comprises:
determining a least common multiple of the number of the plurality of target scalar instructions and the number of vectorizable scalar instructions;
dividing the least common multiple by the number of the plurality of target scalar instructions, and determining the obtained result as the target loop expansion count corresponding to the target basic block.
14. The instruction conversion apparatus according to claim 13, wherein the loop expansion module being configured to expand the target basic block based on the target loop expansion count to obtain an expanded basic block comprises:
loop-expanding destination operands of the target scalar instructions based on the target loop expansion count to obtain expanded destination operands, wherein the number of the expanded destination operands is an integer multiple of the number of vectorizable scalar instructions;
loop-expanding source operands of the target scalar instructions based on the target loop expansion count to obtain expanded source operands, wherein the number of the expanded source operands is an integer multiple of the number of vectorizable scalar instructions;
and forming expanded target scalar instructions based on the expanded destination operands and the expanded source operands, to obtain the expanded basic block.
15. The instruction conversion apparatus according to claim 14, wherein the vector instruction generation module being configured to perform vectorization processing on the expanded basic block to obtain the converted vector instruction comprises:
grouping the expanded destination operands according to the number of vectorizable scalar instructions to obtain groups of expanded destination operation arrays;
reading source operands by the number of vectorizable scalar instructions according to the addresses of the expanded source operands, and processing the read source operands so that the read source operands are consistent with the expanded source operands, to obtain groups of expanded source operation arrays, wherein each group of expanded source operation arrays contains a number of expanded source operands equal to the number of vectorizable scalar instructions;
and combining each group of expanded destination operation arrays with the corresponding group of expanded source operation arrays to obtain the converted vector instruction.
16. The instruction conversion apparatus according to any one of claims 11 to 15, wherein the target basic block determination module being configured to determine a target basic block comprises:
acquiring the source program, wherein the source program comprises a plurality of basic blocks, one basic block comprises a plurality of scalar instructions, and the basic blocks are located in a loop of the source program;
determining candidate basic blocks from the basic blocks, wherein a candidate basic block comprises a plurality of consecutive scalar instructions having the same operation type;
and determining the target basic block from the candidate basic blocks based on the number of scalar instructions of the candidate basic blocks.
17. The instruction conversion apparatus according to claim 16, wherein the target basic block determination module being configured to determine the target basic block from the candidate basic blocks based on the number of scalar instructions of the candidate basic blocks comprises:
determining, as the target basic block, a candidate basic block whose number of scalar instructions satisfies the number of vectorizable scalar instructions but is not an integer multiple of the number of vectorizable scalar instructions.
18. The instruction conversion apparatus according to claim 16, wherein the target basic block determination module being configured to determine the target basic block from the candidate basic blocks based on the number of scalar instructions of the candidate basic blocks comprises:
determining, as the target basic block, a candidate basic block whose number of scalar instructions does not satisfy the number of vectorizable scalar instructions.
19. An electronic device, comprising a memory and a processor, wherein a program is stored in the memory and the processor calls the program stored in the memory to perform the instruction conversion method according to any one of claims 1 to 10.
20. A storage medium storing a program which, when executed, implements the instruction conversion method according to any one of claims 1 to 10.
21. A computer program product comprising a computer program which, when executed by a processor, implements the instruction conversion method of any of claims 1-10.