WO2022160863A1 - Program data-level parallel analysis method, apparatus, and related device - Google Patents

Program data-level parallel analysis method, apparatus, and related device

Info

Publication number
WO2022160863A1
WO2022160863A1 PCT/CN2021/130179 CN2021130179W WO2022160863A1 WO 2022160863 A1 WO2022160863 A1 WO 2022160863A1 CN 2021130179 W CN2021130179 W CN 2021130179W WO 2022160863 A1 WO2022160863 A1 WO 2022160863A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
calculation
read
dependency
addresses
Prior art date
Application number
PCT/CN2021/130179
Other languages
English (en)
French (fr)
Inventor
宋昌
王炯
张勇
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022160863A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Definitions

  • The present application relates to the field of computer technology, and in particular to a program data-level parallel analysis method, apparatus, and related device.
  • Single instruction, multiple data (SIMD) is a technique that performs the same operation on each element of a set of data (a data vector) at the same time, thereby achieving data-level parallel computing.
  • Processors integrate components that implement a SIMD instruction set to improve the parallelism of application programs, thereby improving the execution performance of the processor.
  • The present application provides a program data-level parallel analysis method, apparatus, and related device, which can improve the efficiency of finding SIMDizable code in an application program and save manpower and time.
  • In a first aspect, a program data-level parallel analysis method is provided, comprising:
  • acquiring a read instruction executed by a processor, where the read instruction is used to acquire parameters required by a calculation instruction; and
  • determining, according to the dependency between the read instruction and the calculation instruction, whether the calculation instruction is SIMDizable, wherein the calculation instruction depends on the read instruction.
  • The program data-level parallel analysis method provided by the embodiments of the present application obtains a read instruction executed by the processor while the application program is running, and then determines whether a calculation instruction is SIMDizable according to the dependency between the read instruction and the calculation instruction that depends on the read instruction for its required parameters.
  • The method can quickly identify the SIMDizable code in the application, improve the efficiency of finding SIMDizable code, and save manpower and time.
  • The above method can be used as a sampling analysis process performed by the program data-level parallel analysis device while the application program is running. The device needs to analyze only the calculation instructions in each sampling cycle to determine whether they are SIMDizable, which reduces the impact of the analysis process on application performance and reduces the analysis overhead.
  • Before determining whether the calculation instruction is SIMDizable according to the dependency between the read instruction and the calculation instruction, the method further includes:
  • acquiring n addresses corresponding to n consecutive read operations performed by the read instruction, where n is a natural number greater than 2.
  • The determining whether the calculation instruction is SIMDizable according to the dependency between the read instruction and the calculation instruction includes:
  • determining whether the calculation instruction is SIMDizable according to the dependency between the read instruction and the calculation instruction and the n addresses.
  • In addition to using the dependency between the read instruction and the calculation instruction, the above method obtains the n addresses corresponding to n consecutive read operations performed by the read instruction and uses them to further determine whether the calculation instruction is SIMDizable, which improves the accuracy of the determination.
  • The determining whether the calculation instruction is SIMDizable according to the dependency between the read instruction and the calculation instruction and the n addresses includes:
  • determining whether the calculation instruction is SIMDizable according to the dependency between the read instruction and the calculation instruction, the n addresses, and the SIMD instruction set.
  • In addition to using the dependency between the read instruction and the calculation instruction and the n addresses, the above method also consults the SIMD instruction set, which further improves the accuracy of determining whether a calculation instruction is SIMDizable.
  • The method further includes:
  • when the read instruction executed by the processor is acquired multiple times, obtaining the number of times that the calculation instruction is determined to be SIMDizable; and generating prompt information when the number of times reaches a preset threshold.
  • The above method generates prompt information when the number of times the calculation instruction is determined to be SIMDizable reaches a preset threshold. The prompt information can be presented or sent to the user, so that the user learns as soon as possible that the application contains SIMDizable code segments and can refer to the prompt information to optimize them.
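  • The threshold-based prompting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the threshold value, the `(pc, is_simdizable)` sample format, and the prompt text are all assumptions introduced for the example.

```python
from collections import Counter

# Hypothetical value; the source only says "a preset threshold".
PRESET_THRESHOLD = 3


def record_verdicts(samples, threshold=PRESET_THRESHOLD):
    """Count SIMDizable verdicts per calculation instruction and collect prompts.

    `samples` is an iterable of (pc, is_simdizable) pairs, one per sampling
    period (an assumed format). A prompt is generated once per instruction,
    at the moment its count first reaches the threshold.
    """
    counts = Counter()
    prompts = []
    for pc, is_simdizable in samples:
        if not is_simdizable:
            continue
        counts[pc] += 1
        if counts[pc] == threshold:
            prompts.append(
                f"instruction at PC {pc:#x} was SIMDizable in {threshold} samples"
            )
    return prompts


samples = [(0x400, True), (0x400, True), (0x408, False), (0x400, True), (0x400, True)]
prompts = record_verdicts(samples)
```

  • Generating the prompt only when the count equals the threshold, rather than whenever it is at or above it, keeps the user from being prompted repeatedly for the same instruction.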
  • The determining whether the calculation instruction is SIMDizable according to the dependency between the read instruction and the calculation instruction and the n addresses includes:
  • determining whether the calculation instruction is SIMDizable according to whether the step size between every two adjacent addresses among the n addresses is equal and not 0, and according to the dependency between the read instruction and the calculation instruction.
  • The determining whether the calculation instruction is SIMDizable according to the dependency between the read instruction and the calculation instruction includes:
  • constructing a dependency graph according to the dependency between the read instruction and the calculation instruction; and
  • determining whether the calculation instruction is SIMDizable according to whether a dependency cycle exists in the dependency graph.
  • In a second aspect, a program data-level parallel analysis device is provided, comprising:
  • an acquisition module, configured to acquire a read instruction executed by the processor, where the read instruction is used to acquire parameters required by the calculation instruction; and
  • a determination module, configured to determine whether the calculation instruction is SIMDizable according to the dependency between the read instruction and the calculation instruction.
  • The acquisition module is further configured to obtain n addresses corresponding to n consecutive read operations performed by the read instruction, where n is a natural number greater than 2;
  • the determination module is specifically configured to determine whether the calculation instruction is SIMDizable according to the dependency between the read instruction and the calculation instruction and the n addresses.
  • The determination module is specifically configured to determine whether the calculation instruction is SIMDizable according to the dependency between the read instruction and the calculation instruction, the n addresses, and the SIMD instruction set.
  • The device further includes a prompt module;
  • the acquisition module is further configured to obtain the number of times that the calculation instruction is determined to be SIMDizable when the read instruction executed by the processor is acquired multiple times;
  • the prompt module is configured to generate prompt information when the number of times reaches a preset threshold.
  • The determination module is specifically configured to determine whether the calculation instruction is SIMDizable according to whether the step size between every two adjacent addresses among the n addresses is equal and not 0, and according to the dependency between the read instruction and the calculation instruction.
  • The determination module is specifically configured to construct a dependency graph according to the dependency between the read instruction and the calculation instruction, and to determine whether the calculation instruction is SIMDizable according to whether a dependency cycle exists in the dependency graph.
  • In a third aspect, a computer device is provided. The computer device includes a processor and a memory; the memory is used for storing instructions, and the processor is used for executing the instructions, so as to implement the method described in the first aspect or in any specific implementation of the first aspect.
  • A non-transitory computer-readable storage medium is provided. The storage medium stores computer-readable instructions, and when the computer-readable instructions are executed, the method described in the first aspect or in any specific implementation of the first aspect is performed.
  • A computer program product is provided, comprising a computer program. When the computer program is read and executed by a computing device, it causes the computing device to perform the method described in the first aspect or in any specific implementation of the first aspect.
  • FIG. 1 is a schematic structural diagram of a processor instruction execution pipeline involved in an embodiment of the present application
  • FIG. 2 is a schematic diagram of a dependency graph provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a dependency graph constructed according to a dependency relationship between a read instruction and a calculation instruction provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of another dependency graph constructed according to a dependency relationship between a read instruction and a calculation instruction provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram of a user interface of a program data level parallel analysis device provided by an embodiment of the present application;
  • FIG. 6 is a schematic flowchart of determining whether a calculation instruction satisfies condition 2 according to n addresses provided by an embodiment of the present application;
  • FIG. 7 is a schematic flowchart of a program data level parallel analysis method provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a program data level parallel analysis device provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Using SIMD technology to optimize application performance is the most commonly used method of application performance optimization.
  • When developing a processor, the processor developer integrates components that implement a SIMD instruction set into the processor, such as components that implement the x86 AVX instruction set or the Advanced RISC Machines (ARM) NEON instruction set.
  • To use these components, the programmer or compiler needs to find the SIMDizable (also called parallelizable) code in the application program and then replace it with the corresponding SIMD instructions from the SIMD instruction set to achieve data-level parallelism.
  • The prior art mainly takes one of two approaches. In the first, the compiler statically scans the source code of the application program to find obviously SIMDizable code; this approach cannot identify implicit SIMDizable code in the application program, cannot achieve optimal performance, and is inefficient. In the second, tools such as Linux performance analysis tools (for example, perf) and Intel performance analysis tools (for example, VTune) are used to determine the key code regions in the application that cause performance problems; developers manually optimize the code in those regions, and performance analysts then read and analyze the code in the key code regions to identify SIMDizable code.
  • When the application is large, this approach requires a great deal of manpower and time and is inefficient.
  • the embodiments of the present application provide a program data level parallel analysis method, apparatus, and related equipment.
  • By function, instructions are mainly divided into control instructions, memory access instructions, and calculation instructions.
  • Control instructions generally refer to transfer (branch) instructions, that is, instructions that do not execute in the sequential statement flow of the program.
  • Branching in the program is implemented through transfer instructions.
  • Memory access instructions include read (load) instructions and write (store) instructions. Memory access instructions can directly access memory to transfer data between memory and the data registers. Specifically, a read instruction loads data from memory into a data register, and a write instruction writes data from a data register into memory.
  • Calculation instructions mainly include arithmetic instructions (including addition, subtraction, multiplication, division, square root, maximum value, minimum value, approximate reciprocal, reciprocal square root, etc.), logical instructions, move instructions, shift instructions, bit extension instructions, and so on.
  • A calculation instruction takes its parameters from the data registers, performs the calculation, and writes the calculation result back to a data register. To write the result back to memory, a write instruction must be provided for the calculation instruction; the write instruction writes the calculation result in the data register back to memory. It follows that the execution of a calculation instruction relies on a read instruction to obtain the parameters it requires.
  • the processor instruction execution pipeline is a method in which the operation of an instruction is divided into a plurality of small steps in order to improve the efficiency of the processor executing instructions, and each step is completed by a special circuit module.
  • the processor may be a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU), or a digital signal processor (digital signal processor, DSP) or the like.
  • The processor instruction execution pipeline generally includes an instruction fetch module 101, a decoding module 102, a waiting queue 103, an execution module 104, a reordering cache module 105, and a write-back module 106.
  • The instruction fetch module 101 is used to fetch the instruction to be executed from memory and put it into the instruction register. Specifically, the instruction fetch module 101 first obtains the program counter (PC) value (that is, the address of the instruction to be executed), then finds the to-be-executed instruction in memory according to the PC value and fetches it into the instruction register.
  • the decoding module 102 is used for fetching the instruction to be executed from the instruction register, and decoding the instruction to be executed to obtain an operation code and an address code corresponding to the instruction to be executed.
  • the opcode indicates the nature of the operation to be performed by the instruction to be executed, that is, what operation to perform, or what to do, such as a read operation, a write operation, an addition calculation, a subtraction calculation, a multiplication calculation, and the like.
  • the address code indicates the address of the parameter required by the instruction to be executed (referring to the position of the parameter in the data register). For example, when performing addition calculation, the addition instruction can find the required parameter in the data register according to the address code.
  • When a computer executes a specified instruction, it must first analyze the operation code and address code of the instruction, determine the nature and method of the operation according to the operation code, find the parameters according to the address code, and then control the other components of the computer to cooperate in completing the function expressed by the instruction. This analysis work is completed by the decoding module 102.
  • After the decoding module 102 decodes the instruction to be executed and obtains the operation code and the address code, the instruction to be executed enters the waiting queue 103 to wait for execution resources.
  • the waiting queue will save the sequence and/or dependency relationship between multiple instructions to be executed, so that the waiting queue can reasonably allocate resources.
  • The execution module 104 is responsible for completing the operations specified by the to-be-executed instruction, thereby implementing its function. For example, suppose the instructions to be executed are a read instruction, calculation instruction 1, and calculation instruction 2, where the read instruction fetches x0, y0 and z0 from memory, calculation instruction 1 adds x0 and y0 to obtain x0+y0, and calculation instruction 2 subtracts z0 from x0+y0 to obtain x0+y0-z0.
  • The waiting queue 103 can allocate resources to the read instruction, calculation instruction 1, and calculation instruction 2 in turn.
  • The execution module 104 first executes the read instruction, fetching x0, y0 and z0 from memory into the data registers, then executes calculation instruction 1 to add x0 and y0 and obtain x0+y0, and finally executes calculation instruction 2 to subtract z0 from x0+y0 and obtain x0+y0-z0.
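  • The three-instruction example above can be traced with a toy model. This is only a sketch: modeling memory and the data registers as dictionaries, and the register names `sum` and `result`, are assumptions made for illustration.

```python
def run_example(memory):
    """Execute the read instruction, then calculation instructions 1 and 2."""
    registers = {}

    # Read instruction: load x0, y0 and z0 from memory into data registers.
    for name in ("x0", "y0", "z0"):
        registers[name] = memory[name]

    # Calculation instruction 1: add x0 and y0.
    registers["sum"] = registers["x0"] + registers["y0"]

    # Calculation instruction 2: subtract z0 from x0+y0.
    registers["result"] = registers["sum"] - registers["z0"]
    return registers


registers = run_example({"x0": 5, "y0": 7, "z0": 2})
print(registers["result"])  # prints 10
```

  • Both calculation instructions read only from the registers, which is why they cannot run until the read instruction has supplied their parameters.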
  • the execution module 104 may implement the function of the instruction to be executed through an arithmetic logic unit (arithmetic and logic unit, ALU).
  • the reordering cache module 105 is used for reordering the execution results obtained by the execution module 104 executing different instructions when the execution module 104 fails to execute the instructions in the order among the multiple to-be-executed instructions.
  • the write-back module 106 is configured to write the execution result obtained by executing the instruction by the execution module 104 or the execution result after reordering by the reordering cache module back to the data register.
  • The instruction fetch module 101 can then fetch the address of the next to-be-executed instruction from the PC and start a new round of the loop.
  • processor instruction execution pipeline shown in FIG. 1 is only illustrative. According to implementation requirements, the processor instruction execution pipeline may include other modules or more modules, and FIG. 1 should not be regarded as a specific limitation.
  • Condition 1: The instruction has no dependencies between different loop iterations, that is, the parameters required by the instruction to perform the current operation are not the results of historical operations performed by the instruction.
  • For example, if the instruction is an addition instruction and the parameter it needs to perform the third addition operation is the result of the first addition operation or the result of the second addition operation, the instruction has dependencies between different loop iterations; otherwise, it has no such dependencies.
  • Because a SIMD instruction uses one execution unit to perform the same operation on each datum in a set of data at the same time, thereby achieving data-level parallel computing, an instruction that has dependencies between different loop iterations cannot be SIMDized with a SIMD instruction. Therefore, to determine whether an instruction is SIMDizable, it is necessary to determine whether the instruction has no dependencies between different loop iterations, that is, whether the instruction satisfies Condition 1.
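  • To make the idea of a dependency between loop iterations concrete, the sketch below contrasts two loops. The function names and the lane width of 4 are assumptions chosen for illustration; plain Python lists stand in for what a SIMD unit would process in hardware.

```python
def elementwise_add(a, b):
    """No dependency between iterations: iteration i uses only a[i] and b[i]."""
    return [x + y for x, y in zip(a, b)]


def running_sum(a):
    """Dependency between iterations: each step needs the previous result."""
    out = []
    total = 0
    for x in a:
        total = total + x  # parameter is the result of the previous iteration
        out.append(total)
    return out


def elementwise_add_in_lanes(a, b, width=4):
    """Process `width` elements per step, as a SIMD unit conceptually would.

    This regrouping is valid only because no iteration of the addition
    depends on the result of another iteration.
    """
    out = []
    for start in range(0, len(a), width):
        lane_a = a[start:start + width]
        lane_b = b[start:start + width]
        out.extend(x + y for x, y in zip(lane_a, lane_b))
    return out
```

  • The element-wise addition gives the same answer whether it is processed one element or several elements at a time, whereas `running_sum` cannot be regrouped this way because every step consumes the result of the step before it.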
  • Condition 2: In the loop operation executed by the instruction, the step size between the addresses accessed by every two adjacent operations is equal and not 0.
  • The address accessed when the instruction performs an operation can be understood as the location in memory of the parameters required for that operation, and the step size can be understood as the distance between the address the instruction accesses to perform the current operation and the address it accessed in the previous operation.
  • The step size can also be understood as the length of the parameters required by the instruction to perform the current operation, that is, the number of bytes the parameters occupy in memory.
  • Because a SIMD instruction is a single instruction, when the programmer or compiler replaces the SIMDizable code in the application program with the corresponding SIMD instruction from the SIMD instruction set, the SIMD instruction is given only a base address (that is, the address of the first datum) and the length of the set of data to be operated on, where every datum in the set has the same nonzero length. The SIMD instruction therefore does not know the addresses of the other data in the set, so the set of data must occupy contiguous positions in memory in order for the SIMD instruction to fetch the whole set from memory at once for calculation. If the data are not contiguous in memory, the SIMD instruction can obtain only the first datum at run time; without knowing where to find the other data in the set, it cannot perform the calculation in parallel.
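  • A check for Condition 2 over n sampled addresses might look like the following sketch. The function name and the choice to return the step size (or `None`) are assumptions; the source specifies only that the step between adjacent addresses must be equal and not 0, and that n is greater than 2.

```python
def constant_nonzero_stride(addresses):
    """Return the common step between adjacent addresses, or None.

    None means Condition 2 is not satisfied: either the steps differ or the
    step is 0. More than 2 addresses are required, matching "n is a natural
    number greater than 2" in the source.
    """
    if len(addresses) <= 2:
        raise ValueError("need n > 2 addresses")
    steps = [b - a for a, b in zip(addresses, addresses[1:])]
    first = steps[0]
    if first == 0 or any(step != first for step in steps):
        return None
    return first


stride = constant_nonzero_stride([0x1000, 0x1004, 0x1008, 0x100C])
```

  • In this example the four addresses are 4 bytes apart, so the function returns 4, consistent with a read instruction that walks an array of 4-byte elements.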
  • Condition 3: The instruction has a corresponding SIMD instruction in the SIMD instruction set.
  • Most calculation instructions, such as addition, subtraction, and multiplication instructions, have corresponding SIMD instructions in the SIMD instruction set.
  • The corresponding SIMD instructions in the SIMD instruction set can be used to SIMDize these calculation instructions. Control instructions and a few calculation instructions do not have corresponding SIMD instructions in the SIMD instruction set, so they cannot be SIMDized using the SIMD instructions in the SIMD instruction set.
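  • Condition 3 amounts to a table lookup. The sketch below uses a hypothetical table whose operation names and SIMD counterpart names are placeholders, not real mnemonics; an actual table would be derived from the target processor's SIMD instruction set.

```python
# Hypothetical table: scalar operation kind -> placeholder SIMD counterpart.
SIMD_COUNTERPARTS = {
    "add": "vector_add",
    "sub": "vector_sub",
    "mul": "vector_mul",
}


def has_simd_counterpart(operation, table=SIMD_COUNTERPARTS):
    """Return True if the operation has a corresponding SIMD instruction."""
    return operation in table


candidates = ["add", "branch", "mul"]
simdizable_kinds = [op for op in candidates if has_simd_counterpart(op)]
```

  • Here the control operation `branch` is filtered out, while the two calculation operations remain as candidates.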
  • Given the conditions that an instruction optimizable by SIMD technology must satisfy, it can be seen that, in an application program, only calculation instructions (most of them) have corresponding SIMD instructions in the SIMD instruction set, so only these instructions may be SIMDizable. Therefore, the method, apparatus, and related device provided by the embodiments of the present application analyze only the calculation instructions included in the application program while it is running and determine whether each is SIMDizable, without analyzing the instructions (such as control instructions) that cannot be SIMDized. Filtering out the instructions that cannot be SIMDized improves analysis efficiency, reduces the impact of the analysis process on application performance, and reduces analysis overhead.
  • After a read instruction executed by the processor is acquired, whether the calculation instruction satisfies Condition 1 is determined according to the dependency between the read instruction and the calculation instruction dependent on the read instruction. If the calculation instruction does not satisfy Condition 1, it is determined that the calculation instruction cannot be SIMDized. If the calculation instruction satisfies Condition 1, it can be further judged whether the calculation instruction satisfies Condition 2 and Condition 3, so as to determine whether the calculation instruction can be SIMDized.
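  • The order of checks described above, Condition 1 first and Conditions 2 and 3 only if it holds, can be sketched as a single function. The parameter names and the way each condition's input is represented are assumptions for illustration.

```python
def is_simdizable(has_dependency_cycle, addresses, operation, simd_operations):
    """Apply Conditions 1 to 3 in order, stopping at the first failure.

    has_dependency_cycle: result of the dependency analysis (Condition 1).
    addresses: addresses of consecutive read operations (Condition 2).
    operation: kind of the calculation instruction (Condition 3).
    simd_operations: operation kinds that have a SIMD counterpart (assumed input).
    """
    # Condition 1: no dependency between different loop iterations.
    if has_dependency_cycle:
        return False

    # Condition 2: the step between adjacent addresses is equal and not 0.
    steps = {b - a for a, b in zip(addresses, addresses[1:])}
    if len(steps) != 1 or 0 in steps:
        return False

    # Condition 3: a corresponding SIMD instruction exists.
    return operation in simd_operations


verdict = is_simdizable(False, [0x2000, 0x2008, 0x2010], "add", {"add", "sub", "mul"})
```

  • Checking Condition 1 first means the address and instruction-set checks are skipped entirely for instructions that are already known not to be SIMDizable.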
  • the following describes the process of acquiring the read instruction executed by the processor, and the process of determining whether the calculation instruction satisfies Condition 1 according to the dependency between the read instruction and the calculation instruction dependent on the read instruction.
  • When an application is running, the processor usually executes a large number of instructions, which usually include multiple read instructions, each corresponding to a unique PC value. The read instructions can therefore be distinguished from one another by PC value.
  • The read instruction that the program data level parallel analysis apparatus acquires from the processor may be any one or more of these read instructions, which is not specifically limited here.
  • The subsequent operations performed by the program data level parallel analysis device are similar for each read instruction. For simplicity, the following embodiments take one acquired read instruction as an example.
  • When the processor executes multiple read instructions, the program data level parallel analysis apparatus may acquire a read instruction executed by the processor by sampling. The sampling period of the apparatus may be set to a preset duration or to a preset number of instructions executed by the processor, which is not specifically limited here.
  • When the program data level parallel analysis device acquires a read instruction by sampling, if the read instructions executed by the processor in one sampling period include read instructions that are cyclically iterated, the read instruction with the highest loop iteration frequency (that is, the largest number of loop iterations in the current sampling period) has the greatest probability of being acquired.
  • The dependency between the read instruction and the calculation instruction dependent on the read instruction may be obtained from the waiting queue 103. A dependency graph is constructed according to this dependency, and whether the calculation instruction satisfies Condition 1 is then determined according to whether there is a dependency cycle in the dependency graph. Specifically, if there is no dependency cycle in the dependency graph, it is determined that the calculation instruction satisfies Condition 1; if there is a dependency cycle in the dependency graph, it is determined that the calculation instruction does not satisfy Condition 1, that is, the calculation instruction cannot be SIMDized.
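  • Deciding whether a dependency graph contains a dependency cycle is a standard directed-graph problem. The sketch below uses a depth-first search; the adjacency-dictionary representation and the function name are assumptions, since the source does not specify how the graph is stored or searched.

```python
def has_dependency_cycle(graph):
    """Return True if the directed graph contains a cycle.

    `graph` maps each node to an iterable of the nodes that depend on it.
    Nodes on the current search path are GRAY; reaching a GRAY node again
    means the path has looped back on itself.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for successor in graph.get(node, ()):
            state = color.get(successor, WHITE)
            if state == GRAY:
                return True
            if state == WHITE and visit(successor):
                return True
        color[node] = BLACK
        return False

    return any(color.get(node, WHITE) == WHITE and visit(node) for node in list(graph))


# Read instruction B feeds calculation instruction C; nothing feeds back.
satisfies_condition_1 = not has_dependency_cycle({"B": ["C"], "C": []})
```

  • The recursive search is adequate for the small graphs built from a few sampled iterations; a very large graph would call for an iterative traversal to avoid Python's recursion limit.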
  • The dependency graph is a graph that reflects the dependency relationships between instructions.
  • In the embodiments of the present application, the dependency graph reflects the dependency between a read instruction and a calculation instruction that depends on the read instruction.
  • A dependency cycle arises as follows: if an instruction can be iterated in a loop and, in each iteration, its execution depends on the execution result of the previous iteration, the different loop iterations of the instruction form a dependency cycle.
  • For example, suppose the dependency between m loop iterations of instruction A is as follows: instruction A performs the first operation and obtains the first result; instruction A then uses the first result as the parameter required for the second operation and performs the second operation to obtain the second result; instruction A then uses the second result as the parameter required for the third operation and performs the third operation to obtain the third result; and so on, until the m loop iterations are completed.
  • FIG. 2 shows the dependency graph built from the above dependency between the m loop iterations of instruction A. As can be seen from FIG. 2, the dependency graph includes instruction A executing the first operation, instruction A executing the second operation, ..., and instruction A executing the m-th operation, connected in a ring.
  • This ring is a dependency cycle.
  • When the read instruction can be iterated cyclically, the calculation instruction dependent on the read instruction can also be iterated cyclically, and the number of loop iterations of the calculation instruction is the same as that of the read instruction.
  • A read instruction usually iterates at least 2 times, and may iterate a thousand times, ten thousand times, or even more. If the dependency graph were constructed from the dependency between the read instruction and the dependent calculation instruction in all loop iterations, constructing the graph would take more time and incur considerable overhead.
  • If an instruction can be iterated in a loop, the dependency graph constructed from the dependency between some of its loop iterations reflects the dependency between all of its loop iterations. Therefore, in the specific embodiments of the present application, when the read instruction can be iterated in a loop, a dependency graph can be constructed according to the dependency between the read instruction and the dependent calculation instruction in some loop iterations, and whether the calculation instruction satisfies Condition 1 can then be determined from that graph, which reduces the time and overhead of constructing the dependency graph.
  • the program data level parallel analysis apparatus may acquire the dependency relationship between the read instructions included in the waiting queue 103 and the calculation instructions dependent on the read instructions in some iterations in one sampling period.
  • For example, suppose the read instruction B acquired by the program data-level parallel analysis apparatus iterates 3 times in total, the calculation instruction C depends on the read instruction B, and the dependencies between the read instruction B and the calculation instruction C in the 3 loop iterations are as follows:
  • B1 executes the first read operation to obtain the first parameter and passes it to C1; C1 executes the first calculation operation to obtain the first calculation result and writes it to the data register. B2 executes the second read operation to obtain the second parameter and passes it to C2; C2 executes the second calculation operation to obtain the second calculation result and writes it to the data register. B3 executes the third read operation to obtain the third parameter and passes it to C3; C3 executes the third calculation operation to obtain the third calculation result and writes it to the data register. The second parameter is not the first calculation result, and the third parameter is neither the first calculation result nor the second calculation result.
  • Here, B1, B2, and B3 denote the read instruction B executing the first, second, and third read operations, respectively, and C1, C2, and C3 denote the calculation instruction C executing the first, second, and third calculation operations, respectively.
  • FIG. 3 is a dependency graph constructed according to the dependencies between the read instruction B and the calculation instruction C in the 3 loop iterations in the above example.
  • In this dependency graph, B1 and C1 are connected in series into one line, B2 and C2 are connected in series into one line, and B3 and C3 are connected in series into one line, so there is no dependency cycle in the dependency graph. In this case, it can be determined that the calculation instruction satisfies condition 1.
  • In another example, B1 executes the first read operation to obtain the first parameter and passes it to C1; C1 executes the first calculation operation to obtain the first calculation result and passes it to B2; B2 passes the first calculation result to C2 as the second parameter; C2 executes the second calculation operation to obtain the second calculation result and passes it to B3; B3 passes the second calculation result to C3 as the third parameter; and C3 executes the third calculation operation to obtain the third calculation result.
  • FIG. 4 is a dependency graph constructed according to the dependencies between the read instruction B and the calculation instruction C in the 3 loop iterations in the above example.
  • In this dependency graph, B1, C1, B2, C2, B3, and C3 are linked to one another by dependencies, so a dependency cycle exists in the dependency graph. In this case, it can be determined that the calculation instruction does not satisfy condition 1, that is, the calculation instruction cannot be SIMDized.
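  • The two examples (FIG. 3 and FIG. 4) can be sketched in Python as follows. This is only an illustrative sketch: the node naming, the `carried` parameter, and the rule that any edge linking nodes of different iterations counts as a "dependency cycle" are assumptions made here, not details given by the application.

```python
from collections import defaultdict


def build_dependency_graph(iterations, carried=()):
    """Build a dependency graph over the sampled loop iterations.

    iterations: number of sampled iterations (3 in the examples).
    carried: pairs (i, j) meaning the result of C_i is the parameter
             forwarded by B_j (hypothetical encoding of a dependency
             between different iterations).
    """
    graph = defaultdict(set)
    for i in range(1, iterations + 1):
        graph[f"B{i}"].add(f"C{i}")   # C_i consumes the parameter read by B_i
    for i, j in carried:
        graph[f"C{i}"].add(f"B{j}")   # B_j forwards the result of C_i
    return graph


def satisfies_condition_1(graph):
    """Assumption: condition 1 holds when no edge links two different iterations."""
    return not any(
        src[1:] != dst[1:]            # compare the iteration suffix of both nodes
        for src, dsts in graph.items()
        for dst in dsts
    )


fig3 = build_dependency_graph(3)                             # B1-C1, B2-C2, B3-C3
fig4 = build_dependency_graph(3, carried=[(1, 2), (2, 3)])   # one chained line
```

  • With these inputs, `satisfies_condition_1(fig3)` is true and `satisfies_condition_1(fig4)` is false, matching the conclusions stated for the two figures.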
  • The n addresses corresponding to n consecutive read operations performed by the read instruction can be obtained, and whether the calculation instruction that depends on the read instruction satisfies condition 2 can then be determined according to the n addresses. If the calculation instruction does not satisfy condition 2, it is determined that the calculation instruction cannot be SIMDized. If the calculation instruction satisfies condition 2, whether it satisfies condition 1 and condition 3 can be further determined, so as to determine whether the calculation instruction can be SIMDized.
  • n is a natural number greater than 2.
  • the following describes the process of acquiring n addresses corresponding to n consecutive read operations performed by the read instruction, and the process of determining whether the calculation instruction satisfies Condition 2 according to the n addresses.
  • the value of n can be preset by the user or set before each acquisition of n addresses corresponding to n consecutive read operations performed by a read instruction.
  • The program data-level parallel analysis apparatus can provide the interface shown in FIG. 5 to the user. In the interface shown in FIG. 5, n is 3 by default, and the user can enter a specific number through an input device such as a keyboard or a touch screen to preset n. This embodiment of the application does not specifically limit the value of n.
  • the process includes but is not limited to the following steps:
  • The larger the value of n, the higher the accuracy of determining whether the calculation instruction satisfies condition 2 according to the n addresses, but the greater the analysis overhead; the smaller the value of n, the lower that accuracy, but the smaller the analysis overhead.
  • Whether n is 3, 5, 8, or any other value, the method of determining whether the calculation instruction satisfies condition 2 according to the n addresses is similar. Taking n = 3 as an example, the process of calculating the step size between every two adjacent addresses among the n addresses is described in detail below, and the process may include the following steps:
  • The first address is the address corresponding to the first of the three consecutive read operations performed by the read instruction, and the second address is the address corresponding to the second of those read operations.
  • The first step size can be calculated according to the second address and the first address. Specifically, after the first address is obtained, it is recorded in the pre-configured collection table; when the second address is obtained, the first address is looked up in the collection table, and the difference between the second address and the first address is determined as the first step size.
  • S1012: Calculate the second step size according to the third address, the first address, and the first step size among the three addresses.
  • So that the second step size can be calculated from the third address, the first address, and the first step size when the program data-level parallel analysis apparatus obtains the third address, the first step size can be recorded in the collection table after it is obtained. When the third address is obtained, the first address and the first step size are looked up in the collection table, and the difference between the third address and the sum of the first address and the first step size is determined as the second step size.
  • Alternatively, the program data-level parallel analysis apparatus can record the second address in the collection table after obtaining it, look up the second address in the collection table when obtaining the third address, and determine the difference between the third address and the second address as the second step size.
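  • The two ways of obtaining the second step size give the same value. Writing a_1, a_2, a_3 for the first, second, and third addresses and s_1, s_2 for the first and second step sizes (symbols introduced here only for illustration):

```latex
\begin{aligned}
s_1 &= a_2 - a_1, \\
s_2 &= a_3 - (a_1 + s_1) = a_3 - a_2 .
\end{aligned}
```

  • The first form needs only the first address and the first step size to be kept in the collection table, while the second form needs the second address to be kept instead.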
  • The program data-level parallel analysis apparatus may acquire different read instructions for analysis in different sampling periods. To distinguish the first addresses and first step sizes corresponding to different read instructions, in each sampling period, after acquiring the read instruction executed by the processor, the apparatus can allocate an entry for that read instruction in the collection table and record the address of the read instruction in the entry. After obtaining the first address and the first step size corresponding to the read instruction, the apparatus records them in the entry corresponding to the read instruction.
  • Table 1 is an exemplary collection table provided by this embodiment of the application. As shown in Table 1, the collection table includes entries such as entry1, entry2, and entry3. Each entry can include a read-instruction address column, a base address column, a step column, and so on. The read-instruction address column records the address of the read instruction, the base address column records the first address, and the step column records the first step size.
  • Table 2 is an exemplary entry1 provided by this embodiment of the application. Assuming that the entry allocated by the program data-level parallel analysis apparatus for the read instruction is entry1, the address of the read instruction is A', the first address is B', and the first step size is C', the apparatus can record A', B', and C' in entry1 as shown in Table 2.
  • Table 1 and Table 2 are only examples and should not be regarded as specific limitations on the collection table and its entries.
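  • A collection table with the three columns described above could be sketched in Python as follows for n = 3. The class names, the `record` method, its return convention, and the choice to delete the entry once a decision is reached are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    instr_addr: int                   # address of the read instruction
    base_addr: Optional[int] = None   # first address accessed by the instruction
    step: Optional[int] = None        # first step size


class CollectionTable:
    """Hypothetical collection table keyed by read-instruction address."""

    def __init__(self) -> None:
        self.entries: Dict[int, Entry] = {}

    def record(self, instr_addr: int, access_addr: int) -> Optional[bool]:
        """Record one sampled access of a read instruction.

        Returns None until three addresses have been seen, then True or
        False for "both step sizes are equal and not 0".
        """
        entry = self.entries.setdefault(instr_addr, Entry(instr_addr))
        if entry.base_addr is None:          # first address: store as base
            entry.base_addr = access_addr
            return None
        if entry.step is None:               # second address: first step size
            entry.step = access_addr - entry.base_addr
            return None
        # third address: second step = third - (first + first step)
        second_step = access_addr - (entry.base_addr + entry.step)
        result = second_step == entry.step and entry.step != 0
        del self.entries[instr_addr]         # assumption: free the entry afterwards
        return result
```

  • For instance, recording the addresses 0x1000, 0x1008, and 0x1010 for the same read instruction yields None, None, and then True, because both step sizes are 8.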
  • S102: Determine whether the step sizes between every two adjacent addresses are equal and not 0. If the step sizes between every two adjacent addresses are equal and not 0, perform S103; if they are not equal and/or are 0, perform S104.
  • The program data-level parallel analysis apparatus can determine whether the second step size is equal to the first step size and whether neither step size is 0. If the second step size is equal to the first step size and neither the first step size nor the second step size is 0, perform S103; if the second step size is not equal to the first step size, and/or the first step size and/or the second step size is 0, perform S104.
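  • For a general n, the check in S102 can be sketched as below. The function names are hypothetical, and the sketch only covers the equal-and-non-zero test, not what S103 and S104 do afterwards.

```python
def adjacent_steps(addresses):
    """Step size between every two adjacent addresses."""
    return [later - earlier for earlier, later in zip(addresses, addresses[1:])]


def satisfies_condition_2(addresses):
    """True when all adjacent step sizes are equal and not 0."""
    if len(addresses) < 3:
        raise ValueError("n must be a natural number greater than 2")
    steps = adjacent_steps(addresses)
    return steps[0] != 0 and all(step == steps[0] for step in steps)
```

  • Under this sketch, the addresses 0x1000, 0x1004, 0x1008 pass the check, while 0x1000, 0x1004, 0x100C (unequal steps) and three identical addresses (zero step) do not.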
  • The program data-level parallel analysis apparatus can delete the entry corresponding to the read instruction from the collection table to reduce resource usage and save storage space.
  • the program data level parallel analysis apparatus can determine whether the calculation instruction dependent on the read instruction satisfies condition 3 according to the SIMD instruction set.
  • The program data-level parallel analysis apparatus may include a preconfigured SIMD instruction set. The instruction set may be stored in the apparatus in the form of instructions, with a correspondence pre-established between each SIMD instruction in the SIMD instruction set and its corresponding calculation instruction.
  • The program data-level parallel analysis apparatus can check whether the SIMD instruction set contains a SIMD instruction that corresponds to the calculation instruction. If such a SIMD instruction exists, it is determined that the calculation instruction satisfies condition 3; if no such SIMD instruction exists, it is determined that the calculation instruction does not satisfy condition 3. It can be understood that if the calculation instruction does not satisfy condition 3, it is determined that the calculation instruction cannot be SIMDized.
  • the SIMD instruction set may be stored in the form of a SIMD instruction table, and the SIMD instruction table may include keywords corresponding to each SIMD instruction in the SIMD instruction set.
  • For example, if a SIMD instruction can SIMDize an addition instruction, the keywords corresponding to that SIMD instruction in the SIMD instruction table may be "add" and/or "+", and so on.
  • When the calculation instruction is an addition instruction, the program data-level parallel analysis apparatus can extract the keywords "add" and/or "+" contained in the addition instruction. When it finds the keywords "add" and/or "+" in the SIMD instruction table, it determines that the calculation instruction satisfies condition 3; otherwise, it determines that the calculation instruction does not satisfy condition 3.
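  • The keyword lookup can be sketched as follows. The table contents, the SIMD instruction names, and the whitespace-based keyword extraction are hypothetical; a real implementation would work on decoded instructions rather than on text.

```python
# Hypothetical SIMD instruction table: SIMD instruction name -> keywords.
SIMD_INSTRUCTION_TABLE = {
    "vector_add": {"add", "+"},
    "vector_mul": {"mul", "*"},
}


def extract_keywords(instruction_text):
    """Split an instruction's text into lowercase tokens (simplistic tokenizer)."""
    return set(instruction_text.lower().replace(",", " ").split())


def satisfies_condition_3(instruction_text, table=SIMD_INSTRUCTION_TABLE):
    """True when any keyword of the instruction appears in the SIMD instruction table."""
    keywords = extract_keywords(instruction_text)
    return any(keywords & table_keywords for table_keywords in table.values())
```

  • With this table, the text "add r0, r1, r2" satisfies condition 3 because "add" is listed, whereas "div r0, r1, r2" does not.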
  • The program data-level parallel analysis apparatus can determine, in the same sampling period, whether the calculation instruction satisfies Condition 1, Condition 2, and Condition 3. This embodiment of the application does not limit the order in which the apparatus determines whether the calculation instruction satisfies Condition 1, Condition 2, and Condition 3.
  • When the program data-level parallel analysis apparatus determines, in one sampling period, that a calculation instruction satisfies Condition 1, Condition 2, and Condition 3 simultaneously, the apparatus can determine that the calculation instruction can be SIMDized.
  • FIG. 7 is a schematic flowchart of a program data level parallel analysis method provided by an embodiment of the present application. The method can be applied to processors that support SIMD technology such as CPU, GPU, and DSP, and is not specifically limited here.
  • the method includes but is not limited to the following steps:
  • Before whether the calculation instruction can be SIMDized is determined according to the dependency between the read instruction and the calculation instruction, the n addresses corresponding to n consecutive read operations performed by the read instruction may also be obtained.
  • After the n addresses are obtained, whether the calculation instruction can be SIMDized is determined according to the dependency between the read instruction and the calculation instruction and the n addresses, where n is a natural number greater than 2.
  • The program data-level parallel analysis apparatus may include a preconfigured SIMD instruction set and, while determining whether the calculation instruction can be SIMDized according to the dependency between the read instruction and the calculation instruction and the n addresses, further determine according to the SIMD instruction set whether the calculation instruction can be SIMDized.
  • the specific process of determining whether the calculation instruction can be SIMDized includes:
  • The embodiment shown in FIG. 7 can serve as one sampling analysis pass performed by the program data-level parallel analysis apparatus while the application is running. When the apparatus determines for the first time that a calculation instruction can be SIMDized, the number of times the calculation instruction has been determined to be SIMDizable can be set to 1. If the apparatus also determines in subsequent sampling periods that the calculation instruction can be SIMDized, that number is accumulated.
  • If the apparatus determines in a subsequent sampling period that the calculation instruction cannot be SIMDized, the number of times the calculation instruction has been determined to be SIMDizable can be decremented by 1 each time, or the calculation instruction can be marked as a non-SIMDizable instruction so that it is no longer analyzed in subsequent sampling periods.
  • When the number of times the calculation instruction has been determined to be SIMDizable reaches a preset threshold, prompt information may be generated. The prompt information can be presented or sent to the user, so that the user learns as soon as possible that the application contains a SIMDizable code segment and can refer to the prompt information to optimize that code segment.
  • The prompt information generated by the program data-level parallel analysis apparatus may highlight, on the interface, the address of the calculation instruction and the number of times the calculation instruction has been determined to be SIMDizable, or it may be a message such as a text message or an email, sent to the user, that includes the address of the calculation instruction, the number of times it has been determined to be SIMDizable, and other information; this is not specifically limited here. The sampling period can be set as a preset duration or as a number of instructions executed by the processor, which is not specifically limited here.
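  • The counting and prompting behaviour across sampling periods can be sketched as follows. The class name, the `exclude_on_failure` flag, the clamping of the count at 0, and the prompt wording are assumptions made for illustration.

```python
class SimdCandidateCounter:
    """Hypothetical per-instruction counter updated once per sampling period."""

    def __init__(self, threshold):
        self.threshold = threshold   # preset threshold for generating a prompt
        self.counts = {}             # calculation-instruction address -> count
        self.excluded = set()        # instructions marked as non-SIMDizable

    def update(self, calc_addr, simdizable, exclude_on_failure=False):
        """Record one result; return prompt text when the threshold is reached."""
        if calc_addr in self.excluded:
            return None              # no longer analyzed in later periods
        if simdizable:
            self.counts[calc_addr] = self.counts.get(calc_addr, 0) + 1
            if self.counts[calc_addr] == self.threshold:
                return (f"calculation instruction at {calc_addr:#x} was "
                        f"determined SIMDizable {self.threshold} times")
        elif exclude_on_failure:
            self.excluded.add(calc_addr)      # mark as non-SIMDizable
            self.counts.pop(calc_addr, None)
        else:
            # decrement by 1; clamping at 0 is an assumption of this sketch
            self.counts[calc_addr] = max(0, self.counts.get(calc_addr, 0) - 1)
        return None
```

  • The returned text stands in for the prompt information; how it is highlighted on an interface or sent as a message is left open, as in the description.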
  • This embodiment does not describe in detail the specific process of acquiring the read instruction executed by the processor, or the specific processes of determining whether the calculation instruction can be SIMDized according to the dependency between the read instruction and the calculation instruction that depends on it, according to the n addresses, and according to the SIMD instruction set. For details, refer to the related descriptions above, which are not repeated here.
  • The program data-level parallel analysis method provided by the embodiments of this application can acquire the read instruction executed by the processor while the application is running, and then determine whether the calculation instruction can be SIMDized according to the dependency between the read instruction and the calculation instruction that depends on it. The method can quickly identify SIMDizable code in the application, improve the efficiency of finding SIMDizable code, and save labor and time.
  • In addition, on the basis of determining whether the calculation instruction can be SIMDized according to the dependency between the read instruction and the calculation instruction, the method can further determine whether the calculation instruction can be SIMDized according to the n addresses corresponding to n consecutive read operations performed by the read instruction, which further improves the accuracy of the determination.
  • The embodiment shown in FIG. 7 can serve as one sampling analysis pass performed by the program data-level parallel analysis apparatus while the application is running. In each sampling period, the apparatus can analyze only calculation instructions to determine whether they can be SIMDized, without analyzing the instructions in the application that cannot possibly be SIMDized, which reduces the impact of the analysis on application performance and reduces analysis overhead.
  • The program data-level parallel analysis method is described in detail above. Based on the same inventive concept, the program data-level parallel analysis apparatus of the embodiments of this application is described below.
  • The program data-level parallel analysis apparatus provided by this application can be applied to processors that support SIMD technology, such as CPUs, GPUs, and DSPs, which is not specifically limited here.
  • FIG. 8 is a schematic structural diagram of a program data-level parallel analysis device 100 provided by an embodiment of the present application.
  • The apparatus 100 includes an acquisition module 110, a determination module 120, and a prompt module 130, where:
  • the acquisition module 110 is configured to acquire the read instruction executed by the processor, where the read instruction is used to acquire the parameters required by the calculation instruction;
  • the determination module 120 is configured to determine whether the calculation instruction can be SIMDized according to the dependency between the read instruction and the calculation instruction.
  • the acquisition module 110 is further configured to acquire the n addresses corresponding to n consecutive read operations performed by the read instruction, where n is a natural number greater than 2;
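  • the determination module 120 is specifically configured to determine whether the calculation instruction can be SIMDized according to the dependency between the read instruction and the calculation instruction and the n addresses.
  • the determination module 120 is specifically configured to determine whether the calculation instruction can be SIMDized according to the dependency between the read instruction and the calculation instruction, the n addresses, and the SIMD instruction set.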
  • the program data level parallel analysis apparatus 100 further includes: a prompt module 130;
  • the acquisition module 110 is further configured to acquire, when the read instruction executed by the processor is acquired multiple times, the number of times that the calculation instruction has been determined to be SIMDizable;
  • the prompt module 130 is configured to generate prompt information when the number of times reaches a preset threshold.
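  • the determination module 120 is specifically configured to: calculate the step size between every two adjacent addresses among the n addresses; and determine whether the calculation instruction can be SIMDized according to whether the step sizes between every two adjacent addresses are equal and not 0, and according to the dependency between the read instruction and the calculation instruction.
  • the determination module 120 is specifically configured to: construct a dependency graph according to the dependency between the read instruction and the calculation instruction, where the dependency graph reflects the dependency between the read instruction and the calculation instruction; and determine whether the calculation instruction can be SIMDized according to whether a dependency cycle exists in the dependency graph.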
  • The program data-level parallel analysis apparatus 100 is only an example provided by the embodiments of this application. The apparatus 100 may have more or fewer components than those shown in FIG. 8, may combine two or more components, or may be implemented with a different configuration of components.
  • FIG. 9 is a schematic structural diagram of a computer device 200 provided by an embodiment of the present application.
  • The computer device 200 includes a processor 210, a memory 220, and a communication interface 230, where the processor 210, the memory 220, and the communication interface 230 can be connected to each other through a bus 240.
  • the processor 210 can read the program code stored in the memory 220, and cooperate with the communication interface 230 to execute some or all of the steps of the method executed by the program data level parallel analysis apparatus 100 in the above embodiments of the present application.
  • The processor 210 may have various specific implementation forms. For example, the processor 210 may be a CPU or a GPU, and the processor 210 may also be a single-core processor or a multi-core processor.
  • the processor 210 may be a combination of a CPU and a hardware chip.
  • the above-mentioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • The above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • the processor 210 can also be independently implemented by a logic device with built-in processing logic, such as an FPGA or a DSP.
  • the memory 220 may store program codes and program data.
  • The program code includes the code of the acquisition module 110, the code of the determination module 120, the code of the prompt module 130, and so on.
  • The program data includes the read instruction, the dependency between the read instruction and the calculation instruction, the n addresses corresponding to the n consecutive read operations performed by the read instruction, the first step size, and so on.
  • The memory 220 may be a non-volatile memory, such as a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The memory 220 may also be a volatile memory, which may be a random access memory (RAM) used as an external cache.
  • The communication interface 230 may be a wired interface (for example, an Ethernet interface) or a wireless interface (for example, a cellular network interface or a wireless local area network interface) for communicating with other computing nodes or devices.
  • The communication interface 230 may use a protocol family on top of the transmission control protocol/internet protocol (TCP/IP), for example, the remote function call (RFC) protocol, the simple object access protocol (SOAP), the simple network management protocol (SNMP), the common object request broker architecture (CORBA) protocol, distributed protocols, and so on.
  • The bus 240 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • the bus 240 can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in FIG. 9, but it does not mean that there is only one bus or one type of bus.
  • the foregoing computer device 200 is configured to execute the methods performed in the foregoing method embodiments, and belongs to the same concept as the foregoing method embodiments, and the specific implementation process thereof is detailed in the foregoing method embodiments, which will not be repeated here.
  • The computer device 200 is only an example provided by the embodiments of this application. The computer device 200 may have more or fewer components than those shown in FIG. 9, may combine two or more components, or may be implemented with a different configuration of components.
  • Embodiments of this application also provide a non-transitory computer-readable storage medium that stores instructions. When the instructions are run on a processor, some or all of the steps of the program data-level parallel analysis method described in the foregoing embodiments can be implemented.
  • Embodiments of the present application further provide a computer program product, which, when the computer program product is read and executed by a computer, can implement some or all of the steps of the program data level parallel analysis method described in the above method embodiments.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless means (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes an integration of one or more available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media, or semiconductor media, and the like.
  • the steps in the method of the embodiment of the present application may be sequentially adjusted, combined or deleted according to actual needs; the units in the device of the embodiment of the present application may be divided, combined or deleted according to actual needs.

Abstract

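This application provides a program data-level parallel analysis method. The method includes: first acquiring a read instruction executed by a processor, and then determining, according to the dependency between the read instruction and a calculation instruction, whether the calculation instruction can be SIMDized, where the read instruction is used to acquire the parameters required by the calculation instruction. The method can quickly identify SIMDizable code in an application, improves the efficiency of finding SIMDizable code, and saves labor and time.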

Description

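Program data-level parallel analysis method, apparatus, and related device
This application claims priority to Chinese Patent Application No. 202110131344.6, filed with the China Patent Office on January 30, 2021 and entitled "Program data-level parallel analysis method, apparatus, and related device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computer technologies, and in particular to a program data-level parallel analysis method, apparatus, and related device.
Background
Single instruction multiple data (SIMD) is a technique in which one execution unit performs the same operation simultaneously on each item in a group of data (also called a "data vector"), thereby achieving data-level parallel computing. At present, almost all processors integrate components that include a SIMD instruction set to improve the parallel capability of applications and thereby improve processor execution performance. However, to apply SIMD optimization in an application, the SIMDizable code in the application must first be found and then optimized using the corresponding SIMD instructions in the SIMD instruction set.
How to improve the efficiency of finding SIMDizable code in an application is therefore a technical problem that those skilled in the art urgently need to solve.
Summary
This application provides a program data-level parallel analysis method, apparatus, and related device, which can improve the efficiency of finding SIMDizable code in an application and save labor and time.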
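In a first aspect, a program data-level parallel analysis method is provided, the method including:
acquiring a read instruction executed by a processor, where the read instruction is used to acquire the parameters required by a calculation instruction; and
determining, according to the dependency between the read instruction and the calculation instruction, whether the calculation instruction can be SIMDized, where the calculation instruction depends on the read instruction.
It can be seen that the program data-level parallel analysis method provided by the embodiments of this application can, while an application is running, acquire a read instruction executed by the processor and then determine whether a calculation instruction can be SIMDized according to the dependency between the read instruction and the calculation instruction that relies on the read instruction to acquire its required parameters. The method can quickly identify SIMDizable code in the application, improves the efficiency of finding SIMDizable code, and saves labor and time.
In addition, the above method can serve as one sampling analysis pass performed by the program data-level parallel analysis apparatus while the application is running. In each sampling period, the apparatus can analyze only calculation instructions to determine whether they can be SIMDized, without analyzing the instructions in the application that cannot possibly be SIMDized, which reduces the impact of the analysis on application performance and reduces analysis overhead.
In a possible implementation, before the determining, according to the dependency between the read instruction and the calculation instruction, whether the calculation instruction can be SIMDized, the method further includes:
acquiring n addresses corresponding to n consecutive read operations performed by the read instruction, where n is a natural number greater than 2;
and the determining, according to the dependency between the read instruction and the calculation instruction, whether the calculation instruction can be SIMDized includes:
determining, according to the dependency between the read instruction and the calculation instruction and the n addresses, whether the calculation instruction can be SIMDized.
It can be seen that, on the basis of determining whether the calculation instruction can be SIMDized according to the dependency between the read instruction and the calculation instruction, the above method can acquire the n addresses corresponding to n consecutive read operations performed by the read instruction and further determine, according to the n addresses, whether the calculation instruction can be SIMDized, which improves the accuracy of the determination.
In a possible implementation, the determining, according to the dependency between the read instruction and the calculation instruction and the n addresses, whether the calculation instruction can be SIMDized includes:
determining, according to the dependency between the read instruction and the calculation instruction, the n addresses, and a SIMD instruction set, whether the calculation instruction can be SIMDized.
It can be seen that, on the basis of determining whether the calculation instruction can be SIMDized according to the dependency between the read instruction and the calculation instruction and the n addresses, the above method can further determine, according to the SIMD instruction set, whether the calculation instruction can be SIMDized, which further improves the accuracy of the determination.
In a possible implementation, the method further includes:
when the read instruction executed by the processor is acquired multiple times, acquiring the number of times that the calculation instruction has been determined to be SIMDizable; and
generating prompt information when the number of times reaches a preset threshold.
It can be seen that the above method can generate prompt information when the acquired number of times that the calculation instruction has been determined to be SIMDizable reaches the preset threshold. The prompt information can be presented or sent to the user, so that the user learns as soon as possible that the application contains a SIMDizable code segment and can refer to the prompt information to optimize that code segment.
In a possible implementation, the determining, according to the dependency between the read instruction and the calculation instruction and the n addresses, whether the calculation instruction can be SIMDized includes:
calculating the step size between every two adjacent addresses among the n addresses; and
determining whether the calculation instruction can be SIMDized according to whether the step sizes between every two adjacent addresses are equal and not 0, and according to the dependency between the read instruction and the calculation instruction.
In a possible implementation, the determining, according to the dependency between the read instruction and the calculation instruction, whether the calculation instruction can be SIMDized includes:
constructing a dependency graph according to the dependency between the read instruction and the calculation instruction, where the dependency graph reflects the dependency between the read instruction and the calculation instruction; and
determining whether the calculation instruction can be SIMDized according to whether a dependency cycle exists in the dependency graph.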
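In a second aspect, a program data-level parallel analysis apparatus is provided, the apparatus including:
an acquisition module, configured to acquire a read instruction executed by a processor, where the read instruction is used to acquire the parameters required by a calculation instruction; and
a determination module, configured to determine, according to the dependency between the read instruction and the calculation instruction, whether the calculation instruction can be SIMDized.
In a possible implementation, before whether the calculation instruction can be SIMDized is determined according to the dependency between the read instruction and the calculation instruction,
the acquisition module is further configured to acquire n addresses corresponding to n consecutive read operations performed by the read instruction, where n is a natural number greater than 2;
and the determination module is specifically configured to:
determine, according to the dependency between the read instruction and the calculation instruction and the n addresses, whether the calculation instruction can be SIMDized.
In a possible implementation, the determination module is specifically configured to:
determine, according to the dependency between the read instruction and the calculation instruction, the n addresses, and a SIMD instruction set, whether the calculation instruction can be SIMDized.
In a possible implementation, the apparatus further includes a prompt module;
the acquisition module is further configured to acquire, when the read instruction executed by the processor is acquired multiple times, the number of times that the calculation instruction has been determined to be SIMDizable; and
the prompt module is configured to generate prompt information when the number of times reaches a preset threshold.
In a possible implementation, the determination module is specifically configured to:
calculate the step size between every two adjacent addresses among the n addresses; and
determine whether the calculation instruction can be SIMDized according to whether the step sizes between every two adjacent addresses are equal and not 0, and according to the dependency between the read instruction and the calculation instruction.
In a possible implementation, the determination module is specifically configured to:
construct a dependency graph according to the dependency between the read instruction and the calculation instruction, where the dependency graph reflects the dependency between the read instruction and the calculation instruction; and
determine whether the calculation instruction can be SIMDized according to whether a dependency cycle exists in the dependency graph.
In a third aspect, a computer device is provided. The computer device includes a processor and a memory; the memory is configured to store instructions, and the processor is configured to execute the instructions to implement the method described in the first aspect or any specific implementation of the first aspect.
In a fourth aspect, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable medium stores computer-readable instructions that, when run, perform the method described in the first aspect or any specific implementation of the first aspect.
In a fifth aspect, a computer program product is provided, including a computer program that, when read and executed by a computing device, causes the computing device to perform the method described in the first aspect or any specific implementation of the first aspect.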
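Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a processor instruction execution pipeline involved in an embodiment of this application;
FIG. 2 is a schematic diagram of a dependency graph provided by an embodiment of this application;
FIG. 3 is a schematic diagram of a dependency graph constructed according to the dependency between a read instruction and a calculation instruction, provided by an embodiment of this application;
FIG. 4 is a schematic diagram of another dependency graph constructed according to the dependency between a read instruction and a calculation instruction, provided by an embodiment of this application;
FIG. 5 is a schematic diagram of a user interface provided by a program data-level parallel analysis apparatus according to an embodiment of this application;
FIG. 6 is a schematic flowchart of determining, according to n addresses, whether a calculation instruction satisfies condition 2, provided by an embodiment of this application;
FIG. 7 is a schematic flowchart of a program data-level parallel analysis method provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of a program data-level parallel analysis apparatus provided by an embodiment of this application;
FIG. 9 is a schematic structural diagram of a computer device provided by an embodiment of this application.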
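Detailed Description
The technical solutions in this application are described below with reference to the accompanying drawings.
The application scenarios involved in the embodiments of this application are first briefly described.
With the rapid development of software technology, applications are becoming increasingly large. For example, the Linux kernel has reached 27 million lines of code, and the open-source database MySQL has reached the order of millions of lines. The large increase in application code size brings new challenges to program performance optimization.
Using SIMD technology to optimize application performance is currently the most common approach to application performance optimization. To support SIMD optimization in a processor, processor developers integrate components that include a SIMD instruction set into the processor, such as components integrating the x86 AVX instruction set or components integrating the advanced RISC machines (ARM) NEON instruction set. However, before SIMD technology can be used for SIMD optimization, a programmer or compiler must first find the SIMDizable (also called parallelizable) code in the application and then replace it with the corresponding SIMD instructions in the SIMD instruction set to achieve data-level parallelism.
The prior art mainly uses a compiler to scan the static source code of an application to find the more obvious SIMDizable code, but this approach cannot identify certain implicit SIMDizable code in the application, cannot achieve optimal performance, and is inefficient. Alternatively, tools such as Linux performance analysis tools (for example, perf) and Intel performance analysis tools (for example, VTune) are used to identify the key code regions that cause performance problems in the application; after developers manually modify the code in those key regions for performance optimization, performance analysts read and analyze the code in the key regions to identify SIMDizable code. However, when the application is large, this approach consumes a great deal of labor and time and is inefficient.
To address the above problems, the embodiments of this application provide a program data-level parallel analysis method, apparatus, and related device.
To facilitate understanding of the program data-level parallel analysis method, apparatus, and related device provided by the embodiments of this application, the following first explains the concepts involved: memory access instructions, calculation instructions, the processor instruction execution pipeline, and the conditions that an instruction must satisfy to be optimized with SIMD technology.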
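By function, instructions are mainly divided into control instructions, memory access instructions, and calculation instructions, where:
Control instructions generally refer to branch instructions, that is, instructions that do not execute in the statement order of the program. Branch instructions make it possible to transfer control between branches of a program.
Memory access instructions include read (load) instructions and write (store) instructions. Memory access instructions can access memory directly and are used to transfer data between memory and data registers. Specifically, a read instruction loads data from memory into a data register, and a write instruction writes data from a data register into memory.
Calculation instructions mainly include arithmetic instructions (addition, subtraction, multiplication, division, square root, maximum, minimum, approximate reciprocal, reciprocal square root, and so on), logical instructions, move instructions, shift instructions, bit-extension instructions, and so on. When the processor executes a calculation instruction, the calculation instruction can access data registers directly but cannot access memory directly. In an application, however, the parameters required by a calculation instruction (also called calculation parameters, operands, or operation objects) are usually stored in memory, so a read instruction that can fetch the parameters from memory must be provided for the calculation instruction. The read instruction accesses memory and loads the parameters required by the calculation instruction into a data register; the calculation instruction then takes the parameters from the data register and performs the calculation. After the calculation is complete, the calculation instruction writes the result back to the data register. If the result needs to be written back to memory, a write instruction that can write the result back to memory must be provided for the calculation instruction, and the write instruction writes the result from the data register back to memory. It can be seen that the execution of a calculation instruction depends on a read instruction to acquire the parameters it requires.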
处理器指令执行流水线,是为提高处理器执行指令的效率,把一条指令的操作分成多个细小的步骤,每个步骤由专门的电路模块完成的方式。其中,处理器可以为中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)或者数字信号处理器(digital signal processor,DSP)等。
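As shown in Figure 1, a processor instruction execution pipeline typically includes a fetch module 101, a decode module 102, a wait queue 103, an execution module 104, a reorder buffer module 105, and a write-back module 106, where:
The fetch module 101 fetches the instruction to be executed from memory and places it in the instruction register. Specifically, the fetch module 101 first obtains the program counter (PC) value (that is, the address of the instruction to be executed), then locates the instruction in memory according to the PC value and fetches it from memory into the instruction register.
The decode module 102 takes the instruction to be executed from the instruction register and decodes it to obtain its opcode and address code. The opcode indicates the nature of the operation to be performed, that is, what the instruction does, such as a read, a write, an addition, a subtraction, or a multiplication. The address code indicates the address of the parameters the instruction needs (the location of the parameters in the data registers); for example, an addition instruction can find its parameters in the data registers according to the address code. When a computer executes a given instruction, it must first work out what the opcode and address code are, determine the nature and manner of the operation from the opcode, and locate the parameters from the address code. Only then can it direct the other components to cooperate in carrying out the function the instruction expresses. This analysis is performed by the decode module 102.
After the decode module 102 has decoded the instruction to be executed into its opcode and address code, the instruction enters the wait queue 103 and waits for execution resources.
When multiple instructions to be executed have an order and/or dependencies among them (for example, the parameter required by a later instruction is the computation result of an earlier one), the wait queue 103 preserves that order and/or those dependencies after the instructions enter it, so that it can allocate resources sensibly.
The execution module 104 carries out the operations specified by the instruction to be executed and thereby implements its function. For example, suppose the instructions to be executed are a read instruction, computation instruction 1, and computation instruction 2: the read instruction fetches x0, y0, and z0 from memory; computation instruction 1 adds x0 and y0 to obtain x0+y0; and computation instruction 2 subtracts z0 from x0+y0 to obtain x0+y0-z0. The wait queue 103 can allocate resources to the read instruction, computation instruction 1, and computation instruction 2 in turn. That is, the execution module 104 first executes the read instruction, fetching x0, y0, and z0 from memory into data registers, then executes computation instruction 1 to obtain x0+y0, and finally executes computation instruction 2 to obtain x0+y0-z0. Specifically, the execution module 104 may implement the function of an instruction through an arithmetic and logic unit (ALU).
The reorder buffer module 105 reorders the results that the execution module 104 obtains from different instructions when the execution module 104 has not executed the instructions in their original order.
The write-back module 106 writes back to the data registers either the results the execution module 104 obtains from executing instructions or the results reordered by the reorder buffer module 105.
After the instruction has finished executing and its result has been written back, if no exceptional event (such as result overflow) occurs, the fetch module 101 can obtain the address of the next instruction to be executed from the PC and begin a new cycle.
It should be understood that the processor instruction execution pipeline shown in Figure 1 is only illustrative. Depending on implementation needs, the pipeline may include other or additional modules, and Figure 1 should not be regarded as a specific limitation.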
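An instruction that can be optimized with SIMD technology must satisfy all three of the following conditions:
Condition 1: the instruction has no dependency between different loop iterations.
This can be understood as follows: when the instruction iterates in a loop, the parameter required by its current operation is not the result of any of its earlier operations. For example, suppose the instruction is an addition instruction. If the parameter of its third addition is the result of its first or second addition, the instruction has a dependency between different loop iterations; otherwise it has none.
A SIMD instruction uses one execution unit to perform the same operation on every item in a group of data at the same time, thereby achieving data-level parallel computation. If an instruction has a dependency between different loop iterations, it cannot be SIMD-ized with a SIMD instruction. Therefore, to determine whether an instruction is SIMD-izable, it is necessary to determine whether the instruction is free of dependencies between different loop iterations, that is, whether it satisfies condition 1.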
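Condition 1 can be illustrated with a small Python sketch. This is not part of the application: the function names and the four-lane width are assumptions chosen for illustration. An element-wise addition gives the same result whether it is evaluated one element at a time or in groups of lanes, whereas a running sum, whose current operation needs the previous result, does not.

```python
def scalar_add(a, b):
    """Element-wise add: iteration i only uses a[i] and b[i]."""
    return [x + y for x, y in zip(a, b)]


def lanes_add(a, b, width=4):
    """Same add, evaluated `width` elements at a time (a stand-in for SIMD lanes)."""
    out = []
    for i in range(0, len(a), width):
        out.extend(x + y for x, y in zip(a[i:i + width], b[i:i + width]))
    return out


def scalar_running_sum(a):
    """Running sum: iteration i needs the result of iteration i-1."""
    out, acc = [], 0
    for x in a:
        acc = acc + x
        out.append(acc)
    return out


def naive_lanes_running_sum(a, width=4):
    """Naive lane-wise version: every lane in a group reads the same stale
    accumulator, so the cross-iteration dependency is violated."""
    out, acc = [], 0
    for i in range(0, len(a), width):
        lane_results = [acc + x for x in a[i:i + width]]
        out.extend(lane_results)
        acc = lane_results[-1]
    return out


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [10] * 8
    print(lanes_add(a, b) == scalar_add(a, b))                 # independent iterations
    print(naive_lanes_running_sum(a) == scalar_running_sum(a))  # dependent iterations
```

The first comparison holds and the second does not, which is why an instruction with a dependency between iterations is excluded by condition 1.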
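Condition 2: in the loop of operations the instruction performs, the strides between the addresses accessed by every two adjacent operations are equal and nonzero.
Here, the address accessed by an operation of the instruction can be understood as the location in memory of the parameter that the operation requires. The stride can be understood as the distance between the address accessed by the current operation and the address accessed by the previous operation; equivalently, it is the length of the parameter required by the current operation, that is, the number of bytes the parameter occupies in memory.
A SIMD instruction is a single instruction. When a programmer or a compiler replaces SIMD-izable code in the application with the corresponding SIMD instruction from the SIMD instruction set, the SIMD instruction is given only a base address (the address from which it obtains the first data item) and the length of the group of data to operate on, where every data item in the group has the same nonzero length. In other words, the SIMD instruction does not know the addresses of the other data items in the group. The group must therefore be contiguous in memory so that the SIMD instruction can fetch the whole group from memory at once for computation. If the group is not contiguous in memory, the SIMD instruction can obtain only the first data item at run time, does not know where to find the others, and cannot compute in parallel.
Therefore, to determine whether an instruction is SIMD-izable, it is necessary to determine whether the addresses accessed by its loop of operations are contiguous, that is, whether the strides between the addresses accessed by every two adjacent operations are equal and nonzero, that is, whether the instruction satisfies condition 2.
Condition 3: the instruction has a corresponding SIMD instruction in the SIMD instruction set.
At present, most computation instructions, such as addition, subtraction, and multiplication, have corresponding SIMD instructions in the SIMD instruction set. In a specific implementation, these computation instructions can be SIMD-ized with the corresponding SIMD instructions. Control instructions and a small number of computation instructions have no corresponding SIMD instruction in the SIMD instruction set and therefore cannot be SIMD-ized.
Because not every instruction has a corresponding SIMD instruction in the SIMD instruction set, determining whether an instruction is SIMD-izable also requires determining whether it has a corresponding SIMD instruction in the SIMD instruction set, that is, whether it satisfies condition 3.
From condition 3 above it follows that, in an application, only computation instructions (most of them) have corresponding SIMD instructions in the SIMD instruction set; that is, only these instructions can possibly be SIMD-ized. The method, apparatus, and related device provided by the embodiments of this application can therefore, while the application is running, analyze only the computation instructions in the application and determine whether each is SIMD-izable by checking whether it satisfies conditions 1 to 3. Instructions that cannot possibly be SIMD-ized (such as control instructions) need not be analyzed. Filtering out these instructions improves analysis efficiency, reduces the impact of the analysis on application performance, and reduces analysis overhead.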
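The process of determining whether a computation instruction satisfies conditions 1 to 3 is described in detail below.
(I) Determining whether the computation instruction satisfies condition 1
As explained in the introduction to computation instructions and read instructions above, the execution of a computation instruction depends on a read instruction to obtain its parameters. In a specific embodiment of this application, a read instruction executed by the processor can therefore be obtained while the application is running, and whether the computation instruction satisfies condition 1 can then be determined from the dependency between the read instruction and the computation instruction that relies on it for parameters (hereinafter, the computation instruction that depends on the read instruction). If the computation instruction does not satisfy condition 1, it is determined not to be SIMD-izable. If it satisfies condition 1, whether it satisfies conditions 2 and 3 can be further checked to determine whether it is SIMD-izable.
The process of obtaining a read instruction executed by the processor, and the process of determining whether the computation instruction satisfies condition 1 from the dependency between the read instruction and the computation instruction that depends on it, are introduced separately below.
(1) Obtaining a read instruction executed by the processor
It can be understood that, while an application is running, the processor usually executes a large number of instructions, which usually include multiple read instructions. Each read instruction corresponds to a unique PC value, so the read instructions can be distinguished by PC value.
In a specific embodiment of this application, the program data-level parallelism analysis apparatus may obtain any one or more of these read instructions; this is not specifically limited here. In this embodiment, the subsequent operations the apparatus performs on each read instruction are similar regardless of how many are obtained. For brevity, the following embodiments are described using a single obtained read instruction as an example.
In a more specific embodiment, the program data-level parallelism analysis apparatus may obtain one read instruction by sampling while the processor executes multiple read instructions. The sampling period of the apparatus may be set to a preset duration or to a number of instructions executed by the processor; this is not specifically limited here.
It can be understood that, when the apparatus obtains a read instruction by sampling, if the read instructions executed by the processor within a sampling period include read instructions that iterate in a loop, the read instruction that iterates most frequently (that is, has the most loop iterations in the current sampling period) is the most likely to be obtained.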
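The sampling behaviour described above can be sketched in Python as follows. This is only an illustrative model: the trace format, the PC values, and the rule "capture the first read instruction executed once a sampling point is reached" are assumptions, not details given in the application. The sampling period is expressed as a number of executed instructions.

```python
from collections import Counter


def sample_loads(trace, period):
    """Sample one read (load) instruction per period.

    trace  : list of (pc, kind) tuples in execution order (hypothetical format)
    period : sampling period, counted in executed instructions
    Returns a Counter mapping load PC -> number of times it was captured.
    """
    picked = Counter()
    armed = False
    for i, (pc, kind) in enumerate(trace, start=1):
        if i % period == 0:
            armed = True              # a sampling point has been reached
        if armed and kind == "load":
            picked[pc] += 1           # capture the first load seen after the point
            armed = False
    return picked


def demo_trace():
    """A hot loop (load at 0x100) run 10 times, then a cold load at 0x200, repeated."""
    hot = [(0x100, "load"), (0x104, "add")]
    cold = [(0x200, "load"), (0x204, "branch")]
    return (hot * 10 + cold) * 5


if __name__ == "__main__":
    picked = sample_loads(demo_trace(), period=7)
    for pc, count in picked.most_common():
        print(hex(pc), count)
```

In this model the load inside the frequently iterated loop is captured far more often than the cold load, which matches the observation in the text.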
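(2) Determining whether the computation instruction satisfies condition 1 from the dependency between the read instruction and the computation instruction that depends on it
In a specific embodiment of this application, the dependency between the read instruction and the computation instruction that depends on it can be obtained from the wait queue 103, a dependency graph can be built from that dependency, and whether the computation instruction satisfies condition 1 can then be determined by whether the dependency graph contains a dependency cycle. Specifically, if the dependency graph contains no dependency cycle, the computation instruction is determined to satisfy condition 1; if it contains a dependency cycle, the computation instruction is determined not to satisfy condition 1. It can be understood that determining that the computation instruction does not satisfy condition 1 means determining that it is not SIMD-izable.
A dependency graph is a graph that reflects the dependencies among instructions; in this embodiment, it reflects the dependency between the read instruction and the computation instruction that depends on it. A dependency cycle arises as follows: if an instruction iterates in a loop and, in each iteration, the execution of the current iteration depends on the result of the previous iteration, the different loop iterations of that instruction form a dependency cycle.
For example, suppose an instruction A iterates m times (m is a natural number greater than 1), with the following dependencies among the m iterations: after instruction A performs its first operation and obtains a first result, it uses the first result as the parameter of its second operation; after it obtains a second result, it uses the second result as the parameter of its third operation; and so on until the m iterations are complete. Figure 2 shows the dependency graph built from these dependencies. As Figure 2 shows, the dependency graph consists of instruction A performing the first operation, instruction A performing the second operation, ..., and instruction A performing the m-th operation, chained into a ring. That ring is the dependency cycle.
It can be understood that, when a read instruction iterates in a loop, the computation instruction that depends on it also iterates in a loop, with the same number of iterations. While an application is running, a read instruction that iterates in a loop usually iterates at least twice and may iterate thousands or tens of thousands of times or more. Building the dependency graph from the dependencies across all loop iterations of the read instruction and the dependent computation instruction would take considerable time and incur considerable overhead.
As Figure 2 shows, for an instruction that iterates in a loop, a dependency graph built from the dependencies across only some of its iterations already reflects the dependencies across all of its iterations. In a specific embodiment of this application, when the read instruction iterates in a loop, the dependency graph can therefore be built from the dependencies of the read instruction and the dependent computation instruction across only some of the loop iterations, and whether the computation instruction satisfies condition 1 can then be determined from that graph. This reduces the time and overhead of building the dependency graph.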
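In a specific implementation, the program data-level parallelism analysis apparatus may obtain the dependencies, across some of the iterations, of the read instruction and the dependent computation instruction held in the wait queue 103 during one sampling period.
For example, suppose that in one sampling period the read instruction B obtained by the apparatus iterates 3 times in total, computation instruction C depends on read instruction B, and the dependencies of read instruction B and computation instruction C across the 3 iterations are as follows:
B1 performs the first read and obtains a first parameter, which it passes to C1; C1 performs the first computation, obtains a first computation result, and writes it to a data register. B2 performs the second read and obtains a second parameter, which it passes to C2; C2 performs the second computation, obtains a second computation result, and writes it to a data register. B3 performs the third read and obtains a third parameter, which it passes to C3; C3 performs the third computation, obtains a third computation result, and writes it to a data register. Here the second parameter is not the first computation result, and the third parameter is neither the first nor the second computation result. B1, B2, and B3 denote read instruction B performing the first, second, and third read respectively, and C1, C2, and C3 denote computation instruction C performing the first, second, and third computation respectively.
Figure 3 shows the dependency graph built from the dependencies of read instruction B and computation instruction C across the 3 iterations in this example. As Figure 3 shows, the dependency graph consists of B1 and C1 chained in a line, B2 and C2 chained in a line, and B3 and C3 chained in a line. The graph contains no dependency cycle, so the computation instruction can be determined to satisfy condition 1.
Continuing with the case in which, in one sampling period, the read instruction B obtained by the apparatus iterates 3 times in total and computation instruction C depends on read instruction B, now suppose the dependencies of read instruction B and computation instruction C across the 3 iterations are as follows:
B1 performs the first read and obtains a first parameter, which it passes to C1; C1 performs the first computation, obtains a first computation result, and passes it to B2. B2 passes the first computation result to C2 as the second parameter; C2 performs the second computation, obtains a second computation result, and passes it to B3. B3 passes the second computation result to C3 as the third parameter; C3 performs the third computation and obtains a third computation result.
Figure 4 shows the dependency graph built from the dependencies of read instruction B and computation instruction C across the 3 iterations in this example. As Figure 4 shows, the dependency graph consists of B1, C1, B2, C2, B3, and C3 chained into a ring. The graph contains a dependency cycle, so the computation instruction can be determined not to satisfy condition 1, that is, not to be SIMD-izable.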
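The two examples above can be reproduced with a short Python sketch. It rests on an assumption that is not spelled out in the application: the per-iteration nodes (B1, C1, B2, ...) are collapsed onto their static instructions (B, C) before looking for a cycle, so that a result flowing from one iteration into the next shows up as an edge C → B. The helper names are hypothetical.

```python
def collapse(dynamic_edges):
    """Map per-iteration nodes such as 'B2' onto their static instruction 'B'."""
    def static(name):
        return name.rstrip("0123456789")
    return [(static(src), static(dst)) for src, dst in dynamic_edges]


def build_graph(edges):
    """Build an adjacency map {node: set(successors)} from (src, dst) edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    return graph


def has_cycle(graph):
    """Depth-first search with three colours; a back edge means a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


def satisfies_condition_1(dynamic_edges):
    """Condition 1 holds when the collapsed dependency graph has no cycle."""
    return not has_cycle(build_graph(collapse(dynamic_edges)))


# Dependencies of the first example (Figure 3) and the second example (Figure 4).
FIG3 = [("B1", "C1"), ("B2", "C2"), ("B3", "C3")]
FIG4 = [("B1", "C1"), ("C1", "B2"), ("B2", "C2"), ("C2", "B3"), ("B3", "C3")]

if __name__ == "__main__":
    print(satisfies_condition_1(FIG3))
    print(satisfies_condition_1(FIG4))
```

Only three iterations are used in each case, in line with the point that a graph built from part of the iterations is enough to expose the dependency.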
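(II) Determining whether the computation instruction satisfies condition 2
In a specific embodiment of this application, the n addresses corresponding to n consecutive reads performed by the read instruction can be obtained while the application is running, and whether the computation instruction that depends on the read instruction satisfies condition 2 can then be determined from the n addresses. If the computation instruction does not satisfy condition 2, it is determined not to be SIMD-izable. If it satisfies condition 2, whether it satisfies conditions 1 and 3 can be further determined to decide whether it is SIMD-izable. Here n is a natural number greater than 2.
The process of obtaining the n addresses corresponding to n consecutive reads performed by the read instruction, and the process of determining from the n addresses whether the computation instruction satisfies condition 2, are introduced separately below.
(1) Obtaining the n addresses corresponding to n consecutive reads performed by the read instruction
In a specific implementation, the value of n may be preset by the user or set each time before the n addresses are obtained. For example, the program data-level parallelism analysis apparatus may present the interface shown in Figure 5 to the user. In that interface n defaults to 3, and the user can enter a specific number through an input device such as a keyboard or touchscreen to preset n. The embodiments of this application do not specifically limit the value of n.
(2) Determining from the n addresses whether the computation instruction satisfies condition 2
As shown in Figure 6, this process includes but is not limited to the following steps:
S101. Compute the stride between every two adjacent addresses among the n addresses.
It can be understood that the larger n is, the more accurate the determination of whether the computation instruction satisfies condition 2, but the greater the analysis overhead; the smaller n is, the less accurate the determination, but the smaller the analysis overhead.
In a specific implementation, the way of determining from n addresses whether the computation instruction satisfies condition 2 is similar whether n is 3, 5, 8, or any other value. Taking n = 3 as an example, the process of computing the stride between every two adjacent addresses is described in detail below. It may include the following steps: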
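S1011. Compute a first stride from the first and second of the 3 addresses.
Here the first address is the address corresponding to the first of the 3 consecutive reads performed by the read instruction, and the second address is the address corresponding to the second of those reads.
In a specific embodiment of this application, so that the apparatus can compute the first stride from the second and first addresses when it obtains the second address, the first address can be recorded in a preconfigured collection table once it is obtained. When the second address is obtained, the first address is looked up in the collection table, and the difference between the second address and the first address is taken as the first stride.
S1012. Compute a second stride from the third of the 3 addresses, the first address, and the first stride.
In a specific embodiment of this application, so that the apparatus can compute the second stride from the third address, the first address, and the first stride when it obtains the third address, the first stride can be recorded in the collection table once it is obtained. When the third address is obtained, the first address and the first stride are looked up in the collection table, and the difference between the third address and the sum of the first address and the first stride is taken as the second stride.
It can be understood that, in a specific implementation, the apparatus may instead record the second address in the collection table after obtaining it, look up the second address in the collection table when the third address is obtained, and take the difference between the third address and the second address as the second stride.
In a specific embodiment of this application, the apparatus may obtain different read instructions for analysis in different sampling periods. To distinguish the different read instructions obtained in different sampling periods, together with the first address and first stride of each, the apparatus may, in each sampling period, allocate an entry in the collection table for the read instruction after obtaining it and record the address of the read instruction in that entry. Later, after obtaining the first address and the first stride for the read instruction, it records them in the same entry.
Table 1 shows an exemplary collection table provided by an embodiment of this application. As shown in Table 1, the collection table includes entries such as entry1, entry2, and entry3. Each entry may include a read-instruction address column, a base address column, a stride column, and so on. The read-instruction address column records the address of the read instruction, the base address column records the first address, and the stride column records the first stride. Table 2 shows an exemplary entry1 provided by an embodiment of this application. Suppose the entry allocated to the read instruction is entry1, the address of the read instruction is A', the first address is B', and the first stride is C'; the apparatus can then record A', B', and C' in entry1 as shown in Table 2.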
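Table 1 Collection table
entry1
entry2
entry3
Table 2 entry1
Address of the read instruction | Base address | Stride
A' | B' | C'
It should be noted that Table 1 and Table 2 are only examples and should not be regarded as specific limitations on the collection table or its entries.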
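S102. Determine whether the strides between every two adjacent addresses are equal and nonzero. If they are equal and nonzero, perform S103; if they are unequal and/or zero, perform S104.
Continuing with the n = 3 example from S101, the apparatus can determine whether the second stride equals the first stride and whether both are nonzero. If the second stride equals the first stride and both are nonzero, S103 is performed. If the second stride does not equal the first stride, or the first stride and/or the second stride is zero, S104 is performed.
S103. Determine whether the computation instruction satisfies condition 1 and condition 3.
S104. Determine that the computation instruction does not satisfy condition 2.
In a specific embodiment of this application, when it is determined from the n addresses that the computation instruction does not satisfy condition 2, that is, that the computation instruction is not SIMD-izable, the apparatus may delete the entry corresponding to the read instruction from the collection table to reduce resource usage and save storage space.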
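Steps S1011, S1012, S102, and S104 for n = 3 can be sketched in Python as follows. The class and field names are hypothetical, the table is modelled as a dictionary keyed by the PC of the read instruction, and the sketch handles exactly three observations per entry; it is a minimal illustration rather than the apparatus itself.

```python
class CollectionTable:
    """Per-read-instruction entries holding a base address and a first stride."""

    def __init__(self):
        self.entries = {}  # load PC -> {"base": int, "stride": int or None}

    def observe(self, pc, address):
        """Feed one accessed address for the read instruction at `pc`.

        Returns None while fewer than three addresses have been seen,
        True if the two strides are equal and nonzero (condition 2 holds),
        False otherwise (the entry is then deleted, as in the text).
        """
        entry = self.entries.get(pc)
        if entry is None:
            # First address: allocate the entry and record the base address.
            self.entries[pc] = {"base": address, "stride": None}
            return None
        if entry["stride"] is None:
            # S1011: first stride = second address - first address.
            entry["stride"] = address - entry["base"]
            return None
        # S1012: second stride = third address - (first address + first stride).
        second_stride = address - (entry["base"] + entry["stride"])
        # S102: strides must be equal and nonzero.
        ok = entry["stride"] == second_stride and second_stride != 0
        if not ok:
            del self.entries[pc]  # S104: free the entry
        return ok


if __name__ == "__main__":
    table = CollectionTable()
    for addr in (0x1000, 0x1008, 0x1010):    # constant stride of 8 bytes
        result = table.observe(0x400, addr)
    print(result)
    for addr in (0x2000, 0x2008, 0x2020):    # strides 8 and 16
        result = table.observe(0x500, addr)
    print(result)
```

Storing only the base address and the first stride keeps each entry small, which is the layout Table 2 describes; the alternative mentioned in the text, storing the second address instead, would work equally well here.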
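(III) Determining whether the computation instruction satisfies condition 3
In a specific embodiment of this application, after obtaining a read instruction executed by the processor, the program data-level parallelism analysis apparatus can determine from the SIMD instruction set whether the computation instruction that depends on the read instruction satisfies condition 3.
In one possible implementation, the apparatus may include a preconfigured SIMD instruction set, which may be stored in the apparatus in the form of instructions, with a correspondence established in advance between each SIMD instruction in the set and its corresponding computation instruction. After obtaining a computation instruction, the apparatus can look up whether the SIMD instruction set contains a SIMD instruction corresponding to it. If such a SIMD instruction exists, the computation instruction is determined to satisfy condition 3; if not, it is determined not to satisfy condition 3. It can be understood that determining that the computation instruction does not satisfy condition 3 means determining that it is not SIMD-izable.
In another possible implementation, the SIMD instruction set may be stored in the form of a SIMD instruction table that includes a keyword for each SIMD instruction in the set. For example, if a SIMD instruction can SIMD-ize an addition instruction, the keyword for that SIMD instruction in the table may be "add" and/or "+". After obtaining an addition instruction, the apparatus can extract the keyword "add" and/or "+" from it; if the SIMD instruction table contains the keyword "add" and/or "+", the computation instruction is determined to satisfy condition 3, and otherwise it is determined not to satisfy condition 3.
It should be noted that the two forms in which the SIMD instruction set is stored in the apparatus are only examples and should not be regarded as specific limitations.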
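The keyword-table variant can be sketched in Python as below. The table contents and the textual instruction format (mnemonic followed by operands) are illustrative assumptions; a real SIMD instruction table would be derived from the target instruction set.

```python
# Hypothetical SIMD instruction table: keywords of scalar operations that are
# assumed to have a SIMD counterpart. The entries are for illustration only.
SIMD_KEYWORDS = {"add", "sub", "mul"}


def extract_keyword(instruction_text):
    """Take the mnemonic (first token) of a textual instruction as its keyword."""
    return instruction_text.split()[0].lower()


def satisfies_condition_3(instruction_text, table=SIMD_KEYWORDS):
    """Condition 3 holds when the instruction's keyword appears in the table."""
    return extract_keyword(instruction_text) in table


if __name__ == "__main__":
    print(satisfies_condition_3("add x0, x1, x2"))  # keyword "add" is in the table
    print(satisfies_condition_3("b loop_start"))    # a branch has no table entry
```

A set lookup keeps the check cheap, which matters if it runs in every sampling period alongside the checks for conditions 1 and 2.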
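It can be understood that, so that the apparatus can determine in one pass whether a computation instruction is SIMD-izable and thereby improve analysis efficiency, in a specific implementation the apparatus may determine within the same sampling period whether the computation instruction satisfies conditions 1, 2, and 3. The embodiments of this application do not limit the order in which the apparatus checks conditions 1, 2, and 3.
It can also be understood that, if the apparatus determines within one sampling period that a computation instruction satisfies conditions 1, 2, and 3 simultaneously, it can determine that the computation instruction is SIMD-izable.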
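Figure 7 is a schematic flowchart of a program data-level parallelism analysis method provided by an embodiment of this application. The method can be applied to processors that support SIMD technology, such as CPUs, GPUs, and DSPs; this is not specifically limited here.
As shown in Figure 7, the method includes but is not limited to the following steps:
S201. Obtain a read instruction executed by the processor, where the read instruction is used to obtain the parameters required by a computation instruction.
S202. Determine, according to the dependency between the read instruction and the computation instruction, whether the computation instruction is SIMD-izable.
In one possible implementation, before determining whether the computation instruction is SIMD-izable according to the dependency between the read instruction and the computation instruction, the n addresses corresponding to n consecutive reads performed by the read instruction may also be obtained. Once the n addresses are obtained, whether the computation instruction is SIMD-izable is determined according to both the dependency and the n addresses, where n is a natural number greater than 2.
In one possible implementation, the program data-level parallelism analysis apparatus may include a preconfigured SIMD instruction set. While determining whether the computation instruction is SIMD-izable according to the dependency and the n addresses, the apparatus further determines whether it is SIMD-izable according to the SIMD instruction set.
In one possible implementation, determining whether the computation instruction is SIMD-izable according to the dependency between the read instruction and the computation instruction and the n addresses specifically includes:
computing the stride between every two adjacent addresses among the n addresses;
determining whether the computation instruction is SIMD-izable according to whether the strides between every two adjacent addresses are equal and nonzero and according to the dependency between the read instruction and the computation instruction.
In one possible implementation, determining whether the computation instruction is SIMD-izable according to the dependency between the read instruction and the computation instruction specifically includes:
building a dependency graph from the dependency between the read instruction and the computation instruction, where the dependency graph reflects the dependency between the read instruction and the computation instruction;
determining whether the computation instruction is SIMD-izable according to whether the dependency graph contains a dependency cycle.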
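In a specific embodiment of this application, the embodiment shown in Figure 7 can serve as one sampling analysis pass performed by the program data-level parallelism analysis apparatus while the application is running. When the apparatus first determines that a computation instruction is SIMD-izable, it can set the number of times that instruction has been determined SIMD-izable to 1. If the apparatus again determines in a later sampling period that the instruction is SIMD-izable, it can increment that count. If the apparatus determines in a later sampling period that the instruction is not SIMD-izable, it can either decrement the count by 1 or mark the instruction as not SIMD-izable and stop analyzing it in subsequent sampling periods.
In a specific embodiment of this application, after the apparatus has gone through multiple sampling periods, if the number of times a computation instruction has been determined SIMD-izable reaches a preset threshold, the apparatus can generate prompt information. The prompt information can be presented or sent to the user so that the user learns as soon as possible that the application contains SIMD-izable code, and the user can refer to it when optimizing that code.
Specifically, the prompt information may take the form of highlighting, in an interface, the address of the computation instruction and the number of times it has been determined SIMD-izable, or of sending the user a text message or email containing that address and count; this is not specifically limited here. The sampling period may be set to a preset duration or to a number of instructions executed by the processor; this is not specifically limited here.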
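The counting and prompting logic can be sketched in Python as follows. The class name, the message format, and the choice of the "mark as not SIMD-izable" variant (rather than decrementing the count) are assumptions made for illustration.

```python
class SimdVoteCounter:
    """Counts, per computation instruction, how often it was judged SIMD-izable."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = {}       # instruction address -> times judged SIMD-izable
        self.rejected = set()  # instructions marked as not SIMD-izable

    def record(self, pc, simdizable):
        """Record the verdict of one sampling period.

        Returns a prompt string once the count reaches the threshold,
        otherwise None.
        """
        if pc in self.rejected:
            return None                      # no longer analyzed
        if not simdizable:
            self.rejected.add(pc)            # variant: mark instead of decrement
            self.counts.pop(pc, None)
            return None
        self.counts[pc] = self.counts.get(pc, 0) + 1
        if self.counts[pc] >= self.threshold:
            return (f"instruction at {pc:#x} judged SIMD-izable "
                    f"{self.counts[pc]} times")
        return None


if __name__ == "__main__":
    counter = SimdVoteCounter(threshold=3)
    for verdict in (True, True, True):
        message = counter.record(0x400, verdict)
    print(message)
```

Requiring several agreeing sampling periods before prompting trades a little latency for fewer false reports, since a single period only sees a few iterations of the loop.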
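For brevity, this embodiment does not elaborate on the specific process of obtaining a read instruction executed by the processor, the specific process of determining whether the computation instruction is SIMD-izable from the dependency between the read instruction and the computation instruction that depends on it, the specific process of determining from the n addresses whether the computation instruction is SIMD-izable, or the specific process of determining from the SIMD instruction set whether the computation instruction is SIMD-izable. For details, refer to the preceding text and related descriptions; they are not repeated here.
It can be seen that the program data-level parallelism analysis method provided by the embodiments of this application can, while the application is running, obtain a read instruction executed by the processor and then determine whether the computation instruction is SIMD-izable from the dependency between the read instruction and the computation instruction that depends on it. This makes it possible to identify SIMD-izable code in the application quickly, improves the efficiency of finding such code, and saves manpower and time.
It can also be seen that, on top of determining whether the computation instruction is SIMD-izable from the dependency between the read instruction and the computation instruction, the method can further make that determination according to the n addresses corresponding to n consecutive reads performed by the read instruction and/or the SIMD instruction set, which further improves the accuracy of the determination.
In addition, it can be understood that the embodiment shown in Figure 7 can serve as one sampling analysis pass performed by the apparatus while the application is running. In each sampling period the apparatus can analyze only computation instructions to determine whether they are SIMD-izable, without analyzing the instructions in the application that cannot possibly be SIMD-ized, which reduces the impact of the analysis on application performance and reduces analysis overhead.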
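The program data-level parallelism analysis method of the embodiments of this application has been described in detail above. Based on the same inventive concept, the program data-level parallelism analysis apparatus of the embodiments of this application is provided below. The apparatus can be applied to processors that support SIMD technology, such as CPUs, GPUs, and DSPs; this is not specifically limited here.
Figure 8 is a schematic structural diagram of a program data-level parallelism analysis apparatus 100 provided by an embodiment of this application. The apparatus 100 includes an obtaining module 110, a determining module 120, and a prompt module 130, where:
the obtaining module 110 is configured to obtain a read instruction executed by the processor, where the read instruction is used to obtain the parameters required by a computation instruction;
the determining module 120 is configured to determine, according to the dependency between the read instruction and the computation instruction, whether the computation instruction can be single-instruction-multiple-data (SIMD)-ized.
In one possible implementation, before whether the computation instruction is SIMD-izable is determined according to the dependency between the read instruction and the computation instruction, the obtaining module 110 is further configured to obtain the n addresses corresponding to n consecutive reads performed by the read instruction, where n is a natural number greater than 2;
the determining module 120 is specifically configured to:
determine whether the computation instruction is SIMD-izable according to the dependency between the read instruction and the computation instruction and the n addresses.
In one possible implementation, the determining module 120 is specifically configured to:
determine whether the computation instruction is SIMD-izable according to the dependency between the read instruction and the computation instruction, the n addresses, and the SIMD instruction set.
In one possible implementation, the program data-level parallelism analysis apparatus 100 further includes the prompt module 130;
the obtaining module 110 is further configured to obtain, when the read instruction executed by the processor has been obtained multiple times, the number of times the computation instruction has been determined SIMD-izable;
the prompt module 130 is configured to generate prompt information when the number of times reaches a preset threshold.
In one possible implementation, the determining module 120 is specifically configured to:
compute the stride between every two adjacent addresses among the n addresses;
determine whether the computation instruction is SIMD-izable according to whether the strides between every two adjacent addresses are equal and nonzero and according to the dependency between the read instruction and the computation instruction.
In one possible implementation, the determining module 120 is specifically configured to:
build a dependency graph from the dependency between the read instruction and the computation instruction, where the dependency graph reflects the dependency between the read instruction and the computation instruction;
determine whether the computation instruction is SIMD-izable according to whether the dependency graph contains a dependency cycle.
Specifically, for the implementation of the operations performed by the program data-level parallelism analysis apparatus 100, refer to the relevant descriptions in the method embodiments above. For brevity of the specification, they are not repeated here.
It should be understood that the program data-level parallelism analysis apparatus 100 is only one example provided by the embodiments of this application. The apparatus 100 may have more or fewer components than shown in Figure 8, may combine two or more components, or may be implemented with a different configuration of components.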
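An embodiment of this application further provides a computer device 200. Figure 9 is a schematic structural diagram of the computer device 200, which includes a processor 210, a memory 220, and a communication interface 230. The processor 210, the memory 220, and the communication interface 230 may be interconnected through a bus 240, where:
The processor 210 can read the program code stored in the memory 220 and, in cooperation with the communication interface 230, perform some or all of the steps of the method performed by the program data-level parallelism analysis apparatus 100 in the embodiments above.
The processor 210 can be implemented in many specific forms. For example, the processor 210 may be a CPU or a GPU, and it may be a single-core or multi-core processor. The processor 210 may be a combination of a CPU and a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The processor 210 may also be implemented solely with a logic device that has built-in processing logic, such as an FPGA or a DSP.
The memory 220 may store program code and program data. The program code includes the code of the obtaining module 110, the code of the determining module 120, the code of the prompt module 130, and so on. The program data includes the read instruction, the dependency between the read instruction and the computation instruction, the n addresses corresponding to n consecutive reads performed by the read instruction, the first stride, and so on.
In practice, the memory 220 may be a non-volatile memory, such as read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory. The memory 220 may also be a volatile memory, such as random access memory (RAM), which is used as an external cache.
The communication interface 230 may be a wired interface (such as an Ethernet interface) or a wireless interface (such as a cellular network interface or a wireless local area network interface) used to communicate with other computing nodes or apparatuses. When the communication interface 230 is a wired interface, it may use a protocol family on top of the transmission control protocol/internet protocol (TCP/IP), such as the remote function call (RFC) protocol, the simple object access protocol (SOAP), the simple network management protocol (SNMP), the common object request broker architecture (CORBA) protocol, and distributed protocols.
The bus 240 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus 240 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, it is drawn as a single thick line in Figure 9, but this does not mean that there is only one bus or one type of bus.
The computer device 200 is configured to perform the method described in the method embodiments above and is based on the same concept as those embodiments. For its specific implementation, see the method embodiments above; details are not repeated here.
It should be understood that the computer device 200 is only one example provided by the embodiments of this application. The computer device 200 may have more or fewer components than shown in Figure 9, may combine two or more components, or may be implemented with a different configuration of components.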
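An embodiment of this application further provides a non-transitory computer-readable storage medium storing instructions which, when run on a processor, can implement some or all of the steps of the program data-level parallelism analysis method described in the embodiments above.
An embodiment of this application further provides a computer program product which, when read and executed by a computer, can implement some or all of the steps of the program data-level parallelism analysis method described in the method embodiments above.
In the embodiments above, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
The embodiments above may be implemented wholly or partly in software, hardware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (such as a floppy disk, hard disk, or magnetic tape), an optical medium, a semiconductor medium, or the like.
The steps of the methods in the embodiments of this application may be reordered, merged, or removed according to actual needs; the units of the apparatuses in the embodiments of this application may be divided, merged, or removed according to actual needs.
The embodiments of this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the embodiments above is intended only to help in understanding the method of this application and its core idea. A person of ordinary skill in the art may, following the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.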

Claims (14)

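  1. A program data-level parallelism analysis method, characterized in that the method comprises:
    obtaining a read instruction executed by a processor, wherein the read instruction is used to obtain parameters required by a computation instruction;
    determining, according to a dependency between the read instruction and the computation instruction, whether the computation instruction can be single-instruction-multiple-data (SIMD)-ized.
  2. The method according to claim 1, characterized in that, before determining whether the computation instruction can be SIMD-ized according to the dependency between the read instruction and the computation instruction, the method further comprises:
    obtaining n addresses corresponding to n consecutive reads performed by the read instruction, wherein n is a natural number greater than 2;
    the determining whether the computation instruction can be SIMD-ized according to the dependency between the read instruction and the computation instruction comprises:
    determining whether the computation instruction can be SIMD-ized according to the dependency between the read instruction and the computation instruction and the n addresses.
  3. The method according to claim 2, characterized in that the determining whether the computation instruction can be SIMD-ized according to the dependency between the read instruction and the computation instruction and the n addresses comprises:
    determining whether the computation instruction can be SIMD-ized according to the dependency between the read instruction and the computation instruction, the n addresses, and a SIMD instruction set.
  4. The method according to claim 3, characterized in that the method further comprises:
    when the read instruction executed by the processor has been obtained multiple times, obtaining the number of times the computation instruction has been determined to be SIMD-izable;
    generating prompt information when the number of times reaches a preset threshold.
  5. The method according to any one of claims 2 to 4, characterized in that the determining whether the computation instruction can be SIMD-ized according to the dependency between the read instruction and the computation instruction and the n addresses comprises:
    computing the stride between every two adjacent addresses among the n addresses;
    determining whether the computation instruction can be SIMD-ized according to whether the strides between every two adjacent addresses are equal and nonzero and according to the dependency between the read instruction and the computation instruction.
  6. The method according to any one of claims 1 to 5, characterized in that the determining whether the computation instruction can be SIMD-ized according to the dependency between the read instruction and the computation instruction comprises:
    building a dependency graph from the dependency between the read instruction and the computation instruction, wherein the dependency graph reflects the dependency between the read instruction and the computation instruction;
    determining whether the computation instruction can be SIMD-ized according to whether the dependency graph contains a dependency cycle.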
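  7. A program data-level parallelism analysis apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain a read instruction executed by a processor, wherein the read instruction is used to obtain parameters required by a computation instruction;
    a determining module, configured to determine, according to a dependency between the read instruction and the computation instruction, whether the computation instruction can be SIMD-ized.
  8. The apparatus according to claim 7, characterized in that, before whether the computation instruction can be SIMD-ized is determined according to the dependency between the read instruction and the computation instruction,
    the obtaining module is further configured to obtain n addresses corresponding to n consecutive reads performed by the read instruction, wherein n is a natural number greater than 2;
    the determining module is specifically configured to:
    determine whether the computation instruction can be SIMD-ized according to the dependency between the read instruction and the computation instruction and the n addresses.
  9. The apparatus according to claim 8, characterized in that the determining module is specifically configured to:
    determine whether the computation instruction can be SIMD-ized according to the dependency between the read instruction and the computation instruction, the n addresses, and a SIMD instruction set.
  10. The apparatus according to claim 9, characterized in that the apparatus further comprises a prompt module;
    the obtaining module is further configured to obtain, when the read instruction executed by the processor has been obtained multiple times, the number of times the computation instruction has been determined to be SIMD-izable;
    the prompt module is configured to generate prompt information when the number of times reaches a preset threshold.
  11. The apparatus according to any one of claims 8 to 10, characterized in that the determining module is specifically configured to:
    compute the stride between every two adjacent addresses among the n addresses;
    determine whether the computation instruction can be SIMD-ized according to whether the strides between every two adjacent addresses are equal and nonzero and according to the dependency between the read instruction and the computation instruction.
  12. The apparatus according to any one of claims 7 to 11, characterized in that the determining module is specifically configured to:
    build a dependency graph from the dependency between the read instruction and the computation instruction, wherein the dependency graph reflects the dependency between the read instruction and the computation instruction;
    determine whether the computation instruction can be SIMD-ized according to whether the dependency graph contains a dependency cycle.
  13. A computer device, characterized in that the computer device comprises a processor and a memory; the memory is configured to store instructions, and the processor is configured to execute the instructions to implement the method according to any one of claims 1 to 6.
  14. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer-readable instructions which, when run, perform the method according to any one of claims 1 to 6.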
PCT/CN2021/130179 2021-01-30 2021-11-12 Program data-level parallelism analysis method, apparatus, and related device WO2022160863A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110131344.6 2021-01-30
CN202110131344.6A CN114840256A (zh) 2021-01-30 2021-01-30 Program data-level parallelism analysis method, apparatus, and related device

Publications (1)

Publication Number Publication Date
WO2022160863A1 true WO2022160863A1 (zh) 2022-08-04

Family

ID=82561068

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/130179 WO2022160863A1 (zh) 2021-01-30 2021-11-12 Program data-level parallelism analysis method, apparatus, and related device

Country Status (2)

Country Link
CN (1) CN114840256A (zh)
WO (1) WO2022160863A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117193861B (zh) * 2023-11-07 2024-03-15 芯来智融半导体科技(上海)有限公司 Instruction processing method, apparatus, computer device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110161642A1 (en) * 2009-12-30 2011-06-30 International Business Machines Corporation Parallel Execution Unit that Extracts Data Parallelism at Runtime
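CN103049245A (zh) * 2012-10-25 2013-04-17 浪潮电子信息产业股份有限公司 Software performance optimization method based on a multi-core CPU platform
CN103279327A (zh) * 2013-04-28 2013-09-04 中国人民解放军信息工程大学 Automatic vectorization method for heterogeneous SIMD extension components
CN111124415A (zh) * 2019-12-06 2020-05-08 西安交通大学 Method for exploiting potentially vectorizable loops in loop code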


Also Published As

Publication number Publication date
CN114840256A (zh) 2022-08-02


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21922453

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21922453

Country of ref document: EP

Kind code of ref document: A1