WO2019080091A1 - Code processing method and device - Google Patents

Code processing method and device

Info

Publication number
WO2019080091A1
Authority
WO
WIPO (PCT)
Prior art keywords
code
branch
branch code
execution logic
execution
Prior art date
Application number
PCT/CN2017/108003
Other languages
English (en)
French (fr)
Inventor
Lin Huanxin (林焕鑫)
Wang Zhuoli (王卓立)
Ma Junchao (马军超)
Shen Weifeng (沈伟锋)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN201780094772.8A (CN111095197B)
Priority to PCT/CN2017/108003
Publication of WO2019080091A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs

Definitions

  • Embodiments of the present invention relate to the field of data processing, and in particular, to a code processing method and a code processing device.
  • a graphics processing unit (GPU) creates a large number of threads when it receives a kernel launch call. For example, 64 or 32 threads can form a thread bundle.
  • in Open Computing Language (OpenCL) the thread bundle is called a wavefront (wave for short), and in the Compute Unified Device Architecture (CUDA) the thread bundle is called a warp.
  • the threads in a warp are bound together, and at each step they all execute a uniform instruction.
  • when member threads take different branches, the instruction uniformity forces the warp to serially execute each branch corresponding to its member threads; this is called the branch divergence problem.
  • all threads execute each branch together, but the results computed by threads unrelated to the currently executing branch are discarded, reducing the parallelism and efficiency of execution.
  • one existing solution to the branch divergence problem is a code matching scheme, which extracts instructions that are identical across the two branches under the same if statement out of those branches, reducing the instruction repetition caused by divergence.
  • Embodiments of the present invention provide a code processing method and a code processing device for reducing redundant code in a code.
  • a first aspect of the embodiments of the present invention provides a code processing method, including: acquiring code to be executed, where the code to be executed includes a first branch code and a second branch code, the first branch code includes a third branch code and a fourth branch code, the second branch code and the third branch code include the same first execution logic, the fourth branch code does not include the first execution logic, and the second branch code and the third branch code are mutually exclusive branch codes. When the second branch code and the third branch code are to be serially executed, the first execution logic is executed twice, so the first execution logic is redundant code.
  • a fifth branch code is generated using a target conditional expression and the first execution logic, where the target conditional expression is used to control the execution of the first execution logic in the fifth branch code.
  • after the extraction, the first execution logic is no longer included in the second branch code and the third branch code but is included in the generated fifth branch code, so the first execution logic is retained in the code to be executed while the number of its copies is reduced from two to one, reducing the redundant code in the code to be executed.
  • in a first implementation manner of the first aspect, before the first execution logic is extracted from the second branch code and the third branch code, the method further includes: determining whether an overhead time is less than a first saving time, where the overhead time represents the execution time introduced by the target conditional expression, and the first saving time represents the time required to execute the first execution logic.
  • the step of extracting the first execution logic from the second branch code and the third branch code is performed when the overhead time is less than the first saving time.
  • performing the extraction only in that case not only reduces the redundant code of the code to be executed, but also reduces the execution time required to execute the code.
  • in a second implementation manner, determining whether the overhead time is less than the first saving time includes: calculating the code length of the target conditional expression and the code length of the first execution logic, and then determining whether the code length of the target conditional expression is smaller than the code length of the first execution logic.
  • the first execution logic and the target conditional expression are both code, executing code takes time, and code of greater length generally takes longer to execute than code of shorter length.
  • therefore, a target conditional expression whose code length is smaller than that of the first execution logic indicates that the overhead time is less than the first saving time.
  • in a third implementation manner, determining whether the overhead time is less than the first saving time includes: calculating the number of registers used by the target conditional expression, and then determining whether that number is less than a preset register-count threshold. The target conditional expression may use registers, and if the number of registers used exceeds the threshold, the number of work-groups resident on each CU/SM is reduced, causing a performance loss; if the number of registers used is below the threshold, the performance loss is small.
  • a register count below the preset threshold therefore indicates that the overhead time is less than the first saving time. Converting the time comparison into a register-count comparison simplifies the judgment, because the number of registers used is easy to determine.
  • in a fourth implementation manner, based on any one of the first to third implementation manners of the first aspect, obtaining the target conditional expression includes: setting an identifier at the position of the first execution logic in the second branch code and in the third branch code respectively, where the identifier is set to a specific value when the conditional expression that controls execution of the first execution logic in the branch code where the identifier is located is satisfied.
  • the identifier can thus record, within its branch code, the result of the conditional expression that controls execution of the first execution logic.
  • the identifier may be, for example, a status flag register (flag) or an integer variable. The target conditional expression is then generated using the identifier of the second branch code and the identifier of the third branch code.
  • generating the fifth branch code using the target conditional expression and the first execution logic includes: after the first branch code and the second branch code, generating the fifth branch code from the target conditional expression and the first execution logic, where the target conditional expression is used to control execution of the first execution logic when its identifier satisfies the aforementioned specific value.
  • when a thread executes the fifth branch code, the thread executes the first execution logic of the fifth branch code if the identifier of the target conditional expression satisfies the specific value, and does not execute it otherwise.
  • because the identifier is set to the specific value when the conditional expression in its branch code is satisfied, the specific value recorded in that branch code captures the result of the conditional judgment that controls execution of the first execution logic.
  • after the target conditional expression is generated using the identifier of the second branch code and the identifier of the third branch code, whether the first execution logic would have been executed in the second branch code and the third branch code can be determined through the target conditional expression.
  • the target conditional expression can therefore effectively control the execution of the first execution logic.
  • in a fifth implementation manner, based on any one of the first to third implementation manners of the first aspect, acquiring the target conditional expression includes: combining a first conditional expression and a second conditional expression by means of "or" to obtain the target conditional expression.
  • the first conditional expression is for controlling execution of the first execution logic in the second branch code,
  • and the second conditional expression is for controlling execution of the first execution logic in the third branch code.
  • generating the fifth branch code using the target conditional expression and the first execution logic includes: generating the fifth branch code, using the target conditional expression and the first execution logic, before the first branch code and the second branch code.
  • because the target conditional expression is obtained by combining the two conditional expressions with "or", the fifth branch code can be placed before the first branch code and the second branch code, and when the fifth branch code is executed, its target conditional expression determines whether the first execution logic is to be executed.
  • in a sixth implementation manner, based on any one of the first to fifth implementation manners of the first aspect,
  • the method further includes: placing the thread that executes the second branch code and the thread that executes the third branch code in the same warp.
  • this can be implemented by thread-data remapping (TDR).
  • the thread executing the second branch code and the thread executing the third branch code can be placed in the same warp because the first execution logic has been extracted from the second branch code and the third branch code, so the length of the code the two branches must execute serially is reduced, thereby reducing the execution time of the second branch code and the third branch code.
  • in a seventh implementation manner, based on any one of the first to sixth implementation manners of the first aspect,
  • the second branch code and the fourth branch code include the same second execution logic.
  • before the first execution logic in the second branch code and the third branch code is extracted, the method further includes: calculating the first saving time and a second saving time, where the first saving time represents the time required to execute the first execution logic and the second saving time represents the time required to execute the second execution logic. Since the first execution logic can be extracted from the second and third branch codes, and the second execution logic can be extracted from the second and fourth branch codes, which pair of branch codes to process can be decided according to the saving times.
  • extracting the first execution logic from the second branch code and the third branch code includes: extracting it when the first saving time is greater than the second saving time.
  • in an eighth implementation manner, based on any one of the first to seventh implementation manners of the first aspect, the code to be executed is kernel code of a graphics processing unit (GPU) program, and the first branch code, the second branch code, the third branch code, the fourth branch code, and the fifth branch code are if statements.
  • when the kernel code encounters the branch divergence problem, it needs to execute the different branch codes serially.
  • the method of this implementation manner reduces the amount of branch code that must be serially executed and improves the execution efficiency of the code.
  • a second aspect of an embodiment of the present invention provides a code processing device having a function of performing the above method.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the functions described above.
  • the code processing device includes:
  • a first obtaining unit configured to acquire code to be executed, where the code to be executed includes a first branch code and a second branch code, where the first branch code includes a third branch code and a fourth branch code, and the second branch code and the third branch code Including the same first execution logic, the fourth branch code does not include the first execution logic;
  • a second acquiring unit configured to acquire a target conditional expression;
  • an extracting unit configured to extract the first execution logic in the second branch code and the third branch code;
  • a generating unit configured to generate a fifth branch code using the target conditional expression and the first execution logic, where the target conditional expression is used to control execution of the first execution logic in the fifth branch code.
  • the code processing device includes: a processor
  • the processor performs the following actions:
  • code to be executed is acquired, where the code to be executed includes a first branch code and a second branch code, the first branch code includes a third branch code and a fourth branch code, the second branch code and the third branch code include the same first execution logic,
  • and the fourth branch code does not include the first execution logic;
  • a target conditional expression is acquired, the first execution logic in the second branch code and the third branch code is extracted, and a fifth branch code is generated using the target conditional expression and the first execution logic, where the target conditional expression is used to control execution of the first execution logic in the fifth branch code.
  • a third aspect of an embodiment of the present invention provides a computer-readable storage medium having instructions stored therein that, when run on a computer, cause the computer to perform the method of the first aspect or any implementation manner of the first aspect.
  • a fourth aspect of an embodiment of the present invention provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any implementation manner of the first aspect.
  • a fifth aspect of the embodiments of the present invention provides a computing device, including a processor, a memory, and a bus; the memory stores execution instructions, the processor and the memory are connected through the bus, and when the computing device runs, the processor executes the instructions stored in the memory to cause the computing device to perform the method of the first aspect or any implementation manner of the first aspect.
  • the code to be executed is obtained, where the code to be executed includes a first branch code and a second branch code, the first branch code includes a third branch code and a fourth branch code, the second branch code and the third branch code include the same first execution logic, and the fourth branch code does not include the first execution logic. Since the second branch code and the third branch code include the same first execution logic, the code to be executed contains two copies of the first execution logic, which is redundant code.
  • the target conditional expression is acquired, and the first execution logic in the second branch code and the third branch code is extracted; then the fifth branch code is generated using the target conditional expression and the first execution logic, where the target conditional expression is used to control execution of the first execution logic in the fifth branch code.
  • because of the extraction, the first execution logic is no longer included in the second branch code and the third branch code but is included in the generated fifth branch code, so the first execution logic is retained in the code to be executed while the number of its copies is reduced from two to one, reducing the redundant code in the code to be executed.
  • FIG. 1 is a schematic diagram of an organization form of an OpenCL kernel on a GPU according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a TDR method according to another embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a code matching method according to another embodiment of the present invention;
  • FIG. 4 is a schematic diagram showing the execution effect of the code matching method shown in FIG. 3;
  • FIG. 5 is a schematic diagram of a usage scenario involved in a code processing method according to another embodiment of the present invention.
  • FIG. 6 is a flowchart of a method for processing a code according to another embodiment of the present invention.
  • Figure 7a is a schematic diagram of part of the code involved in the embodiment shown in Figure 6;
  • FIG. 7b is a schematic diagram of a processing result in which the common code is hoisted after the branches, according to the embodiment shown in FIG. 6;
  • FIG. 7c is a schematic diagram of another such processing result according to the embodiment shown in FIG. 6;
  • FIG. 7d is a schematic diagram of a processing result in which the common code is hoisted before the branches, according to the embodiment shown in FIG. 6;
  • FIG. 8 is a schematic diagram of a code processing method according to another embodiment of the present invention.
  • FIG. 9 is a schematic flow chart of a code processing method of the embodiment shown in FIG. 8;
  • FIG. 10 is a schematic diagram of comparison of processing results of different code processing methods according to another embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of hardware of a code processing device according to another embodiment of the present invention.
  • FIG. 12 is a schematic structural diagram of a code processing device according to another embodiment of the present invention.
  • the GPU, also known as the display core, visual processor, or display chip, is a microprocessor that performs image computation on personal computers, workstations, game consoles, and some mobile devices (such as tablets and smartphones).
  • the purpose of the GPU is to convert and drive the display information required by the computer system and to provide line-scan signals to the display, controlling the display so that it shows images correctly.
  • GPU has been widely used in general computing because of its increasingly powerful computing power.
  • a large number of programs in different fields are accelerated by GPUs, such as traditional computationally intensive scientific computing, file systems, network systems, database systems, and cloud computing.
  • a GPU consists of several identical but mutually independent components. Advanced Micro Devices (AMD) refers to such a component as a Compute Unit (CU), while NVIDIA Corporation (NVIDIA) calls it a Streaming Multiprocessor (SM).
  • the threads of a GPU kernel are scheduled to execute on a particular CU or SM. All CUs or SMs share GPU memory and the L2 cache, but other storage resources such as registers, shared memory (Local Data Share, LDS, or shared memory) and computing components are independent of each other.
  • a GPU program can be divided into two parts: host code and kernel code.
  • the host-side code is executed sequentially on a Central Processing Unit (CPU) and includes initialization of the kernel code context, calls for CPU/GPU data exchange, and the launch function for the GPU kernel.
  • the kernel code of a GPU program describes the behavior of a GPU thread.
  • a warp is a collection of threads.
  • the GPU acts as a coprocessor and creates a large number of threads when it receives a kernel launch call. These threads are organized hierarchically.
  • 64 (AMD GPU) or 32 (NVIDIA GPU) threads form a thread bundle.
  • in OpenCL the thread bundle is called a wavefront, and in CUDA it is called a warp.
  • FIG. 1 shows the organization of the OpenCL kernel on an AMD GPU.
  • Branch divergence is a common problem in GPU computing that introduces performance loss.
  • the identification number (ID) of each thread and the data it reads are not the same, so different judgment conditions may be obtained when a branch is encountered.
  • in that case, the warp serially executes all the branches that its member threads need to execute; this is called the branch divergence problem.
  • each branch is executed by all threads together, but the results of the threads unrelated to it are discarded, which reduces the parallelism and efficiency of execution. For example, a single-layer branch can reduce efficiency to 50%, while multi-level nested branches can cause an exponential slowdown.
  • Thread-Data Remapping (TDR) is a major software technique for resolving branch divergence. By adjusting the mapping between threads and input data, it makes the judgment conditions obtained by the threads in the same warp consistent, thereby eliminating the root cause of branch divergence.
  • TDR directly eliminates divergence: it attempts to remap threads in the same warp to data of the same kind so that they compute consistent judgment conditions. There are two specific means: one is to directly change the arrangement of the data, for example sorting the A array in FIG. 2; the other is to adjust the array subscripts that threads use when reading data, which likewise changes the data assigned to each thread.
  • TDR cannot eliminate all branch divergence, because the number of data items of each kind may not be divisible by the warp width.
  • in FIG. 2, assuming there are only 3 items of the data marked with a black background, even after they are remapped into the same warp, one thread in that warp still reads data marked with a white background, causing divergence.
  • such warps can appear on the boundaries between different kinds of data after TDR; this is the "boundary problem" of TDR. The more branches there are, the more serious the boundary problem becomes, resulting in wasted computation.
  • TDR techniques are roughly classified into on-GPU and on-CPU.
  • traditional on-CPU TDR generally processes the data input to the GPU by inserting code into the host-side code, sorting the data according to the branch conditions the data generate, so that threads in the same thread bundle obtain data of the same kind and therefore take the same branch.
  • Code matching (code sharing) is another major software technique for resolving branch divergence. This technique extracts identical instructions from branches (for example, the two branches under the same if statement) out of those branches, reducing the repeated operations caused by divergence and thereby mitigating its effects.
  • code matching is not limited to extracting exactly identical statements; it is a relatively mature compiler technique that can extract similar portions at the instruction level.
  • however, existing code matching only handles the similar parts of the two branches under the same if statement, and does not deeply optimize nested or multi-way branches.
  • as shown in FIG. 3, existing code matching first processes the inner branches, then the outer ones.
  • the code segment a shared by all four branches is extracted outside the entire branch structure.
  • however, the code segment u shared by branch 1 and branch 3, which lie under different if statements, is not processed by the prior art.
  • the code segment u is therefore left in both branch 1 and branch 3.
  • the threads of the two branches still need to execute it separately when divergence occurs.
  • the existing code matching method (that is, the method shown in FIG. 3) is referred to as same-layer matching, to distinguish it from the cross-layer matching involved in the code processing method provided by the embodiments of the present invention.
  • the if statement is a selection statement used to implement a two-branch selection structure.
  • the general form of the if statement is as follows:
  • conditional execution statement 1 and conditional execution statement 2 can each be a simple statement, a compound statement, or another if statement (that is, one or more if statements embedded within an if statement).
  • the conditional judgment can also be called an expression, and the conditional execution statement can also be called a statement.
  • an if statement that itself contains one or more if statements is called a nesting of if statements.
  • the general form of nested if statements is as follows:
  • FIG. 5 is a schematic diagram of a usage scenario involved in a code processing method according to an embodiment of the present invention.
  • the usage scenario is specifically a compilation execution scenario for the GPU program.
  • the GPU program (including the host-side code and the kernel code) is first processed by a Branch Acceleration with TDR and Code Hoisting (BATCH) module, which implements the code processing method provided by the embodiment of the present invention; the processed GPU program is then compiled and sent to the GPU device to run.
  • the module that executes the code processing method provided by the embodiment of the present invention is BATCH; it can be deployed as middleware between the application program and the GPU programming framework, or directly encapsulated in the GPU compiler.
  • BATCH is a module on the CPU; after the CPU reads the BATCH instructions, it can execute the code processing method provided by the embodiment of the present invention. Alternatively, some of the BATCH instructions can be executed by the GPU.
  • FIG. 5 is only a schematic illustration of a usage scenario involved in the code processing method of the embodiment of the present invention, and does not specifically limit the code processing method of the embodiment of the present invention.
  • the method of the embodiment of the present invention can also be applied to other usage scenarios.
  • FIG. 6 is a flowchart of a code processing method according to an embodiment of the present invention. The method is applicable to the scenario shown in FIG. 5. With reference to the foregoing content and FIG. 6, the code processing method of this embodiment of the present invention includes:
  • Step 601: Acquire the code to be executed.
  • the code to be executed includes a first branch code and a second branch code
  • the first branch code includes a third branch code and a fourth branch code
  • the second branch code and the third branch code include the same first execution logic
  • the fourth branch code does not include the first execution logic
  • because the first branch code includes the third branch code and the fourth branch code, the second branch code and the third branch code are mutually exclusive branch codes.
  • the code processing device acquires the code to be executed, which may include the first to fourth branch codes described above. It can be understood that the code to be executed may further include more branch codes, for example five, six, or more.
  • the execution of the branch code will generate a branch, and the branch code includes a conditional judgment formula and a conditional execution statement, and when the conditional judgment formula is satisfied, the conditional execution statement is executed.
  • the execution logic is the code in the conditional execution statement.
  • the same execution logic in different branch codes can be identical statements or statements that are similar at the instruction level. In this embodiment, the execution logic is described by taking the first execution logic and the second execution logic as examples.
  • because the second branch code and the third branch code include the same first execution logic, the first execution logic is redundant code in the code to be executed and can be subjected to the subsequent processing of the embodiment of the present invention.
  • the code processing device may analyze the code to be executed to determine branch codes that include the same execution logic and are mutually exclusive. Specifically, the code processing device may use a code parser or a gene-sequencing (sequence alignment) algorithm to analyze which branch codes of the code to be executed are mutually exclusive, pair the different branch codes two by two, and analyze whether the branch codes in each pair include the same execution logic, thereby determining the branch codes that include the same execution logic and are mutually exclusive. In this embodiment of the present invention, the second branch code and the third branch code are determined for the subsequent operations.
  • the code to be executed is kernel code of a graphics processing unit (GPU) program, and the first branch code, the second branch code, the third branch code, the fourth branch code, and the fifth branch code described below are all if statements.
  • the different branch codes in the code to be executed can be under the same if statement or under different if statements.
  • FIG. 7a shows a portion of the code to be executed (code segment a); executing the different branch codes in this code produces different branches.
  • in FIG. 7a, each of the two outer-layer branch codes includes two branch codes, so there are four inner-layer branch codes, and their execution produces four branches, branches 1 to 4.
  • the codes included in branches 1 to 4 are mutually exclusive branch codes: branch 1 and branch 2 are located under the same if statement, branch 3 and branch 4 are located under the same if statement, while branch 1 and branch 3, branch 1 and branch 4, branch 2 and branch 3, and branch 2 and branch 4 are each located under different if statements.
  • the conditional execution statement of branch 1 includes a code segment u, and the conditional execution statement of branch 3 also includes the code segment u; that is, branch 1 and branch 3 include the same code segment.
  • in Figure 7a, the if statement in which each branch is located includes an else clause. It can be understood that, in the embodiment of the present invention, the if statement in which a branch of the code to be executed is located may instead omit the else clause.
  • a branch code can also be a branch of a switch statement, a do statement, a for statement, or a while statement; these statements can be statements in the C language.
  • Step 602: Acquire a target conditional expression.
  • for the second branch code and the third branch code, the code processing device may perform cross-layer code merging, that is, extract the first execution logic that the two branch codes share in order to generate the fifth branch code. Generating the fifth branch code also requires a target conditional expression.
  • the target conditional expression can be used to control execution of the first execution logic in the fifth branch code.
  • different ways of acquiring the target conditional expression lead to different ways of generating the fifth branch code, so the specific implementations of step 602 and step 604 are described together in detail below.
  • Step 603: Extract the first execution logic from the second branch code and the third branch code.
  • the first execution logic is redundant code in the code to be executed: if the second branch code and the third branch code must be serially executed, as when the code to be executed is the kernel code of a GPU program, the first execution logic would be executed twice.
  • the code processing device may extract the first execution logic from the second branch code and the third branch code, so that afterwards neither the second branch code nor the third branch code includes the first execution logic.
  • here, "extract" can be understood as "cut out": extracting the execution logic of a branch code means cutting that execution logic out of the branch code. Equivalently, "extract" can be understood as "copy" plus "delete": the execution logic is first copied from the branch code and then deleted from it.
  • after step 603, neither the second branch code nor the third branch code includes the first execution logic, so the amount of code in both branch codes is reduced. When the two branch codes are serially executed, the time needed to execute them drops because the first execution logic has been extracted.
  • step 603 can be performed after step 602 or before step 602.
  • Step 604: Generate the fifth branch code using the target conditional expression and the first execution logic.
  • the target condition judgment formula is used to control execution of the first execution logic in the fifth branch code.
  • the code processing device may generate the fifth branch code using the target condition judgment formula and the first execution logic.
  • the execution of the first execution logic is thereby effected by executing the fifth branch code.
  • the code to be executed retains the first execution logic by including the fifth branch code.
  • the fifth branch code may be set outside the first branch code and the second branch code.
  • the fifth branch code is the newly generated branch code.
  • the fifth branch code is in the form of a branch, and the thread that needs to execute the first execution logic can be booted to execute the fifth branch code.
  • the target conditional expression controls execution of the first execution logic in the fifth branch code: when the target conditional expression is satisfied, the first execution logic in the fifth branch code is executed; when it is not satisfied, the first execution logic in the fifth branch code is not executed.
  • the fifth branch code can be an if statement.
  • because the second branch code and the third branch code no longer include the first execution logic, and the first execution logic now resides only in the newly generated fifth branch code, the number of copies of the first execution logic in the code to be executed is reduced from two to one, which reduces the code redundancy of the code to be executed.
  • extracting the same first execution logic included in the second branch code and the third branch code is a "merge same" operation; because the second branch code and the third branch code are mutually exclusive branches, this operation is called cross-layer merging. Below, "cross-layer merging" refers to the scheme of steps 602 to 604.
  • the specific implementation of step 604 depends on how the target conditional expression is obtained. Steps 602 and 604 in some embodiments of the present invention are described below; for details, refer to the following two examples.
  • Example 1: post-extraction merging (the fifth branch code is placed after the original branches).
  • in this implementation, step 602 specifically includes: setting an identifier at the location of the first execution logic in each of the second branch code and the third branch code, where the identifier is set to a specific value when the conditional expression controlling execution of the first execution logic in that branch code is satisfied; and then generating the target conditional expression using the identifier of the second branch code and the identifier of the third branch code.
  • correspondingly, step 604 specifically includes: generating, after the first branch code and the second branch code, the fifth branch code using the target conditional expression and the first execution logic, where the target conditional expression controls execution of the first execution logic when its identifier equals the specific value.
  • an identifier is set in a branch code at the position of the first execution logic, so the identifier records whether the conditional expression controlling the first execution logic in that branch code was satisfied: when it is satisfied, the identifier is set to the specific value. Outside the branch code, the identifier is read; if the identifier equals the specific value, the conditional expression controlling execution of the first execution logic in that branch code was also satisfied.
  • an identifier is set in each of the second branch code and the third branch code. The identifier may be a status flag bit (flag) or an integer variable; integer variable types include, but are not limited to, int, short int, long int, and the like.
  • the target conditional expression is generated using the identifier of the second branch code and the identifier of the third branch code, so that the target conditional expression can include the identifier of the second branch code and the identifier of the third branch code.
  • the fifth branch code generated using the target conditional expression and the first execution logic is placed after the first branch code and the second branch code, so that the fifth branch code is executed after them.
  • when the second branch code and the third branch code are executed, the conditional expressions that control the first execution logic are also evaluated. Because identifiers were set at the positions of the first execution logic in the second and third branch codes, the identifiers record the results of these conditional expressions: after the second branch code executes, its identifier may be set to the specific value, and likewise for the third branch code.
  • when the fifth branch code is then executed, its target conditional expression, which includes the identifier of the second branch code and the identifier of the third branch code, is evaluated. If execution of the second or third branch code set an identifier to the specific value, the target conditional expression is satisfied (a true event) and the first execution logic of the fifth branch code is executed. If neither identifier was set to the specific value, the target conditional expression is not satisfied (a false event) and the first execution logic of the fifth branch code is not executed.
  • in this way, the target conditional expression reuses the results of the conditional expressions that controlled the first execution logic in the second and third branch codes; those conditional expressions do not need to be computed again in the fifth branch code.
  • in one example, the identifier is a flag bit. The code segment a of Fig. 7b is obtained by performing the post-extraction operation on the code segment a in Fig. 7a. Code segment a comprises branches 1 to 4, where branch 1 and branch 3 include the same execution logic (code segment u). The code segment u is extracted from branch 1 and branch 3, and a flag is set at the position of code segment u in each.
  • the flag flag13 forms the target conditional expression controlling execution of code segment u. For example, when a thread executes branch 1 or branch 3, flag13 is set to true. When the thread then executes the new branch code, flag13 is true, the target conditional expression on flag13 is satisfied, and code segment u is executed. If a thread executes branch 2 or branch 4, flag13 keeps its initialized value false, and the thread does not execute code segment u in the new branch code.
  • in another example, the identifier is a variable of type int, named path in this example. For the code segment a in Fig. 7a, code segment u is extracted from branch 1 and branch 3, and the variable path is set at the position of code segment u. The variable path records the branch being taken and may also be called the branch number.
  • the path variable records which branch the current thread takes, and the target conditional expression of the new branch code checks whether path equals the number of branch 1 or branch 3.
  • in the first example, the identifier is a flag, a bool variable; one flag is generated for each merge operation. The flag is initialized to 0 and is set to 1 only in the branches from which the target execution logic was extracted. In the second example, the identifier is a variable of type int; in implementations where the identifier is an integer variable, register usage can be reduced.
  • Example 2: pre-extraction merging (the fifth branch code is placed before the original branches).
  • in this implementation, step 602 specifically includes: combining the first conditional expression and the second conditional expression to obtain the target conditional expression, where the first conditional expression controls execution of the first execution logic in the second branch code, and the second conditional expression controls execution of the first execution logic in the third branch code.
  • correspondingly, step 604 specifically includes: generating, before the first branch code and the second branch code, the fifth branch code using the target conditional expression and the first execution logic.
  • that the first conditional expression controls execution of the first execution logic in the second branch code means that, in the second branch code before extraction, the first execution logic is executed when the first conditional expression is satisfied. Likewise, in the third branch code before extraction, the first execution logic is executed when the second conditional expression is satisfied.
  • the code processing device may copy the first conditional expression and the second conditional expression from the second branch code and the third branch code while retaining them in those branch codes, and then combine the copies to obtain the target conditional expression, which is used to establish the fifth branch code.
  • the fifth branch code generated using the target conditional expression and the first execution logic is placed before the first branch code and the second branch code, so that when the code to be executed runs, a thread executes the fifth branch code before executing the first branch code and the second branch code.
  • the first conditional expression and the second conditional expression may be combined with a logical "OR" to obtain the target conditional expression. The target conditional expression is then evaluated first; if either the first conditional expression or the second conditional expression is satisfied, the first execution logic of the fifth branch code is executed.
  • in one example, the code segment a of FIG. 7d is obtained after pre-extraction is performed on the code segment a of FIG. 7a. The code segment a of FIG. 7d includes branches 1 to 4, where branch 1 and branch 3 include the same execution logic, code segment u. From branch 1, the conditional expression controlling execution of code segment u is copied, giving the first conditional expression "A[tid]&&tid%2"; from branch 3, the conditional expression controlling execution of code segment u is copied, giving the second conditional expression "!A[tid]&&tid...".
  • the code processing device also extracts the code segment u from branch 1 and branch 3, and then generates the new branch code using the target conditional expression and code segment u, placed before the first of the original branches.
  • in the post-extraction scheme, the target conditional expression is generated from identifiers; in the pre-extraction scheme, it is obtained by combining the first conditional expression and the second conditional expression, and is therefore often longer than the post-extraction target conditional expression. Consequently, the post-extraction scheme requires less computation for the target conditional expression than the pre-extraction scheme.
  • the above description of the code processing method of the embodiment of the present invention focuses on the cross-layer merging scheme. On this basis, the cross-layer scheme can also perform other warp-related processing to further reduce the execution time of the code to be executed. A specific implementation is as follows:
  • the method of this embodiment further includes: setting the thread that executes the second branch code and the thread that executes the third branch code in the same warp. This can be implemented by TDR (thread-data remapping). Because the second branch code and the third branch code include the same first execution logic, the two branch codes may be arranged adjacently during TDR, so that the thread executing the second branch code and the thread executing the third branch code fall into the same warp.
  • for example, when the code to be executed is the kernel code of a GPU program, the conventional on-CPU TDR scheme can, because the second branch code and the third branch code include the same first execution logic, produce a branch arrangement scheme in which the second branch code and the third branch code are arranged adjacent to each other during TDR.
  • the code processing device inserts code into the host-side code of the GPU program to which the code to be executed belongs, processes the data input to the GPU, and sorts the data according to the branches the data will generate, thereby implementing TDR for the code to be executed so that the thread executing the second branch code and the thread executing the third branch code are in the same warp. For a detailed implementation of TDR, refer to the detailed description in the technical terminology section "5. TDR" above.
  • combining the cross-layer merging scheme with the TDR scheme to process the code to be executed further reduces its execution time and speeds up execution of the branch code.
  • TDR is used to remap threads in the same warp to data of the same type, but the number of data items of each type may not be divisible by the warp width, so threads in the same warp are not necessarily remapped to the same type of data; in particular, different types of data may meet at a warp boundary. During TDR, two branch codes that include the same first execution logic may be arranged as adjacent branches; this constitutes a branch arrangement scheme.
  • when TDR is required for the threads executing the code to be executed, the code processing device may perform TDR on the original branch codes of the code to be executed according to the branch arrangement scheme, where the original branch codes include the second branch code and the third branch code from which the first execution logic was extracted, so that the threads executing these two branch codes run adjacently.
  • after TDR, the threads executing the code to be executed may be assigned data of the same type within a warp, eliminating the effect of branch divergence. Alternatively, because of the "boundary problem", data of the same type cannot always fill a warp; even then, the thread executing the second branch code and the thread executing the third branch code are set in the same warp, and the amount of code the two branch codes must execute serially is reduced because the first execution logic has been extracted. Moreover, using TDR ensures that the newly generated fifth branch code does not introduce new divergence.
  • taking the results of FIGS. 7b to 7d as an example, suppose each thread is tagged according to the branch it takes: the thread with tag 1 executes branch 1, the thread with tag 2 executes branch 2, the thread with tag 3 executes branch 3, and the thread with tag 4 executes branch 4. The threads with tag 1 and the threads with tag 3 execute the extracted logic together. After TDR, threads with the same tag are effectively grouped together; equivalently, threads executing the branch codes from which the same execution logic was extracted are placed together. The threads that need to execute that logic in the newly generated branch code are thus grouped, so the newly generated branch code introduces no new divergence.
  • a thread executes the original branch codes and then the newly generated branch code, executing the relevant code only when its conditional expression is satisfied. For two threads in the same warp that execute two different branch codes, after the above code processing is performed on those branch codes, the two threads still execute separately in the original branch codes of the code to be executed; but in the new branch code (generated using the same first execution logic extracted from the two branch codes), both threads satisfy the target conditional expression at the same time, so they execute the extracted logic together.
  • the execution order of the step of setting the thread that executes the second branch code and the thread that executes the third branch code in the same warp is not limited; for example, it may be performed before step 602 or 603, or after step 604. Equivalently, it may be performed before or after the cross-layer merging.
  • for example, cross-layer merging may be performed on the code to be executed first, and TDR then sets the thread executing the second branch code and the thread executing the third branch code in the same warp. Alternatively, TDR may be performed on the code to be executed first, setting those threads in the same warp, and steps 602 to 604 performed afterwards. In the latter case, TDR is performed on the original branch codes of the code to be executed, which do not include the branch code newly generated by cross-layer merging.
  • the embodiment of the present invention further provides the following code processing method.
  • the code processing method of the embodiment of the present invention may further include: determining whether the overhead time is less than the first saving time, where the overhead time represents the execution time introduced by the target conditional expression and the first saving time represents the time required to execute the first execution logic. If the overhead time is less than the first saving time, step 603 is performed.
  • after step 604, the code to be executed contains one fewer copy of the first execution logic than it did at step 601, so the merged code spends less time executing the first execution logic; the first saving time represents the time required to execute the first execution logic once.
  • on the other hand, the merged code contains the additional fifth branch code, which retains the first execution logic to preserve the integrity of the information, and the target conditional expression of the fifth branch code adds execution time; the overhead time represents the execution time introduced by the target conditional expression.
  • the code processing device determines whether the overhead time is less than the first saving time. If so, the merged code requires less execution time than the unmerged code, and the device performs steps 603 and 604 to reduce redundant code and the execution time required by the code. If the overhead time is not less than the first saving time, steps 603 and 604 may be skipped.
  • to make this determination, the execution time introduced by the target conditional expression can be computed directly to obtain the overhead time, and the time required to execute the first execution logic can be computed to obtain the first saving time; the two are then compared. Alternatively, a parameter reflecting the overhead time may be compared with a parameter reflecting the first saving time. Such a converted determination simplifies the judgment and speeds it up. Two examples are described below.
  • in a first example, determining whether the overhead time is less than the first saving time includes: calculating the code length of the target conditional expression and the code length of the first execution logic, and determining whether the former is smaller than the latter. If the code length of the target conditional expression is smaller than that of the first execution logic, the overhead time is less than the first saving time.
  • both the target conditional expression and the first execution logic are code, and executing code takes time; code length strongly influences execution time, with longer code generally requiring more execution time than shorter code. The comparison of execution times can therefore be converted into a comparison of code lengths: if the code length of the target conditional expression is smaller than that of the first execution logic, the overhead time introduced by the target conditional expression is less than the first saving time; otherwise, the overhead time is not less than the first saving time.
  • this determination method is especially applicable to the pre-extraction scheme described above, because there the target conditional expression is obtained by combining the first conditional expression and the second conditional expression, so its execution time is dominated by the execution time of its own code.
  • in a second example, determining whether the overhead time is less than the first saving time includes: calculating the number of registers used by the target conditional expression, and determining whether that number is less than a preset register-count threshold. If the number of registers used by the target conditional expression is smaller than the preset threshold, the overhead time is less than the first saving time.
  • the target conditional expression may use registers, so the merged code increases register usage; for example, a flag bit increases the number of registers used. If the registers used by the code increase, the number of workgroups on each CU/SM decreases, causing a performance loss and increasing the execution time required by the code. If the execution-time increment caused by the extra registers exceeds the time required to execute the first execution logic once, the merged code runs slower than before merging.
  • whether the increment exceeds that time can be checked by setting a register-count threshold: if the number of registers used by the target conditional expression is below the preset threshold, the execution-time increment caused by those registers is less than the time required to execute the first execution logic once, that is, the overhead time is less than the first saving time; otherwise, the overhead time is not less than the first saving time.
  • the code processing device can check how many registers are declared in the specific instructions, thereby detecting register usage.
  • this second implementation is particularly suitable for the variant of the post-extraction scheme in which the identifier is a flag bit.
  • the code processing method described above applies to two branch codes in the code to be executed, and also to scenarios in which the code to be executed includes multiple pairs of branch codes: as long as a pair of branch codes includes the same execution logic, the code processing method of the above embodiments may be performed on that pair. The method of the embodiment shown in FIG. 6 may be performed separately on each pair of branch codes.
  • the order in which cross-layer merging is performed on different pairs of branch codes may be unrestricted, or it may be defined, for example by performing the method shown in FIG. 6 first on the pair of branch codes that saves the most time. Specific examples are as follows:
  • Example 1: In the embodiment shown in FIG. 6, the second branch code and the fourth branch code include the same second execution logic. Before step 603, the method of this embodiment therefore further includes: calculating the first saving time and the second saving time, where the first saving time represents the time required to execute the first execution logic and the second saving time represents the time required to execute the second execution logic. Step 603 then specifically includes: extracting the first execution logic from the second branch code and the third branch code when the first saving time is greater than the second saving time.
  • the first execution logic can be extracted from the second branch code and the third branch code, and a new branch code generated using it, reducing the code redundancy produced by the first execution logic and cutting the execution time of the code to be executed by the first saving time. Because the second branch code and the fourth branch code include the same second execution logic, the code processing device could instead extract the second execution logic from them and generate another new branch code, cutting the execution time by the second saving time. When the first saving time is greater than the second saving time, extracting the first execution logic reduces the execution time more, so the code processing device extracts the first execution logic from the second branch code and the third branch code.
• A parser or a sequence-alignment algorithm (such as those used in gene sequencing) can be used to identify the same execution logic of each pair of branch codes, and a time-calculation model can be used to estimate the execution time of that logic.
• The execution time required by the logic can be calculated according to specific hardware parameters, such as the GPU clock frequency, the read/write latency of the video memory, and the like.
• This execution time is also the time saved by the cross-layer extraction, that is, the saving time of that pair of branch codes.
• Example 1 shows a saving-time comparison between two pairs of branch codes. The method of Example 1 can also be applied to scenarios with multiple pairs of branch codes, for example by cyclically applying the scheme of Example 1 to select the most time-saving pair of branch codes for cross-layer extraction each time, as shown in Examples 2 and 3.
• Example 2: When the code to be executed includes multiple pairs of branch codes, where the pairs are mutually exclusive branch codes and the two branch codes in each pair include the same execution logic (that is, the two branch codes in each pair are similar to the second branch code and the third branch code of the embodiment shown in FIG. 6), the saving time of each pair of branch codes may be calculated first. The saving time of a pair is the execution time required by the execution logic shared by the two branch codes of that pair. Then, the method of the embodiment shown in FIG. 6 (that is, the method performed on the second branch code and the third branch code) is executed for the pair with the longest saving time. In this way, extraction is performed first on the pair of branch codes that reduces the execution time the most.
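The selection in Example 2 can be sketched as follows. This is a minimal illustration assuming a hypothetical cost model in which the saving time of each pair has already been estimated (the embodiments leave the time-calculation model open); the `branch_pair` structure and its field names are invented for the sketch:

```c
#include <stddef.h>

/* Hypothetical descriptor for a pair of mutually exclusive branch codes:
 * saving_ns holds the estimated execution time of the execution logic
 * shared by the two branch codes of the pair (its "saving time"). */
struct branch_pair {
    int id;
    long saving_ns;
};

/* Return the index of the pair with the longest saving time; the method
 * of FIG. 6 would then be applied to that pair first. */
size_t pick_longest_saving(const struct branch_pair *pairs, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (pairs[i].saving_ns > pairs[best].saving_ns)
            best = i;
    return best;
}
```

Example 3 then amounts to repeating this selection over the remaining pairs after each extraction, with affected saving times recalculated.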
• Example 3: Following the method of Example 2, if extracting the execution logic affects the saving time of a remaining pair of branch codes, the saving time of that pair is recalculated. Then, among the remaining pairs, the pair with the longest saving time is determined and the method shown in FIG. 6 is performed on it, and so on, forming a loop process.
• For the pair of branch codes with the longest saving time, it may first be determined whether the overhead time of the pair is less than its saving time. If the overhead time of the pair is less than the time saved by the pair, the cross-layer extraction of the embodiment shown in FIG. 6 is performed on the pair; otherwise the cross-layer extraction of the pair is not performed, and, further, cross-layer extraction may be skipped for the other pairs that save less time.
• After cross-layer extraction is performed on the pair of branch codes with the longest saving time, if another pair includes one of the branch codes of the extracted pair, the saving time of that pair can be recalculated after the execution logic is extracted. Then, among the pairs that have not yet undergone cross-layer extraction, the pair with the longest saving time is determined and cross-layer extraction is performed on it.
• The above Examples 1 to 3 may further adjust the threads. For example, in Example 1, when the first saving time is greater than the second saving time, the threads executing the second branch code and the threads executing the third branch code are set in the same warp.
• In Example 2, after the pair of branch codes with the longest saving time is determined, the pair of threads executing that pair is set in the same warp. In Example 3, during the loop process, each time the pair with the longest saving time is determined, a branch arrangement scheme is updated in which the pair of threads executing that pair of branch codes is set in the same warp: the pair of threads corresponding to the pair of branch codes with the longer saving time is placed in the same warp first, and then the pair of threads corresponding to the pair with the shorter saving time is placed in the same warp.
• In other words, the threads that execute pairs of branch codes subjected to cross-layer extraction are set in the same warp in order of saving time from long to short.
• The cross-layer extraction in Examples 1 to 3 may use the post-extraction scheme, because in the post-extraction scheme the target condition judgment formulas of the generated branch codes have essentially the same form, so the overhead times of the different pairs of branch codes can be considered equal. In that case, performing cross-layer extraction on the pair with the longest saving time first reduces the execution time of the code the most, and performing cross-layer extraction on the different pairs in order of saving time from long to short ensures that the pair that reduces the execution time the most is extracted preferentially. If the code processing method of this embodiment further includes a TDR processing scheme, the pair of threads corresponding to the pair of branch codes that reduces the execution time the most is preferentially arranged adjacently during TDR.
• Before step 602 is performed, for example, when the conditional execution statements of two branch codes located under the same if statement respectively include the same execution logic, the same execution logic is first extracted out of those two branch codes.
• In the embodiment of the present invention, the code to be executed is obtained, where the code to be executed includes a first branch code and a second branch code, the first branch code includes a third branch code and a fourth branch code, the second branch code and the third branch code include the same first execution logic, and the fourth branch code does not include the first execution logic. Since the second branch code and the third branch code include the same first execution logic, the code to be executed contains two copies of the first execution logic, which is redundant code.
• The target condition judgment formula is acquired, the first execution logic in the second branch code and the third branch code is extracted, and the fifth branch code is then generated using the target condition judgment formula and the first execution logic, the target condition judgment formula being used to control execution of the first execution logic in the fifth branch code.
• Because of the extraction, the first execution logic is no longer included in the second branch code and the third branch code, but is included in the generated fifth branch code; the first execution logic is thus retained in the code to be executed while the number of its copies is reduced from two to one, reducing the redundant code in the code to be executed.
• The following describes the code processing method of each embodiment, taking as an example that the code to be executed is the kernel code of a GPU program, the branch codes are branch codes of if statements, and instructions serve as the execution logic.
• The scenario targeted by the code processing method is the compilation and execution of GPU code.
• The user-written GPU code (including the host-side code and the kernel code) is processed by the method of the embodiment of the present invention, and then compiled and executed as usual.
• The method can be applied in a unit such as a BATCH module, where the BATCH module includes a same-layer merging module, a cross-layer merging module, and a TDR module.
  • the main points of the code processing method of the embodiment of the present invention are:
• The same-layer merging module performs same-layer extraction on the first kernel code (the kernel code shown in FIG. 8): for the two branch codes under each if statement, it extracts the same instructions included in the two branch codes, thereby generating the second kernel code.
• The cross-layer merging module analyzes the second kernel code for opportunities to extract branch codes under different if statements together. If two branch codes under different if statements include the same instruction, the same instruction is extracted from the two branches (cross-layer extraction), and after the extraction the same instruction is used to generate a new branch, thereby obtaining the third kernel code.
• In addition, the cross-layer merging module records that this pair of branch codes needs to be adjacent in the TDR branch arrangement. Therefore, besides generating the third kernel code, the cross-layer merging module also produces a TDR branch arrangement scheme.
• The TDR module completes the TDR according to the branch arrangement scheme obtained by the cross-layer merging module, making the necessary modifications to the data, the host code, or the third kernel code.
  • the code processing method of the embodiment of the present invention includes:
  • Step 901 Acquire kernel code of the GPU program.
  • the code processing device obtains the kernel code of the GPU program written by the user.
• The kernel code includes at least two branch codes, and execution of a branch code generates a branch; in other words, the kernel code includes at least two branches.
• The branch codes included in the kernel code are branch codes of if statements. These branch codes can be located under the same if statement or under different if statements, and the if statements to which they belong may or may not be nested.
  • Figure 7a shows a portion of the code (code segment a) in the kernel code, the code segment a comprising a plurality of branch codes, with four branches being indicated.
• The kernel code of the GPU is first processed with conventional extraction: the same instructions in the branch codes under the same if statement are successively extracted out of those branch codes, using the method shown in FIG. 3.
• The two branch codes located under the same if statement may be: one branch code executed when the conditional judgment formula of the if statement is satisfied, and another branch code executed when the conditional judgment formula of the if statement is not satisfied (for example, the branch of the else clause). Such an if statement can also be called an if-else statement.
• When the conditional execution statements of the two branch codes respectively include the same instruction, the same instruction is extracted out of the two branch codes.
• When the if statement is further nested with one or more if statements, and the conditional execution statements of the two branch codes under the same if statement also include the same instruction in the nested if statements, the same instruction is likewise extracted out of the two branch codes; the same instructions in the two branch codes under the same if statement can be extracted one by one, from the inside out.
• This instruction extraction method can be called same-layer extraction. Same-layer extraction need not generate a new branch, because the two branches extracted together are located under the same if statement, and the threads that execute the two branches execute the code before and after the if statement together, so it is only necessary to place the same instruction outside the if statement.
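As an illustration only (the embodiments do not give concrete kernel source), the same-layer extraction described above can be pictured on a C-like branch: both branch codes of one if-else contain the same instruction `t = a * b;`, which is hoisted outside the if statement without creating a new branch. The two functions are hypothetical stand-ins for the kernel fragment before and after extraction:

```c
/* Before same-layer extraction: both branches of one if statement
 * contain the same instruction t = a * b. */
int before_extraction(int cond, int a, int b) {
    int t, y;
    if (cond) {
        t = a * b;      /* same instruction, copy 1 */
        y = t + 1;
    } else {
        t = a * b;      /* same instruction, copy 2 */
        y = t - 1;
    }
    return y;
}

/* After same-layer extraction: the shared instruction is placed outside
 * the if statement. No new branch is needed, because the threads that
 * execute the two branches execute the code before the if statement
 * together anyway. */
int after_extraction(int cond, int a, int b) {
    int t = a * b;      /* single shared copy */
    int y;
    if (cond)
        y = t + 1;
    else
        y = t - 1;
    return y;
}
```

Both versions compute the same result; only the number of copies of the shared instruction changes.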
  • Step 902 Pair the branch codes located under different if statements to obtain at least two branch groups.
  • Each branch group includes two branch codes located under different if statements.
• Branch codes located under different if statements are branch codes whose directly enclosing if statements differ; the if statement to which a branch code directly belongs is the innermost if statement containing it.
  • Step 903 Determine at least two target branch groups from at least two branch groups, and calculate a saving time of each target branch group.
• The target branch groups belong to the at least two branch groups of step 902, with the distinction that the two branch codes of a target branch group include the same target instruction; the execution time required by the target instruction of a target branch group is the saving time of that target branch group.
• The two branch codes included in a target branch group can be regarded as the second branch code and the third branch code of the embodiment shown in FIG. 6, except that in this embodiment the first execution logic is the target instruction.
• Steps 902 and 903 are implemented by pairing the branch codes located under different if statements in twos to obtain multiple branch groups, then determining the target branch groups, using a conventional parser to identify their same instructions, and estimating the execution time required by these instructions; this execution time is also the time saved after cross-layer extraction is performed.
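The pairing of step 902 can be sketched as enumerating all two-element combinations of branch codes whose innermost (directly enclosing) if statements differ. The `innermost_if` ids below are hypothetical labels assigned by a front end, not part of the embodiments:

```c
#include <stddef.h>

/* Count the branch groups of step 902: unordered pairs of branch codes
 * that sit under different innermost if statements. Pairs under the
 * same if statement are excluded (they were handled by same-layer
 * extraction). */
size_t count_branch_groups(const int *innermost_if, size_t n) {
    size_t groups = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (innermost_if[i] != innermost_if[j])
                groups++;
    return groups;
}
```

A real implementation would record the pairs themselves rather than only count them; the same double loop applies.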
• Next, a loop process may be performed to carry out cross-layer extraction for the target branch groups.
  • the specific process is as follows:
  • Step 904 Determine, from the at least two target branch groups, the same branch group that saves the longest time.
• The code processing device determines, from the at least two target branch groups, the branch group with the longest saving time, that is, the same branch group; moreover, it can be marked in the TDR branch arrangement scheme that the two branch codes of the same branch group are adjacent.
• The same branch group is one of the at least two target branch groups and, among them, has the longest saving time.
• In step 905, it is determined whether the overhead time of the same branch group is less than the saving time of the same branch group; if it is, step 906 is performed.
• The code processing device determines whether the overhead time of the same branch group is less than its saving time. If it is, step 906 is performed to carry out the cross-layer extraction. If it is not, cross-layer extraction is not performed on the same branch group, the method of the embodiment of the present invention jumps out of the loop process, and the remaining target branch groups do not undergo cross-layer extraction, because the saving times of the remaining target branch groups are successively smaller and are unlikely to exceed the overhead time.
• The overhead time of the same branch group indicates the execution time of the target condition judgment formula of the same branch group, and the saving time of the same branch group indicates the execution time required by the target instruction of the same branch group.
• In the cross-layer extraction, a new branch code is generated according to the target instruction; the new branch code is an if statement and includes the target condition judgment formula, which controls execution of the target instruction.
• The target condition judgment formula of the new branch code causes the GPU to incur execution time when executing the kernel code, and this execution time is the overhead time of the same branch group.
• The overhead time generated by the target condition judgment formula mainly arises from two aspects:
• First, executing the target condition judgment formula itself requires execution time, and code length reflects the execution time a piece of code requires. It can therefore be judged whether the code length of the target condition judgment formula is smaller than the code length of the target instruction of the same branch group. If it is, the overhead time of the same branch group is smaller than the saving time of the same branch group and cross-layer extraction may be performed; otherwise, cross-layer extraction is not performed.
• Second, the target condition judgment formula may use registers, increasing the number of registers used by the kernel code. If the number of registers used exceeds a register-count threshold, the number of workgroups resident on each CU/SM of the GPU is reduced, causing performance loss. It can therefore be determined whether the number of registers used by the target condition judgment formula is less than a preset register-count threshold; if it is, the overhead time of the same branch group is smaller than the saving time of the same branch group.
• Alternatively, the code processing device may calculate the overhead time from the target condition judgment formula after obtaining it. For the target condition judgment formula, refer to the detailed descriptions of the pre-extraction and post-extraction parts of the foregoing embodiment.
• Other judgment methods may also be used instead of directly comparing the overhead time and the saving time; the judgment result obtained by such other methods can still indicate whether the overhead time of the same branch group is smaller than the saving time of the same branch group.
• Under different judgment methods, the specific implementation of step 905 differs: the overhead time of the same branch group may be calculated and then compared with the saving time of the same branch group; or the code length of the target condition judgment formula of the branch group may be compared with the code length of the target instruction; or the number of registers used by the target condition judgment formula may be compared with the preset register-count threshold. For the specific implementations, refer to the above description.
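The three alternative checks of step 905 can be sketched as follows. The thresholds and the length metric are hypothetical: the embodiments only require that the chosen check indicate whether the overhead time is smaller than the saving time.

```c
#include <stdbool.h>

/* Alternative 1: directly compare estimated times (e.g. in nanoseconds). */
bool worthwhile_by_time(long overhead_ns, long saving_ns) {
    return overhead_ns < saving_ns;
}

/* Alternative 2: compare code lengths, using length as a proxy for
 * execution time; extract only if the target condition judgment formula
 * is shorter than the target instruction. */
bool worthwhile_by_length(int cond_len, int target_len) {
    return cond_len < target_len;
}

/* Alternative 3: check that the registers used by the target condition
 * judgment formula stay below a preset threshold, so the number of
 * workgroups resident on each CU/SM is not reduced. */
bool worthwhile_by_registers(int regs_used, int reg_threshold) {
    return regs_used < reg_threshold;
}
```

Any one of these checks can gate step 906; only the inputs the device must estimate differ.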
• Step 906: Perform cross-layer extraction on the same branch group.
• The cross-layer extraction is an implementation of the above steps 602 to 604. Specifically, the target condition judgment formula is acquired, the target instruction in the two branch codes of the same branch group is extracted, and the target condition judgment formula and the target instruction are then used to generate a new branch code, which is a branch code of an if statement.
• Cross-layer extraction includes two implementations: pre-extraction and post-extraction. For the detailed implementation process, refer to the detailed descriptions of "Example 1, post-extraction" and "Example 2, pre-extraction" above.
• In pre-extraction, the conditional judgment formulas of the two branch codes of the same branch group are copied and merged; in post-extraction, an identifier, such as a flag, is set at the location of the same target instruction in each of the two branch codes of the same branch group, and the identifier is then used in the newly generated branch code to control whether the same target instruction is executed.
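The post-extraction identifier scheme described above can be pictured with the hypothetical fragment below (variable names and the flag mechanism follow the description, not actual patent source). It assumes, as in Example 2, that the two branches are mutually exclusive and that the shared target instruction sits at the end of each branch, so it can be moved into a new branch code placed after the pair:

```c
/* Before cross-layer extraction: two branch codes under different if
 * statements (mutually exclusive: at most one of c1, c2 holds) both end
 * with the shared target instruction s += a. */
int before_cross_layer(int c1, int c2, int a, int s) {
    if (c1) {
        s *= 2;
        s += a;        /* target instruction, copy 1 */
    }
    if (c2) {
        s -= 3;
        s += a;        /* target instruction, copy 2 */
    }
    return s;
}

/* After post-extraction: each branch sets a flag at the position where
 * the target instruction used to be; a new branch code generated after
 * the pair uses the flag as its target condition judgment formula. */
int after_cross_layer(int c1, int c2, int a, int s) {
    int flag = 0;
    if (c1) {
        s *= 2;
        flag = 1;      /* identifier set at the instruction's position */
    }
    if (c2) {
        s -= 3;
        flag = 1;
    }
    if (flag)          /* new branch code controlling the target instruction */
        s += a;
    return s;
}
```

Under the mutual-exclusion assumption the two versions are equivalent, while the code to be executed now contains only one copy of the target instruction.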
• Step 907: Recalculate the saving time of any target branch group that includes a branch code of the same branch group.
• After the target instructions of the two branch codes of the same branch group are extracted, the same instructions of the target branch groups formed by those two branch codes together with other branch codes may be affected. Therefore, if a target branch group includes a branch code of the same branch group, its saving time needs to be recalculated so that the saving time remains accurate. The process then returns to the beginning of the loop and step 904 is re-executed; at this point the at least two target branch groups no longer include the already-extracted group, and the saving times of the target branch groups that include a branch code whose target instruction was extracted have been recalculated.
• During the above loop process, the TDR branch arrangement scheme is also gradually obtained.
• Each time the same branch group with the longest saving time is determined in step 904, it can be marked in the branch arrangement scheme that the two branch codes of that group are to be arranged adjacently. If the loop process is executed multiple times, then each time the group with the longest saving time is determined, the branch arrangement scheme marks that the two branch codes of the current loop's group are arranged adjacently after the two branch codes of the previous loop's group. Thus, when the loop process ends, a complete branch arrangement scheme is obtained.
• TDR is then performed on the original branch codes of the kernel code according to the branch arrangement scheme, so that the threads executing the two branch codes of the same branch group are set in the same warp; specifically, the threads executing the two branch codes of the group with the longer saving time are set in the same warp first, and then the threads executing the two branch codes of the group with the shorter saving time are set in the same warp.
• The TDR first ensures that threads executing the same branch code are set in the same warp; if a warp then still includes different threads executing different branch codes, those different threads execute the two branch codes of the same branch group.
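TDR itself can be pictured as a stable regrouping of the thread-to-data mapping by branch number, so that threads taking the same branch fall into the same warp, with the branch arrangement scheme fixing the order of the branches (for example, branch 1 immediately before branch 3). The sketch below is illustrative only; a production remapping would avoid the quadratic scan:

```c
#include <stddef.h>

/* Produce a new data-index order in which all elements taking the same
 * branch are contiguous, visiting branches in the order given by the
 * branch arrangement scheme. Threads then access data through remap[],
 * so each warp sees (mostly) a single branch. */
void tdr_remap(const int *branch_of, size_t n,
               const int *order, size_t n_branches, size_t *remap) {
    size_t out = 0;
    for (size_t b = 0; b < n_branches; b++)
        for (size_t i = 0; i < n; i++)
            if (branch_of[i] == order[b])
                remap[out++] = i;
}
```

With the arrangement scheme `{1, 3, 2, 4}`, the data items of branches 1 and 3 (the extracted pair) end up adjacent, so their threads land in the same or neighboring warps.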
• Here, the kernel code includes the original branch codes and the new branch code: the new branch code is the branch code generated using the target condition judgment formula and the target instruction, and the original branch codes are the branch codes present in the kernel code at the time of the cross-layer extraction.
• The new branch code introduced outside the original branch codes by cross-layer extraction does not require additional TDR: after TDR is performed on the original branch codes, the new branch code no longer introduces additional divergence.
• In other embodiments of the present invention, it is also possible to determine different target branch groups from the foregoing at least two target branch groups in order of saving time from long to short, then perform TDR on the kernel code so that the two branch codes of each of these target branch groups are arranged adjacently in order of saving time from long to short, and then perform cross-layer extraction on the two branch codes of those target branch groups.
• This embodiment sets the two threads that execute a target branch group subjected to cross-layer extraction in the same warp and extracts the target instruction, so that the code segments the warp must execute serially are reduced.
• The method of the embodiment of the present invention combines TDR and code extraction: it realizes cross-layer extraction of the branch codes under different if statements, and uses TDR so that the newly generated branch code does not introduce new divergence. In turn, the cross-layer extraction guides the branch arrangement in the TDR, so that the two threads executing the two branch codes subjected to cross-layer extraction can be located in the same warp; and extracting the same target instructions of the two branch codes reduces the time for the two threads to execute them, alleviating the "boundary problem" of TDR.
  • the "boundary problem" may occur after the TDR.
  • Two long-running two-branch code is more likely to appear on the boundary than the two-branch code that saves time, thus cross-layering the two branch codes to minimize the target instructions that the thread bundle needs to serially read. length.
• As described above, the code segment a includes four branches, wherein branch 1 and branch 3 include the same code segment u; assume that code segment u requires an execution time of 1 ms (millisecond). The numerals 1, 2, 3, and 4 shown in FIG. 10 are the numbers of the branches.
• The code segment u of branch 1 and branch 3 is extracted, and a branch arrangement scheme in which branch 1 and branch 3 are arranged adjacently is obtained. TDR is then performed according to the branch arrangement scheme, so that the branches are remapped in the order branch 1, branch 3, branch 2, branch 4. The obtained result is shown in FIG. 10: at this time, the threads that need to execute code segment u are concentrated in the first two warps, and executing code segment u takes only 2 ms.
• The TDR can be regarded as a kind of sorting in which the threads of the same branch are naturally attributed to one warp; the method of the embodiment shown in FIG. 9 further determines the adjacency relationship between different branches through the branch sorting information.
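The 2 ms figure of FIG. 10 can be checked with a small count. This sketch uses a hypothetical warp size of 8 and 32 threads split evenly among the four branches (the actual warp/wave sizes are 32 or 64 threads, as noted earlier); code segment u lives in branches 1 and 3, and each warp containing at least one such thread serializes one 1 ms execution of u:

```c
#include <stddef.h>

/* Count the warps that contain at least one thread of branch 1 or
 * branch 3 (the branches that include code segment u); each such warp
 * must execute u once, costing 1 ms per warp. */
int warps_executing_u(const int *branch_of, size_t n, size_t warp_size) {
    int warps = 0;
    for (size_t w = 0; w < n; w += warp_size) {
        int has_u = 0;
        for (size_t i = w; i < w + warp_size && i < n; i++)
            if (branch_of[i] == 1 || branch_of[i] == 3)
                has_u = 1;
        warps += has_u;
    }
    return warps;
}
```

With the branches interleaved (no TDR) every warp contains u-threads, so u costs 4 ms; after remapping in the order 1, 3, 2, 4, the u-threads occupy exactly the first two warps, giving the 2 ms of FIG. 10.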
  • FIG. 11 is a schematic structural diagram of hardware of a code processing device according to an embodiment of the present invention.
  • the code processing device can be used to perform the code processing methods of the various embodiments shown in Figures 6, 8, and 9 above.
  • the code processing device is also referred to as a computing device.
  • the code processing device includes a central processing unit (CPU), a GPU, a memory, and a bus.
  • the CPU, GPU, and memory are communicatively coupled via a bus.
• The code processing device may be a smart phone, a personal digital assistant (PDA), a tablet computer, a laptop computer, a computer, a server, and the like.
  • the embodiment of the present invention is described by taking a code processing device as a computer as an example.
• The CPU is an integrated circuit that serves as the computing core (Core) and control unit (Control Unit) of a computer. Its function is mainly to interpret computer instructions and to process data in computer software.
• The CPU is the control center of the computer. It connects the various parts of the entire computer through various interfaces and lines, executes the various functions of the computer and processes data by running or executing the software programs and/or modules stored in the memory and calling the data stored in the memory, thereby monitoring the computer as a whole.
  • the CPU may implement or perform various illustrative logical blocks, modules and circuits described in connection with the present disclosure.
• The CPU can also be a combination of computing functions, for example including a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on.
  • the memory can be used to store software programs and modules, and the CPU executes various functional applications and data processing of the computer by running software programs and modules stored in the memory.
  • the memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function, and the like; the storage data area may store data created according to usage of the computer, and the like.
• The memory may include volatile memory, such as random access memory (RAM); non-volatile random access memory (NVRAM), phase change random access memory (PRAM), and magnetoresistive random access memory (MRAM); non-volatile memory, such as at least one disk storage device, read-only memory (ROM), and electrically erasable programmable read-only memory (EEPROM); flash memory devices, such as NOR flash memory or NAND flash memory; and semiconductor devices, such as solid state drives (Solid State Disk, SSD), and so on.
• A bus is a common communication trunk that transfers information between the various functional components of a computer; it is a transmission harness composed of wires. According to the type of information transmitted, the buses of a computer can be divided into a data bus, an address bus, and a control bus, which are used to transmit data, data addresses, and control signals, respectively.
• The various components of the computer are connected by the bus, and external devices are connected to the bus through corresponding interface circuits, thereby forming the computer hardware system.
  • the computer may further include an input/output (I/O) interface, an output unit such as a display, an input unit such as a keyboard, and other units, and details are not described herein.
  • the CPU may be configured to: acquire code to be executed, where the code to be executed includes a first branch code and a second branch code, where the first branch code includes a third branch code and a fourth branch code, The second branch code and the third branch code include the same first execution logic, the fourth branch code does not include the first execution logic; the target condition judgment formula is acquired; and the first execution logic in the second branch code and the third branch code is extracted; The fifth branch code is generated using the target conditional expression and the first execution logic for controlling execution of the first execution logic in the fifth branch code.
• The CPU may be further configured to: before extracting the first execution logic in the second branch code and the third branch code, determine whether the overhead time is less than the first saving time, where the overhead time indicates the execution time generated by the target condition judgment formula and the first saving time represents the time required to execute the first execution logic; and, if the overhead time is less than the first saving time, perform the step of extracting the first execution logic in the second branch code and the third branch code.
• The CPU may be further configured to: calculate the code length of the target condition judgment formula and the code length of the first execution logic; and determine whether the code length of the target condition judgment formula is smaller than the code length of the first execution logic, where the code length of the target condition judgment formula being smaller than the code length of the first execution logic indicates that the overhead time is less than the first saving time.
• The CPU may be further configured to: calculate the number of registers used by the target condition judgment formula; and determine whether the number of registers used by the target condition judgment formula is less than a preset register-count threshold, where the number of registers used by the target condition judgment formula being less than the preset register-count threshold indicates that the overhead time is less than the first saving time.
  • the CPU may be further configured to: set an identifier at the position of the first execution logic in each of the second branch code and the third branch code, where the identifier is set to a specific value when the conditional expression controlling execution of the first execution logic in the branch code where the identifier is located is satisfied; generate the target conditional expression using the identifier of the second branch code and the identifier of the third branch code; and generate, after the first branch code and the second branch code, the fifth branch code using the target conditional expression and the first execution logic, where the target conditional expression controls execution of the first execution logic when its identifier satisfies the specific value.
  • the CPU may be further configured to: merge a first conditional expression and a second conditional expression to obtain the target conditional expression, where the first conditional expression controls execution of the first execution logic in the second branch code and the second conditional expression controls execution of the first execution logic in the third branch code; and generate the fifth branch code, using the target conditional expression and the first execution logic, before the first branch code and the second branch code.
  • the CPU may be further configured to: place the thread executing the second branch code and the thread executing the third branch code in the same warp.
  • the second branch code and the fourth branch code include the same second execution logic; the CPU may be further configured to: before extracting the first execution logic from the second branch code and the third branch code, calculate a first saving time and a second saving time, where the first saving time is the time required to execute the first execution logic and the second saving time is the time required to execute the second execution logic; and extract the first execution logic from the second branch code and the third branch code when the first saving time is greater than the second saving time.
  • the CPU acquires code to be executed, where the code to be executed includes a first branch code and a second branch code, the first branch code includes a third branch code and a fourth branch code, the second branch code and the third branch code include the same first execution logic, and the fourth branch code does not include the first execution logic. Because the second branch code and the third branch code include the same first execution logic, the code to be executed contains two copies of the first execution logic, which is redundant code. The CPU therefore acquires the target conditional expression, extracts the first execution logic from the second branch code and the third branch code, and then generates a fifth branch code using the target conditional expression and the first execution logic, the target conditional expression controlling execution of the first execution logic in the fifth branch code. Because of the extraction, the second branch code and the third branch code no longer include the first execution logic, while the generated fifth branch code does; the first execution logic is thus retained in the code to be executed, its number of copies is reduced from two to one, and the redundant code is reduced.
  • FIG. 12 is a schematic structural diagram of a code processing device according to an embodiment of the present invention.
  • the code processing apparatus shown in FIG. 12 can be used to execute the code processing method of each of the embodiments shown in FIGS. 6, 8, and 9, and can be integrated on the code processing apparatus shown in FIG.
  • a code processing device includes:
  • the first acquiring unit 1201 is configured to acquire code to be executed, where the code to be executed includes a first branch code and a second branch code, the first branch code includes a third branch code and a fourth branch code, the second branch code and the third branch code include the same first execution logic, and the fourth branch code does not include the first execution logic;
  • the second acquiring unit 1202 is configured to acquire a target conditional expression;
  • the extracting unit 1203 is configured to extract the first execution logic from the second branch code and the third branch code;
  • the generating unit 1204 is configured to generate a fifth branch code using the target conditional expression and the first execution logic, where the target conditional expression controls execution of the first execution logic in the fifth branch code.
  • the code processing device of the embodiment of the present invention further includes a determining unit 1205;
  • the determining unit 1205 is configured to determine whether an overhead time is less than a first saving time, where the overhead time is the execution time introduced by the target conditional expression and the first saving time is the time required to execute the first execution logic;
  • the extracting unit 1203 is further configured to perform the step of extracting the first execution logic from the second branch code and the third branch code if the overhead time is less than the first saving time.
  • the determining unit 1205 includes a calculating module 1206 and a judging module 1207;
  • the calculating module 1206 is configured to calculate the code length of the target conditional expression and the code length of the first execution logic;
  • the judging module 1207 is configured to determine whether the code length of the target conditional expression is smaller than the code length of the first execution logic, where the former being smaller than the latter indicates that the overhead time is less than the first saving time.
  • alternatively, the determining unit 1205 includes a calculating module 1206 and a judging module 1207;
  • the calculating module 1206 is configured to calculate the number of registers used by the target conditional expression;
  • the judging module 1207 is configured to determine whether the number of registers used by the target conditional expression is less than a preset register-count threshold, where a register count below the threshold indicates that the overhead time is less than the first saving time.
  • the second acquiring unit 1202 includes a setting module 1208 and a generating module 1209;
  • the setting module 1208 is configured to set an identifier at the position of the first execution logic in each of the second branch code and the third branch code, where the identifier is set to a specific value when the conditional expression controlling execution of the first execution logic in the branch code where the identifier is located is satisfied;
  • the generating module 1209 is configured to generate the target conditional expression using the identifier of the second branch code and the identifier of the third branch code;
  • the generating unit 1204 is further configured to generate, after the first branch code and the second branch code, the fifth branch code using the target conditional expression and the first execution logic, where the target conditional expression controls execution of the first execution logic when its identifier satisfies the specific value.
  • alternatively, the second acquiring unit 1202 includes a merging module 1210;
  • the merging module 1210 is configured to merge a first conditional expression and a second conditional expression to obtain the target conditional expression, where the first conditional expression controls execution of the first execution logic in the second branch code and the second conditional expression controls execution of the first execution logic in the third branch code;
  • the generating unit 1204 is further configured to generate the fifth branch code, using the target conditional expression and the first execution logic, before the first branch code and the second branch code.
  • the code processing device of the embodiment of the present invention further includes a setting unit 1211;
  • the setting unit 1211 is configured to place the thread executing the second branch code and the thread executing the third branch code in the same warp.
  • the second branch code and the fourth branch code include the same second execution logic; the code processing device of the embodiment of the present invention further includes a calculating unit 1212;
  • the calculating unit 1212 is configured to calculate a first saving time and a second saving time, where the first saving time is the time required to execute the first execution logic and the second saving time is the time required to execute the second execution logic;
  • the extracting unit 1203 is further configured to extract the first execution logic from the second branch code and the third branch code when the first saving time is greater than the second saving time.
  • the code to be executed is kernel code of a GPU program, and the first branch code, the second branch code, the third branch code, the fourth branch code, and the fifth branch code are branch codes of if statements.
  • the first acquiring unit 1201 acquires code to be executed, where the code to be executed includes a first branch code and a second branch code, the first branch code includes a third branch code and a fourth branch code, the second branch code and the third branch code include the same first execution logic, and the fourth branch code does not include the first execution logic. Because the second branch code and the third branch code include the same first execution logic, the code to be executed contains two copies of the first execution logic, which is redundant code. To this end, the second acquiring unit 1202 acquires the target conditional expression, the extracting unit 1203 extracts the first execution logic from the second branch code and the third branch code, and the generating unit 1204 then generates a fifth branch code using the target conditional expression and the first execution logic, the target conditional expression controlling execution of the first execution logic in the fifth branch code. Because of the extraction, the second branch code and the third branch code no longer include the first execution logic, while the generated fifth branch code does; the first execution logic is thus retained in the code to be executed, its number of copies is reduced from two to one, and the redundant code is reduced.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions can be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another; for example, the computer instructions can be transferred from a website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, or microwave).
  • the computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center that integrates one or more available media.
  • the available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

Embodiments of the present invention provide a code processing method and a code processing device. The code processing method includes: acquiring code to be executed, where the code to be executed includes a first branch code and a second branch code, the first branch code includes a third branch code and a fourth branch code, the second branch code and the third branch code include the same first execution logic, and the fourth branch code does not include the first execution logic; acquiring a target conditional expression; extracting the first execution logic from the second branch code and the third branch code; and generating a fifth branch code using the target conditional expression and the first execution logic, where the target conditional expression controls execution of the first execution logic in the fifth branch code. The number of copies of the first execution logic in the code to be executed is thus reduced from two to one, reducing redundant code in the code to be executed.

Description

Code processing method and device

Technical Field

Embodiments of the present invention relate to the field of data processing, and in particular to a code processing method and a code processing device.
Background

When a graphics processing unit (GPU) receives a kernel launch call, it creates a large number of threads. For example, 64 or 32 threads form a thread bundle; in the Open Computing Language (OpenCL) the bundle is called a wavefront (wave for short), and in the Compute Unified Device Architecture (CUDA) it is called a warp. The threads of a bundle are bound together and issue a uniform instruction at every moment. When the threads in a bundle encounter a branch and their condition results differ, the bundle must, because of this instruction uniformity, serially execute the branches required by its member threads; this is known as the branch divergence problem. Each branch path is executed by all threads together, but the results of threads unrelated to the currently executing path are discarded, reducing parallelism and execution efficiency.

An existing solution to branch divergence is code hoisting, which extracts identical instructions on two branches under the same if statement to outside those branches, reducing the repeated execution of instructions during the serialization caused by divergence.

However, for two mutually exclusive branches, for example two branches under different if statements, existing code hoisting does nothing even if hoistable common instructions exist. The optimization effect is therefore still unsatisfactory, and the GPU must still serially execute long stretches of redundant code when branch divergence occurs.
Summary of the Invention

Embodiments of the present invention provide a code processing method and a code processing device for reducing redundant code.

A first aspect of the embodiments of the present invention provides a code processing method, including: acquiring code to be executed, where the code to be executed includes a first branch code and a second branch code, the first branch code includes a third branch code and a fourth branch code, the second branch code and the third branch code include the same first execution logic, and the fourth branch code does not include the first execution logic. The second branch code and the third branch code are mutually exclusive branch codes; when they must be executed serially, the first execution logic is executed twice, so the first execution logic is redundant code. To reduce this redundancy and the repeated execution of the first execution logic, a target conditional expression is acquired and the first execution logic is extracted from the second branch code and the third branch code, so that after the extraction neither branch code includes the first execution logic. A fifth branch code is then generated using the target conditional expression and the first execution logic, where the target conditional expression controls execution of the first execution logic in the fifth branch code.

In this way, because of the extraction, the second branch code and the third branch code no longer include the first execution logic, while the generated fifth branch code does, so the first execution logic is retained in the code to be executed. Its number of copies is reduced from two to one, reducing redundant code in the code to be executed.
With reference to the first aspect, in a first implementation of the first aspect, before extracting the first execution logic from the second branch code and the third branch code, the method further includes: determining whether an overhead time is less than a first saving time, where the overhead time is the execution time introduced by the target conditional expression and the first saving time is the time required to execute the first execution logic. Extracting the first execution logic reduces redundant code, but the target conditional expression used to generate the fifth branch code introduces execution time, so the processed code may end up needing more or less time overall. If the overhead time is less than the first saving time, that is, the code after generating the fifth branch code needs less execution time than the code before the first execution logic was extracted, the step of extracting the first execution logic from the second branch code and the third branch code is performed.

In this implementation, the extraction step is performed only when the overhead time is less than the first saving time, which not only reduces the redundant code of the code to be executed but also reduces its execution time.
With reference to the first implementation of the first aspect, in a second implementation of the first aspect, determining whether the overhead time is less than the first saving time includes: calculating the code length of the target conditional expression and the code length of the first execution logic, and then determining whether the former is smaller than the latter. Both the first execution logic and the target conditional expression are code; executing code takes time, and longer code generally takes longer to execute than shorter code. The code length of the target conditional expression being smaller than that of the first execution logic therefore indicates that the overhead time is less than the first saving time. Converting the time comparison into a length comparison simplifies the judgment, since code length is easy to determine.

With reference to the first implementation of the first aspect, in a third implementation of the first aspect, determining whether the overhead time is less than the first saving time includes: calculating the number of registers used by the target conditional expression, and then determining whether that number is less than a preset register-count threshold. The target conditional expression may use registers; if register usage exceeds the threshold, the number of workgroups per CU/SM decreases, causing a performance loss. If the number of registers used by the target conditional expression is below the preset threshold, the loss is small, so a register count below the threshold indicates that the overhead time is less than the first saving time. Converting the time comparison into a register-count check simplifies the judgment, since the number of registers used is easy to determine.
With reference to the first aspect or any one of its first to third implementations, in a fourth implementation of the first aspect, acquiring the target conditional expression includes: setting an identifier at the position of the first execution logic in each of the second branch code and the third branch code, where the identifier is set to a specific value when the conditional expression controlling execution of the first execution logic in the branch code where the identifier is located is satisfied. The identifier can thus record, within its branch code, the result of that controlling conditional expression. The identifier may be, for example, a status-flag (flag) bit or an integer variable. The target conditional expression is then generated using the identifier of the second branch code and the identifier of the third branch code.

Correspondingly, generating the fifth branch code using the target conditional expression and the first execution logic includes: generating, after the first branch code and the second branch code, the fifth branch code using the target conditional expression and the first execution logic, where the target conditional expression controls execution of the first execution logic when its identifier satisfies the aforementioned specific value. In other words, when a thread executes the fifth branch code, it executes the first execution logic of the fifth branch code if the identifier in the target conditional expression satisfies the specific value, and does not execute it otherwise.

In this way, because the identifier is set to the specific value when the conditional expression controlling the first execution logic in its branch code is satisfied, it records the result of that expression. After the target conditional expression is generated from the identifiers of the second and third branch codes, the results of the conditional expressions controlling the first execution logic in the second and third branch codes can be determined directly from the target conditional expression. Since the fifth branch code comes after the first and second branch codes, the second branch code and the first branch code (which contains the third branch code) execute first, so in the fifth branch code the target conditional expression can effectively control execution of the first execution logic.
With reference to the first aspect or any one of its first to third implementations, in a fifth implementation of the first aspect, acquiring the target conditional expression includes: merging a first conditional expression and a second conditional expression to obtain the target conditional expression, for example by logical OR. The first conditional expression controls execution of the first execution logic in the second branch code, and the second conditional expression controls execution of the first execution logic in the third branch code. After the merge, the first and second conditional expressions remain in the second and third branch codes.

Correspondingly, generating the fifth branch code using the target conditional expression and the first execution logic includes: generating the fifth branch code, using the target conditional expression and the first execution logic, before the first branch code and the second branch code.

In this way, because the target conditional expression is obtained by merging the first and second conditional expressions, the fifth branch code can be placed before the first and second branch codes; when the fifth branch code is executed, its target conditional expression alone determines whether to execute the first execution logic.
With reference to the first aspect or any one of its first to fifth implementations, in a sixth implementation of the first aspect, the method further includes: placing the thread that executes the second branch code and the thread that executes the third branch code in the same warp. This can be implemented by TDR.

In this way, when threads in the same warp must serially execute different branch codes, the threads executing the second and third branch codes can be placed in the same warp; because the first execution logic has been extracted from those two branch codes, the length of code they must execute serially is reduced, reducing the execution time of the second and third branch codes.
With reference to the first aspect or any one of its first to sixth implementations, in a seventh implementation of the first aspect, the second branch code and the fourth branch code include the same second execution logic. Before extracting the first execution logic from the second branch code and the third branch code, the method further includes: calculating a first saving time and a second saving time, where the first saving time is the time required to execute the first execution logic and the second saving time is the time required to execute the second execution logic. Because the first execution logic can be extracted from the second and third branch codes and the second execution logic can be extracted from the second and fourth branch codes, the pair of branch codes to extract from can be chosen by the length of the saving time.

Accordingly, extracting the first execution logic from the second branch code and the third branch code includes: extracting it when the first saving time is greater than the second saving time.

Extracting execution logic first from the pair of branch codes with the longest saving time reduces the execution time of the code to be executed the most.

With reference to the first aspect or any one of its first to seventh implementations, in an eighth implementation of the first aspect, the code to be executed is kernel code of a graphics processor (GPU) program, and the first, second, third, fourth, and fifth branch codes are branch codes of if statements. When kernel code encounters the branch divergence problem it must serially execute different branch codes; the method of this implementation reduces the amount of branch code that must be executed serially and improves execution efficiency.
A second aspect of the embodiments of the present invention provides a code processing device having the functionality to implement the above method. The functionality may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functionality.

In one possible implementation, the code processing device includes:

a first acquiring unit, configured to acquire code to be executed, where the code to be executed includes a first branch code and a second branch code, the first branch code includes a third branch code and a fourth branch code, the second branch code and the third branch code include the same first execution logic, and the fourth branch code does not include the first execution logic;

a second acquiring unit, configured to acquire a target conditional expression;

an extracting unit, configured to extract the first execution logic from the second branch code and the third branch code;

a generating unit, configured to generate a fifth branch code using the target conditional expression and the first execution logic, where the target conditional expression controls execution of the first execution logic in the fifth branch code.
In another possible implementation, the code processing device includes a processor;

the processor performs the following actions:

acquiring code to be executed, where the code to be executed includes a first branch code and a second branch code, the first branch code includes a third branch code and a fourth branch code, the second branch code and the third branch code include the same first execution logic, and the fourth branch code does not include the first execution logic;

acquiring a target conditional expression;

extracting the first execution logic from the second branch code and the third branch code;

generating a fifth branch code using the target conditional expression and the first execution logic, where the target conditional expression controls execution of the first execution logic in the fifth branch code.
A third aspect of the embodiments of the present invention provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the method of the first aspect or any implementation of the first aspect.

A fourth aspect of the embodiments of the present invention provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method of the first aspect or any implementation of the first aspect.

A fifth aspect of the embodiments of the present invention provides a computing device, including a processor, a memory, and a bus. The memory stores execution instructions, and the processor and the memory are connected by the bus; when the computing device runs, the processor executes the execution instructions stored in the memory, causing the computing device to perform the method of the first aspect or any implementation of the first aspect.

In the technical solutions provided by the embodiments of the present invention, code to be executed is acquired, where the code to be executed includes a first branch code and a second branch code, the first branch code includes a third branch code and a fourth branch code, the second branch code and the third branch code include the same first execution logic, and the fourth branch code does not include the first execution logic. Because the second branch code and the third branch code include the same first execution logic, the code to be executed contains two copies of the first execution logic, which is redundant code. To this end, a target conditional expression is acquired and the first execution logic is extracted from the second branch code and the third branch code; a fifth branch code is then generated using the target conditional expression and the first execution logic, the target conditional expression controlling execution of the first execution logic in the fifth branch code. Because of the extraction, the second branch code and the third branch code no longer include the first execution logic, while the generated fifth branch code does; the first execution logic is thus retained in the code to be executed, its number of copies is reduced from two to one, and the redundant code is reduced.
Brief Description of the Drawings

FIG. 1 is a schematic diagram of the organization of an OpenCL kernel on a GPU according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a TDR method according to another embodiment of the present invention;

FIG. 3 is a schematic diagram of a code hoisting method according to another embodiment of the present invention;

FIG. 4 is a schematic diagram of the execution effect of the code hoisting method shown in FIG. 3;

FIG. 5 is a schematic diagram of a usage scenario involved in a code processing method according to another embodiment of the present invention;

FIG. 6 is a flowchart of a code processing method according to another embodiment of the present invention;

FIG. 7a is a schematic diagram of part of the code involved in the embodiment shown in FIG. 6;

FIG. 7b is a schematic diagram of a post-hoisting result in the embodiment shown in FIG. 6;

FIG. 7c is a schematic diagram of another post-hoisting result in the embodiment shown in FIG. 6;

FIG. 7d is a schematic diagram of a pre-hoisting result in the embodiment shown in FIG. 6;

FIG. 8 is a schematic diagram of a code processing method according to another embodiment of the present invention;

FIG. 9 is a schematic flowchart of the code processing method of the embodiment shown in FIG. 8;

FIG. 10 is a schematic comparison of the results of different code processing methods according to another embodiment of the present invention;

FIG. 11 is a schematic diagram of the hardware structure of a code processing device according to another embodiment of the present invention;

FIG. 12 is a schematic structural diagram of a code processing device according to another embodiment of the present invention.
Detailed Description

The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The terms "first", "second", "third", "fourth", etc. (if any) in the specification, claims, and drawings of the present invention are used to distinguish similar objects and need not describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. Moreover, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device containing a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.

To facilitate understanding of the embodiments of the present invention, some technical terms involved in the embodiments are introduced first; the embodiments below may refer to the following introductions.
1. GPU

A GPU, also known as a display core, visual processor, or display chip, is a microprocessor that performs image computation on devices such as personal computers, workstations, game consoles, and some mobile devices (such as tablets and smartphones). Its purpose is to convert and drive the display information required by the computer system and to provide line-scan signals to the display, controlling correct display output.

As a massively parallel computing element with ever-growing computational power, the GPU has been widely applied to general-purpose computing. Many programs in different fields are GPU-accelerated, such as traditionally compute-intensive scientific computing, file systems, network systems, database systems, and cloud computing.

A GPU consists of several structurally identical but mutually independent elements, called Compute Units (CU) by Advanced Micro Devices (AMD) and Streaming Multiprocessors (SM) by NVIDIA Corporation. The threads of a GPU kernel are scheduled onto a CU or SM for execution. All CUs or SMs share GPU memory and the L2 cache, but other storage resources, such as registers, shared memory (Local Data Share, LDS, or shared memory), and compute components, are independent of one another.

2. GPU program

A GPU program can be divided into two parts: host code and kernel code. The host code executes sequentially on the central processing unit (CPU) and includes initialization of the kernel code's context, the CPU calls for data exchange between CPU and GPU, and the launch function of the GPU kernel. The kernel code of a GPU describes the behavior of one GPU thread.

3. Warp

A warp (thread bundle) is a collection of threads. As a coprocessor, the GPU creates a large number of threads when it receives a kernel launch call. These threads are organized hierarchically. In OpenCL, 64 (AMD GPUs) or 32 (NVIDIA GPUs) threads form a bundle, called a wavefront in OpenCL and a warp in CUDA.

When the threads of a warp execute on a CU/SM, they are bound together and issue a uniform instruction at every moment. Several warps form a workgroup. FIG. 1 shows the organization of an OpenCL kernel on an AMD GPU.
4. Branch divergence

Branch divergence is a common source of performance loss in GPU computing.

Each thread's identification (ID) and the data it reads differ, so threads obtain different condition results when they encounter a branch. When threads in the same warp need to execute different branches, the warp serially executes all branches its member threads require, because of the uniformity of instruction issue; this is called the branch divergence problem. Every branch is executed by all threads together, but the results of unrelated threads are discarded, which reduces parallelism and execution efficiency. For example, a single-level branch can cut efficiency to 50%, and nested branches can cause exponentially growing slowdown.
5. TDR

Thread-Data Remapping (TDR) is a major software technique for resolving branch divergence. By adjusting the mapping between threads and input data, it makes the threads in the same warp obtain consistent condition results, eliminating the root cause of branch divergence.

The concept of TDR is shown in FIG. 2, where A[] denotes array subscripts. TDR is a technique that directly eliminates divergence: it tries to remap the threads of a warp to data of the same kind so that they compute consistent condition results. There are two concrete means: one is to directly rearrange the data, for example sorting the array A in FIG. 2; the other is to adjust the array subscripts threads use when reading data, which likewise changes the data assigned to each thread.
However, TDR cannot eliminate all branch divergence, because the number of data items of each kind is not necessarily divisible by the warp width. Taking FIG. 2 as an example, suppose there are only 3 items of the dark-marked data; then even after they are remapped into the same warp, one thread of that warp still reads light-marked data, causing divergence. Such warps may appear at the boundaries between different kinds of data after TDR; this is TDR's "boundary problem". The more branch paths there are, the more serious the boundary problem, and the more computation is wasted.

TDR roughly comes in on-GPU and on-CPU variants. Traditional on-CPU TDR generally inserts code into the host code to process the data input to the GPU, sorting it by the branch conditions the data will produce, so that threads in the same warp obtain same-kind data and take the same branch.
6. Code hoisting

Code hoisting is another major software technique for mitigating branch divergence. It extracts identical instructions on branches (for example, the two branches under the same if statement) to outside the branches, reducing the repeated computation during the serialization caused by divergence and thereby lessening its impact.

The concept of code hoisting is shown in FIG. 3. Code hoisting does not attempt to eliminate divergence; instead, through compile-time modification of the code, it extracts identical instructions out of the branches as far as possible. When the hoisted code encounters divergence, the part that must be executed serially is shorter, so the performance loss caused by divergence is reduced.

Taking FIG. 3 as an example, when the original code encounters the branch, a divergent warp must execute the code segments a, b, a, c in sequence, executing segment a twice; after hoisting, the divergent warp executes only segments b and c in sequence, while segment a, being outside the branch, is executed only once in total.
Code hoisting does not only extract literally identical statements; it is a fairly mature compiler technique that can extract parts that are similar at the instruction level.

Existing code hoisting only extracts the similar parts of the two branches under the same if statement; it does not further optimize nested or multi-path branches.

The effect of existing code hoisting is shown in FIG. 4. For nested branches, the existing technique processes the inner branches first and then the outer ones. In FIG. 4, for example, the code segment a shared by all four paths is extracted outside the entire branch. However, between branches that are not under the same if statement, the existing technique does nothing even if hoistable parts exist. In FIG. 4, for instance, paths 1 and 3 both still contain the code segment u, yet on divergence the threads of these two branches must still execute separately.

To distinguish it from the cross-level hoisting involved in the code processing method provided by the embodiments of the present invention, the existing code hoisting method (i.e., the method shown in FIG. 3) is also referred to herein as same-level hoisting.
7. if statement

An if statement is a selection statement used to implement a two-branch selection structure. The general form of an if statement is:

"if (conditional expression) conditional statement 1

[else conditional statement 2]"

The part in square brackets (the else clause) is optional. Conditional statement 1 and conditional statement 2 may each be a simple statement, a compound statement, or another if statement (i.e., an if statement that contains one or more nested if statements). A conditional expression may also be called an expression, and a conditional statement may also be called a statement.

An if statement containing one or more if statements is called a nested if statement. The form (3) above is a nested if statement, whose general form is as follows:

[Figure PCTCN2017108003-appb-000001: general form of a nested if statement]
FIG. 5 is a schematic diagram of a usage scenario involved in a code processing method provided by an embodiment of the present invention, specifically a compile-and-execute scenario for a GPU program. As shown in FIG. 5, the GPU program (including host code and kernel code) is first processed by the Branch Acceleration with TDR and Code Hoisting (BATCH) module, which implements the code processing method of the embodiments of the present invention; the processed GPU program is then compiled and sent to the GPU device for execution.

The module executing the code processing method of the embodiments is BATCH, which can be applied as middleware between the application program and the GPU programming framework, or packaged directly into the GPU compiler. BATCH is a module on the CPU; after the CPU reads the instructions of BATCH, it can execute the code processing method of the embodiments. Alternatively, some BATCH instructions may be executed by the GPU.

It can be understood that FIG. 5 is only a schematic illustration of a usage scenario involved in the code processing method of the embodiments and does not specifically limit the method, which can also be applied in other usage scenarios.
FIG. 6 is a flowchart of a code processing method provided by an embodiment of the present invention; the method can be applied in the scenario shown in FIG. 5. With reference to the above and FIG. 6, the code processing method of this embodiment includes:

Step 601: Acquire code to be executed.

The code to be executed includes a first branch code and a second branch code; the first branch code includes a third branch code and a fourth branch code; the second branch code and the third branch code include the same first execution logic; and the fourth branch code does not include the first execution logic.

Because the first branch code and the second branch code belong to different branches, and the first branch code includes the third and fourth branch codes, the second branch code and the third branch code are mutually exclusive branch codes.

The code processing device acquires the code to be executed, which may include the first through fourth branch codes above; it can be understood that the code to be executed may also include more branch codes, for example five or six.

Executing branch code produces branch paths. A branch code includes a conditional expression and a conditional statement; when the conditional expression is satisfied, the conditional statement is executed. Execution logic is the code in the conditional statement; the same execution logic in different branch codes may be literally identical statements or statements that are similar at the instruction level. In this embodiment, execution logic is illustrated with the first execution logic and the second execution logic.

Because the second branch code and the third branch code include the same first execution logic, that logic is redundant code in the code to be executed and can undergo the subsequent processing of this embodiment.

In some embodiments, the code processing device may analyze the code to be executed to identify mutually exclusive branch codes that include the same execution logic. Specifically, the device may use a code parser or a gene sequencing algorithm to determine which branch codes of the code to be executed are mutually exclusive, pair different branch codes two by two, and determine whether the branch codes in each pair include the same execution logic. Mutually exclusive branch codes including the same execution logic are thus identified; in this embodiment, the second branch code and the third branch code are identified for the subsequent operations.

Optionally, the code to be executed is kernel code of a GPU program, and the first through fourth branch codes, as well as the fifth branch code described below, are branch codes of if statements. Different branch codes in the code to be executed may lie under the same if statement or under different if statements. When the GPU executes the kernel code and threads of the same warp execute different branch codes, the branch divergence problem arises.
For example, FIG. 7a shows part of the code to be executed (code segment a); executing different branch codes produces different paths. In FIG. 7a, each of the two outer branch codes contains two branch codes, giving four inner branch codes whose execution produces four paths, paths 1 to 4; the code of paths 1 to 4 constitutes mutually distinct branch codes. Paths 1 and 2 lie under the same if statement, as do paths 3 and 4; paths 1 and 3, paths 1 and 4, paths 2 and 3, and paths 2 and 4 lie under different if statements. The conditional statement of path 1 includes code segment u, and so does the conditional statement of path 3; this segment u is the same first execution logic included in the conditional statements of paths 1 and 3.

In FIG. 7a, the if statements of the paths all include else clauses; it can be understood that the if statements of the paths in the code to be executed of the embodiments may also lack else clauses.

It can be understood that branch code may also be branch code of a switch, do, for, or while statement; these may be statements of the C language.

It can be understood that the method of the embodiments can also be applied to other single-instruction-multiple-data architectures, such as Xeon Phi (a processor introduced by Intel), so the code to be executed may also be code for other architectures.
Step 602: Acquire a target conditional expression.

Because the second branch code and the third branch code include the same first execution logic, the code processing device can perform cross-level hoisting on the two branch codes, that is, extract the same first execution logic from them to generate a fifth branch code. To generate the fifth branch code, a target conditional expression must also be acquired; it is used to control execution of the first execution logic in the fifth branch code.

There are several ways to acquire the target conditional expression. In some embodiments, different ways of acquiring it lead to different ways of generating the fifth branch code, so the concrete implementations of steps 602 and 604 are described together in detail below.

Step 603: Extract the first execution logic from the second branch code and the third branch code.

Because the second branch code and the third branch code include the same first execution logic, it is redundant code in the code to be executed. If the two branch codes must be executed serially, for example when the code to be executed is kernel code of a GPU program, the first execution logic is executed twice. To reduce the amount of code the second and third branch codes must execute serially, the code processing device can extract the first execution logic from them, so that neither branch code includes it.

It can be understood that "extract" may mean "cut", i.e., extracting the execution logic of a branch code is cutting it out of the branch code; or "copy" plus "delete", i.e., copying the execution logic from the branch code and then deleting it from the branch code.

After step 603, the first execution logic has been extracted, so the second branch code and the third branch code no longer include it and their code size is reduced; when the two branch codes are executed serially, their execution time is reduced because the first execution logic has been extracted.

It can be understood that step 603 may be performed before or after step 602.
Step 604: Generate a fifth branch code using the target conditional expression and the first execution logic.

The target conditional expression controls execution of the first execution logic in the fifth branch code.

After obtaining the target conditional expression and the first execution logic, the code processing device can generate the fifth branch code from them, so the first execution logic is executed through execution of the fifth branch code. The code to be executed thus retains the first execution logic because it includes the fifth branch code. The fifth branch code may be placed outside the first and second branch codes; it is a newly generated branch code, and putting the logic in branch form guides the threads that need to execute the first execution logic to execute the fifth branch code. When the target conditional expression is satisfied, the first execution logic in the fifth branch code is executed; when it is not satisfied, the first execution logic in the fifth branch code is not executed.

The fifth branch code may be an if statement.

Through the above steps, the second branch code and the third branch code of the code to be executed no longer include the first execution logic, which resides in the newly generated fifth branch code; the number of copies of the first execution logic is reduced from two to one, reducing the redundancy of the code to be executed.

In this embodiment, the same first execution logic included in the second and third branch codes is extracted; this is a hoisting operation. Because the second and third branch codes are mutually exclusive, the operation can be called cross-level hoisting; in the embodiments of the present invention, cross-level hoisting denotes the scheme of steps 602 to 604.

In some embodiments, the concrete implementation of step 604 is related to the target conditional expression. Steps 602 and 604 in some embodiments are described in detail below through the following two examples.
Example 1: post-hoisting.

In this example, step 602 specifically includes: setting an identifier at the position of the first execution logic in each of the second branch code and the third branch code, where the identifier is set to a specific value when the conditional expression controlling execution of the first execution logic in the branch code where the identifier is located is satisfied; then generating the target conditional expression using the identifier of the second branch code and the identifier of the third branch code.

Correspondingly, step 604 specifically includes: generating, after the first branch code and the second branch code, the fifth branch code using the target conditional expression and the first execution logic, where the target conditional expression controls execution of the first execution logic when its identifier satisfies the specific value.

When an identifier is set in a branch code, it records that branch code; because it is located at the position of the first execution logic, it is set to a specific value in that branch code when the conditional expression controlling the first execution logic is satisfied. When the identifier is read outside the branch code, a value matching the specific value indicates that the controlling conditional expression of that branch code was also satisfied. In this embodiment, identifiers are set in the second branch code and the third branch code.

The identifier can be implemented in several ways, for example as a status-flag (flag) bit or an integer variable; integer types include but are not limited to int, short int, and long int.

Generating the target conditional expression using the identifiers of the second and third branch codes allows the target conditional expression to include both identifiers.

In this example, the fifth branch code generated from the target conditional expression and the first execution logic comes after the first and second branch codes, so it executes after them. When the first and second branch codes are executed, the conditional expressions controlling the first execution logic in the second and third branch codes are also evaluated. Because identifiers were set at the positions of the first execution logic in the second and third branch codes, setting the specific value lets the identifiers record the results of those conditional expressions. After the second and third branch codes have executed, their identifiers have been set accordingly. The fifth branch code then executes; its target conditional expression includes both identifiers and controls execution of the first execution logic when an identifier satisfies the specific value. If execution of the second or third branch code gave the identifier the specific value, the target conditional expression is true and the first execution logic of the fifth branch code is executed; if the second and third branch codes were not executed, the identifier was not set to the specific value, the target conditional expression is false, and the first execution logic of the fifth branch code is not executed.

Because the fifth branch code executes after the first and second branch codes, the evaluation of the target conditional expression can reuse the results of the conditional expressions controlling the first execution logic in the second and third branch codes, without recomputing those expressions in the fifth branch code.

In one example, referring to FIG. 7b, the identifier is a flag bit. Code segment a of FIG. 7b is obtained by post-hoisting code segment a of FIG. 7a. In FIG. 7a, code segment a includes paths 1 to 4, where paths 1 and 3 include the same execution logic (code segment u); segment u is extracted from paths 1 and 3, and a flag bit is set at segment u's position in each. After obtaining the flag bit and segment u, the code processing device generates, after the branch codes of paths 1-4 from which segment u was extracted, the new branch code "if (flag13) code segment u"; the processing result is shown in FIG. 7b. In the new branch code, flag13 is the target conditional expression controlling execution of segment u. For example, when a thread executes path 1 or path 3, flag13 becomes true; when the thread reaches the new branch code, the target conditional expression flag13 is satisfied, so segment u is executed. If a thread executes path 2 or path 4, flag13 keeps its initialized value false, and the thread does not execute segment u in the new branch code.
In another example, referring to FIG. 7c, the identifier is an int variable, here named path. For code segment a of FIG. 7a, segment u is extracted from paths 1 and 3, and the variable path is set at segment u's position in each; path records which path the code belongs to and may also be called a path number. As shown in FIG. 7c, path 1 is recorded by path=1 and path 3 by path=3. After obtaining the identifiers path=1 and path=3, as well as segment u, the code processing device generates, after the branch codes of paths 1-4 from which segment u was extracted, the new branch code "if (path==1 || path==3) code segment u"; the processing result is shown in FIG. 7c. In the new branch code, "path==1 || path==3" is the target conditional expression controlling execution of segment u; the identifiers path=1 and path=3 are in a logical-OR relation. When a thread executes code segment a of FIG. 7c, the path variable records which path the current thread took; when path in the target conditional expression equals the path value of path 1 or path 3, the conditional statement of the new branch code (code segment u) is executed. For example, when a thread executes the branch code of path 1, path=1, so when the thread executes the new branch code, path==1 in the target conditional expression is satisfied and segment u is executed.

In the two examples above, one uses a flag bit as identifier: flag is a Boolean variable, one flag is produced per hoist, initialized to 0 and set to 1 only inside the target paths from which the target instructions are extracted. The other uses an int variable as identifier; the integer-variable implementation can reduce register usage.
Example 2: pre-hoisting.

In this example, step 602 specifically includes: merging a first conditional expression and a second conditional expression to obtain the target conditional expression, where the first conditional expression controls execution of the first execution logic in the second branch code, and the second conditional expression controls execution of the first execution logic in the third branch code;

Correspondingly, step 604 specifically includes: generating the fifth branch code, using the target conditional expression and the first execution logic, before the first branch code and the second branch code.

That the first conditional expression controls execution of the first execution logic in the second branch code means that, in the second branch code before the extraction, the first execution logic is executed when the first conditional expression is satisfied. That the second conditional expression controls execution of the first execution logic in the third branch code means that, in the third branch code before the extraction, the first execution logic is executed when the second conditional expression is satisfied.

Specifically, the code processing device can copy the first and second conditional expressions from the second and third branch codes, leaving the originals in place, and merge the copies to obtain the target conditional expression.

The merged target conditional expression is used to build the fifth branch code, which is placed before the first and second branch codes; when executing the code, a thread executes the fifth branch code first and then the first and second branch codes.

The first and second conditional expressions can be merged by logical OR to obtain the target conditional expression. When a thread executes the fifth branch code, it first evaluates the target conditional expression; if either of the first and second conditional expressions is satisfied, the first execution logic of the fifth branch code is executed.

For example, FIG. 7d is a concrete example of pre-hoisting; pre-hoisting code segment a of FIG. 7a yields code segment a of FIG. 7d. Specifically, code segment a of FIG. 7a includes paths 1 to 4, where paths 1 and 3 include the same execution logic, code segment u. From path 1, the conditional expression controlling execution of segment u is copied, giving the first conditional expression "A[tid]&&tid%2"; from path 3, the conditional expression controlling execution of segment u is copied, giving the second conditional expression "!A[tid]&&tid%2". Merging the two gives the target conditional expression "(A[tid]&&tid%2)||(!A[tid]&&tid%2)". The code processing device also extracts segment u from paths 1 and 3. It then generates, before the first main branch from which the target instructions were extracted, a first secondary branch:

"if((A[tid]&&tid%2)||(!A[tid]&&tid%2))
  code segment u".
Comparing the post-hoisting and pre-hoisting schemes above: in post-hoisting, the target conditional expression is generated using identifiers; in pre-hoisting, it is obtained by merging the first and second conditional expressions and is usually longer than the post-hoisting one. Post-hoisting can therefore reduce the computation of the target conditional expression compared with pre-hoisting.
The description of the code processing method of the embodiments above focuses on the cross-level hoisting scheme. That scheme can also be combined with further processing in warp scenarios to further reduce the execution time of the code to be executed. Concrete implementations are as follows.

Optionally, in the embodiment shown in FIG. 6, the method further includes: placing the thread that executes the second branch code and the thread that executes the third branch code in the same warp.

There are several ways to do this. For example, once it is determined that the second and third branch codes include the same first execution logic, the threads executing them are placed in the same warp.

Alternatively, this can be implemented by TDR. Specifically, because the second and third branch codes include the same first execution logic, during TDR the second and third branch codes can be arranged as adjacent paths, so that the threads executing them are placed in the same warp.

For example, when the code to be executed is kernel code of a GPU program, in the traditional on-CPU TDR scheme, because the second and third branch codes include the same first execution logic, a path-arrangement plan placing them adjacently during TDR can be obtained. During TDR, the code processing device inserts code into the host code of the GPU program to which the code to be executed belongs, processes the data input to the GPU, and sorts it according to the branch conditions the data will produce, thereby applying TDR to the code to be executed; following the aforementioned path-arrangement plan, the threads executing the second branch code and the threads executing the third branch code are placed in the same warp.

For the concrete implementation of TDR, see the detailed description in "5. TDR" in the terminology introduction above. In this example, the cross-level hoisting scheme and the TDR scheme are combined to process the code to be executed, reducing its execution time and speeding up the execution of branch code.

As described in the TDR introduction above, the TDR scheme has a "boundary problem": TDR remaps the threads of a warp to same-kind data, but because the number of data items of each kind is not necessarily divisible by the warp width, the threads of a warp cannot always be remapped to same-kind data; this is especially likely at the boundaries between different kinds of data in a warp.

However, in this embodiment, even if the boundary problem occurs, in other words the two threads executing the second and third branch codes are in the same warp and the two branch codes must be executed serially, if the cross-level hoisting of the above embodiments has been applied to the second and third branch codes, the extraction of the first execution logic reduces the amount of code they must execute serially, mitigating TDR's boundary problem.

In this embodiment, two branch codes that include the same first execution logic can be arranged as adjacent paths during TDR; this is the path-arrangement plan. When TDR needs to be applied to the threads executing the code to be executed, the code processing device can apply TDR to the original branch codes of the code to be executed according to this plan. The original branch codes include the branch codes from which the first execution logic was extracted, but not the fifth branch code generated using the target conditional expression and the first execution logic. The device arranges adjacently the threads executing the second and third branch codes from which the first execution logic was extracted. In this way, threads executing the code to be executed may be assigned same-kind data within a warp, eliminating the effect of branch divergence; or, when the boundary problem prevents same-kind data from being assigned within a warp, the threads executing the second and third branch codes are still placed in the same warp, and thanks to the extraction of the first execution logic the amount of code the two branch codes must execute serially is reduced.

In addition, TDR ensures that the newly generated fifth branch code introduces no new divergence. For example, applying the code processing method of the embodiments to code segment a of FIG. 7a yields the results of FIGS. 7b to 7d. If threads are labeled by the path they take, for example label-1 threads execute path 1, label-2 threads path 2, label-3 threads path 3, and label-4 threads path 4, then only the label-1 and label-3 threads execute the newly generated branch code together. After TDR, threads with the same label are grouped together, or equivalently, threads executing branch codes from which the same execution logic was extracted are grouped together. The threads that need to execute the execution logic of the newly generated branch code are grouped together, so the new branch code introduces no new divergence.

Threads execute both the original branch codes and the newly generated branch code; a thread executes the relevant code when it computes that the conditional expression is satisfied. For two threads in the same warp executing two different branch codes, after the code processing of this embodiment the two threads execute separately in the original branch codes of the code to be executed; but in the new branch code (generated using the same first execution logic extracted from the two branch codes), both threads simultaneously satisfy the target conditional expression of the new branch code, so they execute the hoisted execution logic together.

It can be understood that the embodiments do not limit the order of the step of placing the threads executing the second and third branch codes in the same warp: it may be performed before step 602 or 603, or after step 604; equivalently, before or after the cross-level hoisting. For example, after step 604, TDR is applied to the code to be executed, placing the threads executing the second and third branch codes in the same warp; or, before step 602, TDR is applied first, and steps 602 to 604 are then performed. Here, applying TDR to the code to be executed specifically means applying it to the original branch codes of the code to be executed, which do not include the branch code newly generated by cross-level hoisting.
通过图6所示的实施例的执行,可以减少待执行代码的冗余代码,但是图6所示实施例的方法并不能保证包括第五分支代码的待执行代码所需的执行时间比未提取第一执行逻辑的待执行代码所需的执行时间少。为了既能减少待执行代码的冗余代码,也能减少待执行代码所需的执行时间,本发明实施例还提供了如下代码处理方法。
可选地,在本发明的一些实施例中,在步骤603之前,本发明实施例的代码处理方法还包括:判断开销时间是否小于第一节约时间,其中,开销时间表示目标条件判断式产生的执行时间,第一节约时间表示执行第一执行逻辑所需的时间。若开销时间小于第一节约时间,则执行步骤603。
换言之,在从目标支路中提取目标指令之前,先判断执行跨层取同是否会减少代码所需的执行时间,若能减少,则执行跨层取同方案。
在执行步骤604后,第一执行逻辑在待执行代码中的数量比在步骤601的待执行代码中的数量减少了一份,从而跨层取同后的待执行代码比步骤601的待执行代码减少了执行一次第一执行逻辑的时间,第一节约时间即表示执行第一执行逻辑所需的时间。
另一方面,在执行步骤604后,待执行代码比步骤601的待执行代码多了第五分支代码,以保留第一执行逻辑,保证信息的完整性,第五分支代码的目标条件判断式使得跨层取同后的待执行代码增加了执行时间,用开销时间表示目标条件判断式产生的执行时间。
在本实施例中,在步骤603之前,代码处理设备判断开销时间是否小于第一节约时间,若开销时间小于第一节约时间,表示跨层取同后的待执行代码所需的执行时间小于跨层取同前的待执行代码所需的执行时间,从而代码处理设备执行步骤603和604。以减少代码的冗余代码,且减少代码所需的执行时间。若开销时间不小于第一节约时间,则可以不执行步骤603,以及步骤604。
关于判断开销时间是否小于第一节约时间的步骤的具体实现方式有多种,例如可以直接计算出目标条件判断式产生的执行时间,从而得到开销时间;以及计算出执行一次第一执行逻辑所需的时间,从而得到第一节约时间。然后,使用开销时间和第一节约时间进行比较。
在其它实施例中,还可以使用其它能反映目标条件判断式产生的执行时间的参数,与能反映执行一次第一执行逻辑所需的时间的参数进行比较,或者将开销时间和第一节约时间的比较转化为其它的比较方式。在有的实施例中,这样的转化判断方式可以简化判断的执行,从而提高执行速度。下面对此举出其中两个示例进行说明。
例一:
判断开销时间是否小于第一节约时间的步骤,具体包括:计算目标条件判断式的代码长度和第一执行逻辑的代码长度,然后,判断目标条件判断式的代码长度是否小于第一执行逻辑的代码长度。
其中,目标条件判断式的代码长度小于第一执行逻辑的代码长度表示开销时间小于第一节约时间。
目标条件判断式和第一执行逻辑皆为代码,代码的执行需要时间,代码的长度对代码所需的执行时间影响较大,长度长的代码所需的执行时间比长度短的代码所需的执行时间长。从而可以将代码产生的执行时间的比较转化为代码长度的比较。
具体来说,在代码处理设备获取到目标条件判断式和读取到第一执行逻辑后,即可计算出目标条件判断式的代码长度和第一执行逻辑的代码长度,若目标条件判断式的代码长度小于第一执行逻辑的代码长度,表示目标条件判断式产生的开销时间小于第一节约时间;反之,开销时间不小于第一节约时间。
这种判断方式尤其适用于上述的前置取同的方案,因前置取同中,目标条件判断式由第一条件判断式和第二条件判断式合并得到,从而目标条件判断式产生的执行时间主要为目标条件判断式的代码所需的执行时间。
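作为例一的一个极简示意(以下为主机端可运行的C语言草图,函数名与示例字符串均为此处的假设,并非实际的编译器实现),可将开销时间与第一节约时间的比较转化为代码长度的比较:

```c
#include <string.h>

/* 示意:以代码长度近似执行时间。cond_code为目标条件判断式的代码,
   logic_code为第一执行逻辑的代码;长度更短的代码所需执行时间更短。 */
int cost_less_than_saving_by_length(const char *cond_code, const char *logic_code) {
    /* 目标条件判断式的代码长度小于第一执行逻辑的代码长度,
       即表示开销时间小于第一节约时间 */
    return strlen(cond_code) < strlen(logic_code);
}
```

例如,合并后的条件判断式较短而可取同的执行逻辑较长时,该函数返回1,表示可以执行跨层取同。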
例二:
判断开销时间是否小于第一节约时间的步骤,具体包括:计算目标条件判断式使用的寄存器数量,然后,判断目标条件判断式使用的寄存器数量是否小于预设寄存器数量阈值。
其中,目标条件判断式使用的寄存器数量小于预设寄存器数量阈值表示开销时间小于第一节约时间。
目标条件判断式有可能使用到寄存器,从而使得跨层取同后的待执行代码增加了寄存器的使用量,例如在上述的后置取同的方案中,flag标志位即导致寄存器使用数量的增加。若代码使用的寄存器的数量增加,会减少每个CU/SM上的workgroup数目,从而造成性能损失,增加代码所需的执行时间。若寄存器数量的增加导致的执行时间增量大于执行一次第一执行逻辑所需的时间,则待执行代码在跨层取同后所需的执行时间大于跨层取同前所需的执行时间。寄存器数量的增加导致的执行时间增量是否大于执行一次第一执行逻辑所需的时间,可以通过寄存器数量阈值的设定来判断。又因待执行代码在跨层取同后所需的执行时间增量由目标条件判断式使用的寄存器数量导致,从而,若目标条件判断式使用的寄存器数量小于预设寄存器数量阈值,表示目标条件判断式使用的寄存器数量导致的执行时间增量小于执行一次第一执行逻辑所需的时间,即开销时间小于第一节约时间;反之,开销时间不小于第一节约时间。
关于使用的寄存器的数量的计算方式,具体来说,因GPU上声明的私有数据基本都是以寄存器的形式保存,代码处理设备可以检查具体的指令当中声明了多少寄存器,从而检测出寄存器的使用量。
因flag标志位的使用会占用较多的寄存器,从而例二的实现方法尤其适用于上述的“后置取同”方案中的标识符为flag标志位的方案。
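作为例二的一个极简示意(C语言;伪指令文本中以r0、r1等形式表示寄存器,该格式与函数名均为此处的假设),可统计目标条件判断式声明的寄存器数量并与预设阈值比较:

```c
#include <ctype.h>

/* 统计一段伪指令文本中出现的不同寄存器(形如r0、r1……)的数量,
   用于示意“检查指令当中声明了多少寄存器” */
int count_registers(const char *code) {
    int seen[64] = {0}, n = 0;
    for (const char *p = code; *p; p++) {
        if (*p == 'r' && isdigit((unsigned char)p[1])) {
            int idx = 0;
            const char *q = p + 1;
            while (isdigit((unsigned char)*q)) idx = idx * 10 + (*q++ - '0');
            if (idx < 64 && !seen[idx]) { seen[idx] = 1; n++; }
        }
    }
    return n;
}

/* 目标条件判断式使用的寄存器数量小于预设阈值,
   即表示开销时间小于第一节约时间 */
int cost_less_than_saving_by_regs(const char *cond_code, int reg_threshold) {
    return count_registers(cond_code) < reg_threshold;
}
```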
上文描述的代码处理方法可以应用于待执行代码中的两条分支代码,也可以应用于待执行代码包括多对分支代码的场景中,只要一对分支代码包括可取同的执行逻辑,即可对该对分支代码执行上述各实施例提供的代码处理方法。
在本发明的一些实施例中,在待执行代码包括多对可提取执行逻辑的分支代码时,可以分别对该多对分支代码中的每一对分别执行图6所示实施例的方法,对不同对的分支代码的跨层取同的执行顺序可以不作限定。
在本发明的另一些实施例中,为了能进一步减少待执行代码所需的执行时间,可以对不同对的分支代码的跨层取同的执行顺序进行限定,例如,先对节约时间最大的一对分支代码执行图6所示的方法。具体示例如下:
示例一:在图6所示的实施例中,第二分支代码和第四分支代码包括相同的第二执行逻辑。从而步骤603之前,本实施例的方法还包括:计算第一节约时间和第二节约时间,其中,第一节约时间表示执行第一执行逻辑所需的时间,第二节约时间表示执行第二执行逻辑所需的时间。
相应地,步骤603具体包括:当第一节约时间大于第二节约时间时,提取第二分支代码和第三分支代码中的第一执行逻辑。
因第二分支代码和第三分支代码包括相同的第一执行逻辑,从而可以对第二分支代码和第三分支代码提取第一执行逻辑,使用第一执行逻辑生成新的分支代码,减少第一执行逻辑产生的代码冗余,此时待执行代码减少的执行时间为第一节约时间。类似的,第二分支代码和第四分支代码包括相同的第二执行逻辑,代码处理设备也可以对第二分支代码和第四分支代码提取第二执行逻辑,并使用第二执行逻辑生成新的另一分支代码,减少第二执行逻辑产生的代码冗余,此时待执行代码减少的执行时间为第二节约时间。
当第一节约时间大于第二节约时间时,提取第一执行逻辑将更能减少待执行代码所需的执行时间,故,代码处理设备提取第二分支代码和第三分支代码中的第一执行逻辑,以执行图6所示实施例的步骤604。
关于节约时间的计算,具体来说,可以使用代码解析器(parser)或者基因测序算法(gene sequencing algorithm)分析每一对分支代码的可取同的执行逻辑,并可使用时间计算模型估算出这些执行逻辑所需的执行时间。具体地,可根据具体的硬件参数,比如GPU运行的频率,显存的读写时间等等,对该执行时间进行计算。该执行时间也是跨层取同可节省的时间,即一对分支代码的节约时间。
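下面给出一个简化的时间计算模型草图(C语言;将节约时间折算为GPU周期数,参数的划分方式与取值均为此处的假设,实际可按GPU运行频率、显存读写时间等硬件参数标定):

```c
/* 示意:一对分支代码的节约时间 ≈ 可取同的普通指令数×平均每条指令的周期数
   + 可取同的访存指令数×显存读写延迟(单位:周期) */
long estimate_saving_cycles(int alu_insn_count, int avg_cycles_per_insn,
                            int mem_insn_count, int mem_latency_cycles) {
    return (long)alu_insn_count * avg_cycles_per_insn
         + (long)mem_insn_count * mem_latency_cycles;
}
```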
可以理解,在上述示例一的方案中,当第一节约时间大于第二节约时间时,还可以判断第二分支代码和第三分支代码的开销时间是否小于第一节约时间,若小于,则提取第二分支代码和第三分支代码中的第一执行逻辑,否则不提取第二分支代码和第三分支代码中的第一执行逻辑,进一步地,因第二节约时间小于第一节约时间,从而还可以不提取第二分支代码和第四分支代码包括的第二执行逻辑。
可以理解,在上述示例一的方案中,当对第二分支代码和第三分支代码执行跨层取同后,有可能影响到第二分支代码和第四分支代码的第二节约时间,例如,当第一执行逻辑和第二执行逻辑有部分代码重叠时,从第二分支代码中提取了第一执行逻辑后,第二分支代码和第四分支代码包括的相同的第二执行逻辑变短,从而第二节约时间会受影响,若后续需要使用准确的第二节约时间,则对第二分支代码提取了第一执行逻辑后,可以重新计算第二节约时间。
可以理解,前述示例一示出了两对分支代码的节约时间的比较,该示例一的方法还可以应用于多对分支代码的场景中,或者循环使用示例一的方案循环选择节约时间最大的一对分支代码以执行跨层取同的方案。例如如下的示例二、三所示。
示例二:在待执行代码包括多对分支代码时,若该多对分支代码为互斥的分支代码,且每一对分支代码中的不同分支代码包括相同的执行逻辑,即该每一对分支代码中的不同分支代码与图6所示实施例的第二分支代码和第三分支代码相似,则可先计算该多对分支代码的节约时间,一对分支代码的节约时间为该对分支代码中的不同分支代码包括的相同的执行逻辑所需的执行时间。然后,针对节约时间最长的一对分支代码执行图6所示实施例的方法,即执行对第二分支代码和第三分支代码所执行的方法。这样,即可对该多对分支代码中的提取执行逻辑后能减少最多执行时间的两条分支代码执行上述图6所示的方法。
示例三:在上述的示例二的方法之后,若提取执行逻辑后,影响剩余的其它对的分支代码的节约时间,则计算受影响的其它对分支代码的节约时间,然后再在剩余的其它对分支代码中确定节约时间最长的一对分支代码以执行图6所示的方法,以此类推,构成循环过程。
可选地,在上述的示例二和三中,针对节约时间最长的一对分支代码,可以先判断该对分支代码的开销时间是否小于该对分支代码的节约时间,若该对分支代码的开销时间小于该对分支代码的节约时间,则对该对分支代码执行图6所示实施例的跨层取同的方法,否则不对该对分支代码执行跨层取同,进一步地,还可以不对其它对节约时间较小的分支代码执行跨层取同。
可选地,在上述的示例二和三中,针对节约时间最长的一对分支代码执行跨层取同后,若其它对分支代码包括该对跨层取同后的分支代码的其中之一的分支代码,则可以重新计算包括提取了执行逻辑后的分支代码的一对分支代码的节约时间,然后,再在未跨层取同的多对分支代码中确定节约时间最长的一对分支代码,以执行跨层取同。
在本发明的一些实施例中,上述示例一至示例三还可以进一步对线程进行调整,例如,在示例一中,当第一节约时间大于第二节约时间时,将执行第二分支代码的线程和执行第三分支代码的线程设置在同一线程束中。或者,在示例二中,确定节约时间最长的一对分支代码,则将执行该节约时间最长的一对分支代码的一对线程设置在同一线程束中;或者,在示例三中,在循环过程中,每确定一次节约时间最长的一对分支代码,则可得到支路排列方案,在该支路排列方案中将执行该对分支代码的一对线程设置在同一线程束中,且节约时间长的一对分支代码对应的一对线程先位于同一线程束中,然后节约时间短的一对分支代码对应的一对线程再位于同一线程束中。在循环过程结束后,根据该支路排列方案,根据节约时间由长至短的顺序,将执行被跨层取同的一对分支代码的线程设于同一线程束中。
可选地,上述使用节约时间大的一对分支代码执行跨层取同方案的实施例中,即示例一至三,可以使用上述的后置取同的跨层取同方案。因为,在后置取同方案中,生成的不同分支代码的目标条件判断式基本都是相同的样式,从而可以认为不同对的分支代码的开销时间相等,此时,节约时间最长的一对分支代码执行后置取同方式的跨层取同后,能为代码减少最多的执行时间。根据节约时间由长至短的顺序,对不同对分支代码执行跨层取同,使得减少代码所需执行时间最多的一对分支代码优先被执行跨层取同。若本实施例的代码处理方法还包括TDR的处理方案,则使减少代码所需执行时间最多的一对分支代码对应的一对线程优先在TDR时相邻排列。
可以理解,在本发明的其它实施例中,若第三分支代码和第四分支代码包括相同的第三执行逻辑,则可以先将第三分支代码和第四分支代码包括的第三执行逻辑提取到第三分支代码和第四分支代码之外后,再执行步骤602。例如,当位于同一if语句下的两条分支代码的条件执行语句分别包括相同的执行逻辑时,将该相同的执行逻辑提取到该位于同一if语句下的两条分支代码之外。
综上所述,本发明实施例提供的代码处理方法中,获取待执行代码,其中,待执行代码包括第一分支代码和第二分支代码,第一分支代码包括第三分支代码和第四分支代码,第二分支代码和第三分支代码包括相同的第一执行逻辑,第四分支代码不包括第一执行逻辑。因第二分支代码和第三分支代码包括相同的第一执行逻辑,从而该第一执行逻辑在待执行代码中包括两份,为冗余代码。为此,获取目标条件判断式,和提取第二分支代码和第三分支代码中的第一执行逻辑,然后,使用目标条件判断式和第一执行逻辑生成第五分支代码,该目标条件判断式用于在第五分支代码中控制第一执行逻辑的执行。因第一执行逻辑的提取,在第二分支代码和第三分支代码中不包括第一执行逻辑,而在生成的第五分支代码中包括该第一执行逻辑,使得在该待执行代码中保留了第一执行逻辑,且在该待执行代码中第一执行逻辑的数量由两份减少为一份,减少了待执行代码中的冗余代码。
上文对本发明实施例的代码处理方法进行了详细介绍,为了对本发明实施例的代码处理方法有更直观的理解,下文将对本发明实施例的代码处理方法结合具体的应用场景进行说明。
下面以待执行代码为GPU程序的内核代码,分支代码为if语句的分支代码为例对上述各实施例的代码处理方法进行说明,其中,以执行逻辑为指令作为示例。
参阅图8,该代码处理方法针对的场景是GPU代码的编译执行。用户写好的GPU代码(包括主机端代码和内核代码)先经过本发明实施例的方法处理,然后再如常进行编译执行。该方法可应用于BATCH等单元上,该BATCH包括同层取同模块、跨层取同模块和TDR模块。参考上文的内容和图8,本发明实施例的代码处理方法的流程要点为:
1)首先,同层取同模块会对第一内核代码(图8所示的内核代码)进行针对所有位于同一if语句的两分支代码的代码取同,即提取该两分支代码包括的相同的指令,从而产生第二内核代码。
2)接着,跨层取同模块会分析第二内核代码中位于不同if语句下的分支代码间两两存在的取同机会,若位于不同if语句的两条分支代码包括相同的指令,则从该两条支路中提取该相同的指令,执行跨层取同。提取了该相同的指令后,还使用该相同的指令生成新的分支,从而得到第三内核代码。每完成一对分支代码的跨层取同,跨层取同模块会记录这一对分支代码需要在TDR支路排列当中邻近。因此,跨层取同模块除了生成第三内核代码之外,还会得到TDR支路排列方案。
3)然后,TDR模块会根据跨层取同模块得到的支路排列方案,完成TDR。视乎具体的TDR手段,该模块会对数据、主机端代码或者第三内核代码进行必要的修改。
4)最后,经过BATCH处理的代码将进行编译,并在GPU设备上执行。
关于图8所示实施例的代码处理方法,详细内容如下:
参考图9,本发明实施例的代码处理方法包括:
步骤901:获取GPU程序的内核代码。
代码处理设备获取用户写好的GPU程序的内核代码。
该内核代码包括至少两条分支代码,对这些分支代码的执行将产生支路,换言之该内核代码包括至少两条支路。其中,内核代码包括的分支代码为if语句的分支代码。这些分支代码可以位于同一if语句下,也可以位于不同的if语句下,这些分支代码所属的if语句可以为嵌套的if语句,也可以不是嵌套的if语句。
例如,如图7a所示,图7a示出了内核代码中的部分代码(代码段a),该代码段a包括多条分支代码,图中标示出了4条支路。
GPU的内核代码先经过传统的代码取同处理,在嵌套分支中由内而外,逐次将位于同一if语句下的分支代码里相同的指令提取到该位于同一if语句下的分支代码之外,具体使用图3所示的方法。具体来说,位于同一if语句下的两条分支代码可以为:一条分支代码是该if语句的条件判断式满足时执行的分支代码,另一条分支代码为该if语句的条件判断式不满足时执行的分支代码(例如else子句的支路)。此时,该if语句也可称之为if-else语句。当该两条分支代码的条件执行语句分别包括相同的指令时,则将该相同的指令提取到该两条分支代码之外。若一if语句还嵌套了一层或多层if语句,当在嵌套的if语句中,位于同一if语句下的两条分支代码的条件执行语句也包括相同的指令时,也可将该相同的指令提取到该两条分支代码之外,且可以由内而外地逐次将位于同一if语句下的两分支代码里的相同指令提取到两分支代码之外。该指令提取方式可称为同层取同。其中,同层取同可以不生成新分支,因为被取同的两条支路位于同一if语句下,执行这两条支路的线程在该if语句前后一起执行代码,因此只需把相同指令放到该if语句之外即可。
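同层取同可用下面的C语言草图示意(示例代码为此处虚构,仅用于表示“位于同一if语句下的两条分支代码包括相同的指令s += v时,可将其提取到if语句之外,且无需生成新分支”):

```c
/* 同层取同前:if-else两条分支代码均以相同指令 s += v 开头 */
int before_hoist(int cond, int v) {
    int s = 0;
    if (cond) { s += v; s *= 2; }   /* 支路1 */
    else      { s += v; s -= 1; }   /* 支路2 */
    return s;
}

/* 同层取同后:相同指令被提取到if语句之外,两条支路均变短 */
int after_hoist(int cond, int v) {
    int s = 0;
    s += v;                         /* 被提取的相同指令 */
    if (cond) { s *= 2; }
    else      { s -= 1; }
    return s;
}
```

执行这两条支路的线程在该if语句前后一起执行代码,因此提取后语义不变。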
步骤902:对位于不同if语句下的分支代码进行配对,得到至少两个分支组。
其中,每一分支组包括两条位于不同if语句下的分支代码,位于不同if语句下的分支代码表示不同的分支代码直接所属的if语句不同,分支代码直接所属的if语句为最内层的if语句。
步骤903:从至少两个分支组中确定至少两个目标分支组,并计算每一目标分支组的节约时间。
目标分支组属于步骤902中的至少两个分支组,但是目标分支组的两条分支代码包括相同的目标指令,目标分支组的该目标指令所需的执行时间为该目标分支组的节约时间。
目标分支组包括的两条分支代码可以看做为上述图6所示实施例的第二分支代码和第三分支代码,只是在本实施例中第一执行逻辑为目标指令。
例如,步骤902和步骤903的实现方式为:将位于不同的if语句下的分支代码两两配对,得到多组分支组。然后,确定出目标分支组,并对目标分支组使用传统的解析器分析可取同的指令数,可据此估算出这些指令所需的执行时间,该执行时间也是执行跨层取同后可节省的节约时间。
在目标分支组包括至少两组时,可执行循环过程,以对每一目标分支组进行跨层取同。具体过程如下:
步骤904:从至少两个目标分支组中,确定节约时间最长的取同分支组。
在循环过程的开头,确定节约时间最长的取同分支组。并且,可以在TDR支路排列方案中标记该取同分支组的两条分支代码邻近。取同分支组为该至少两个目标分支组中的一分支组,且在该至少两个目标分支组中,取同分支组的节约时间最长。
步骤905:判断取同分支组的开销时间是否小于取同分支组的节约时间,若取同分支组的开销时间小于取同分支组的节约时间,则执行步骤906。
确定了取同分支组后,代码处理设备可以判断取同分支组的开销时间是否小于取同分支组的节约时间。若取同分支组的开销时间小于取同分支组的节约时间,则执行步骤906,实施跨层取同,若取同分支组的开销时间不小于取同分支组的节约时间,取同分支组的跨层取同不会实行,且本发明实施例的方法跳出循环过程,剩下的目标分支组不进行跨层取同。因为剩下的目标分支组的节约时间为递减的,难以大于开销时间。这在不同目标分支组的开销时间相等时,尤其适用,例如在后置取同方案中,因为在后置取同的方案中,新生成的分支代码的目标条件判断式基本都是相同的样式,从而可以认为不同目标分支组的开销时间相等。此时,节约时间越长的目标分支组进行跨层取同后,内核代码所需的执行时间越少。
其中,取同分支组的开销时间表示取同分支组的目标条件判断式产生的执行时间,取同分支组的节约时间表示取同分支组的目标指令所需的执行时间。
若提取取同分支组的两条分支代码的目标指令,则后续要根据该目标指令再生成新的分支代码,该新的分支代码为if语句,该新的分支代码包括目标条件判断式,以用于控制目标指令的执行。该新的分支代码的目标条件判断式会使得GPU在执行内核代码时产生执行时间,该执行时间即为取同分支组的开销时间。
目标条件判断式产生的开销时间主要来源于两个方面:
一方面是:执行目标条件判断式需要执行时间,代码长度反映了代码所需的执行时间的长短,从而可以判断所述目标条件判断式的代码长度是否小于取同分支组的目标指令的代码长度,若所述目标条件判断式的代码长度小于取同分支组的目标指令的代码长度,则表示取同分支组的开销时间小于取同分支组的节约时间,可以执行跨层取同,否则不执行跨层取同。
另一方面是:目标条件判断式可能使用寄存器,从而增加内核代码使用的寄存器,若目标条件判断式使用的寄存器数量超出寄存器数量阈值,则会减少每个GPU上CU/SM上的workgroup数目,造成性能损失。从而可以判断目标条件判断式使用的寄存器数量是否小于预设寄存器数量阈值,若目标条件判断式使用的寄存器数量小于预设寄存器数量阈值,则表示取同分支组的开销时间小于取同分支组的节约时间。
代码处理设备可以在获取到目标条件判断式后,使用该目标条件判断式计算出开销时间,该目标条件判断式的获取可以参考上述实施例的前置取同和后置取同等部分的详细描述。不过,为了简化计算过程,也可以使用其它判断方式来替代直接使用开销时间和节约时间进行比较,该其它判断方式得出的判断结果可以用于表示取同分支组的开销时间是否小于取同分支组的节约时间。
换言之,步骤905的具体实现方式有多种,例如,计算出取同分支组的开销时间,然后使用取同分支组的开销时间和取同分支组的节约时间进行大小的比较,或者,取同分支组的目标条件判断式的代码长度和目标指令的代码长度进行比较,或者,使用目标条件判断式使用的寄存器数量和预设寄存器数量阈值进行比较,具体的实现方式可参阅上文的描述。
步骤906:对取同分支组进行跨层取同。
跨层取同为上述步骤602至步骤604的实现方案,具体来说,获取目标条件判断式,提取取同分支组的两条分支代码中的目标指令,然后,使用目标条件判断式和目标指令生成新的分支代码,该新的分支代码为if语句的分支代码。
跨层取同包括前置取同和后置取同两种实现方式。具体的实现过程可参阅上文的“例一、后置取同”和“例二、前置取同”的详细描述。
其中,前置取同中须将取同分支组的两条分支代码的条件判断式复制出来并合并在一起,后置取同中可在取同分支组的两条分支代码的目标指令所在的位置处设置标志符,例如flag,然后在新生成的分支代码中,由该标识符控制是否执行取同出来的目标指令。
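后置取同可用下面的C语言草图示意(示例代码为此处虚构;两条互斥的分支代码位于不同的if语句下,均包括目标指令s += x * x,提取后以flag标志位构成目标条件判断式,生成新的分支代码):

```c
/* 后置取同前:两条位于不同if语句下的互斥分支代码均包括目标指令 s += x*x */
int before_postmerge(int c1, int c2, int x) {
    int s = 0;
    if (c1)        { s += 1; s += x * x; }
    if (!c1 && c2) { s -= 1; s += x * x; }
    return s;
}

/* 后置取同后:在目标指令原来所在的位置设置flag标志位,
   并在原分支代码之后,由flag控制是否执行取同出来的目标指令 */
int after_postmerge(int c1, int c2, int x) {
    int s = 0, flag = 0;
    if (c1)        { s += 1; flag = 1; }
    if (!c1 && c2) { s -= 1; flag = 1; }
    if (flag)      { s += x * x; }      /* 新生成的分支代码 */
    return s;
}
```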
步骤907:计算包括取同分支组的其中一条分支代码的目标分支组的节约时间。
若开销时间小于节约时间,则执行步骤906后,取同分支组的两条分支代码的目标指令被提取,从而取同分支组的两条分支代码和其它分支代码组成的目标分支组的可取同指令或会受到影响,因此,若一目标分支组包括取同分支组的一分支代码,则该目标分支组的节约时间需要重新计算,以使节约时间更加准确。然后,回到循环过程的开头,重新执行步骤904,此时该至少两个目标分支组不包括前述的取同分支组,且包括被提取目标指令的分支代码的目标分支组的节约时间已被重新计算。从剩下的目标分支组中,确定节约时间最长的另一取同分支组,以使用该另一取同分支组执行循环过程。例如,确定节约时间最长的另一取同分支组后,判断该另一取同分支组的开销时间是否小于该另一取同分支组的节约时间,若该另一取同分支组的开销时间小于该另一取同分支组的节约时间,则对该另一取同分支组进行跨层取同,对该另一取同分支组进行跨层取同后,生成另一新的分支代码,该另一新的分支代码也位于内核代码的原分支代码之外。即对一取同分支组进行跨层取同,则生成一新的分支代码。
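步骤904至步骤907的循环过程可用下面的C语言草图示意(数据结构与函数名为此处的假设;节约时间的重新计算等细节从略):

```c
/* 一对可取同的分支代码(目标分支组)的节约时间、开销时间及取同标记 */
typedef struct { long saving; long cost; int merged; } BranchPair;

int merge_loop(BranchPair *pairs, int n) {
    int merged_count = 0;
    for (;;) {
        int best = -1;
        for (int i = 0; i < n; i++)        /* 步骤904:取节约时间最长者 */
            if (!pairs[i].merged && (best < 0 || pairs[i].saving > pairs[best].saving))
                best = i;
        if (best < 0) break;               /* 全部处理完毕 */
        if (pairs[best].cost >= pairs[best].saving)
            break;                         /* 步骤905:跳出循环,剩余分支组不再取同 */
        pairs[best].merged = 1;            /* 步骤906:对取同分支组进行跨层取同 */
        merged_count++;
        /* 步骤907:若受影响,可在此重新计算相关目标分支组的节约时间(从略) */
    }
    return merged_count;
}
```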
通过上述步骤904-步骤907的循环过程的执行,TDR支路排列方案也逐步得到,例如,在执行步骤904时,确定节约时间最长的取同分支组,则可以在支路排列方案中标记在TDR时该取同分支组的两条分支代码相邻排列。若上述的循环过程执行多次,每次确定节约时间最长的取同分支组,则在支路排列方案中标记在TDR时,将上一循环的取同分支组的两条分支代码相邻排列后,将当前循环的取同分支组的两条分支代码相邻排列。如此随着循环过程的结束,得到支路排列方案。然后,可根据该支路排列方案对内核代码的原分支代码进行TDR,使得执行取同分支组的两条分支代码的线程设置在同一线程束中,或者,先将执行节约时间长的取同分支组的两条分支代码的两线程设置在同一线程束中,再将执行节约时间短的取同分支组的两条分支代码的两线程设置在同一线程束中。可以理解,在TDR时,先考虑将执行相同分支代码的线程设于同一线程束中,若同一线程束中包括执行不同分支代码的不同线程,则该不同线程为执行取同分支组的分支代码的线程。
通过上述步骤904-步骤907的执行,内核代码包括原分支代码和新分支代码,新分支代码为使用目标条件判断式和目标指令生成的新的分支代码,原分支代码为内核代码在跨层取同前即包括的分支代码,其中,原分支代码被执行跨层取同后,不包括被提取的目标指令。跨层取同在原分支代码外引入的新分支代码不必额外进行TDR,对原分支代码进行TDR后,新的分支代码也不再有额外的分歧。
可以理解,在本发明其它的实施例中,还可以是根据节约时间由长至短的顺序从前述的至少两个目标分支组中确定出不同的目标分支组,然后,在对内核代码进行TDR时根据节约时间由长至短的顺序将不同的目标分支组的两条分支代码相邻排列。跟着,再对该不同的目标分支组的两条分支代码执行跨层取同。
对内核代码的原分支代码进行TDR后,若线程束产生“边界问题”,即执行不同分支代码的线程位于同一线程束中,则本实施例可将执行被跨层取同的目标分支组的两线程设于同一线程束中,因目标指令被提取,从而该两线程需串行执行的代码段减少。
本发明实施例的方法综合运用了TDR和代码取同,实现了位于不同的if语句下的分支代码的跨层取同,并利用TDR使得新生成的分支代码不会引入新的分歧。由跨层取同来指导TDR中的支路排列,使得执行被跨层取同的两分支代码的两线程有可能位于同一线程束中,且该两分支代码相同的目标指令被提取出该两分支代码,减少了该两线程执行该两分支代码的时间,缓解了TDR的“边界问题”。若先让节约时间长的两条分支代码进行相邻排列,再对节约时间短的两条分支代码进行相邻排列,则TDR后若出现“边界问题”,节约时间长的两条分支代码比节约时间短的两条分支代码更可能出现在边界上,从而对两条分支代码进行的跨层取同能最大化地减少线程束需串行执行的目标指令的长度。
下面将举例说明本发明实施例的代码处理方法为何可节省较多的时间。
参阅图10,使用图4和图7a至图7d的示例,代码段a包括四条支路,该4条支路可称之为主分支,其中支路1和支路3包括相同的代码段u,假设该代码段u所需的执行时间为1ms(毫秒)。其中,图10所示的1、2、3和4是支路的编号。
1)当不对该四条支路进行优化时,代码段u需要被执行5次,耗时5ms。其中图4中最左侧的线程束既含有支路1的线程也含有支路3的线程,因此需要串行执行两条支路,代码段u也就被这个线程束执行了两次。
2)当对该四条支路执行跨层取同但不进行TDR时,支路1和支路3的线程会合并在一起,在主分支之外执行代码段u,因此代码段u的执行次数节省了一次,即代码段u需要被执行4次,耗时4ms。
3)当对该四条支路进行TDR时,若将各支路按照支路1、支路2、支路3和支路4的顺序排列,各线程束的支路判断情况会相应地改变,执行代码段u将耗时4ms。
4)当对该四条支路使用图9所示实施例的方法时,会提取支路1和支路3的代码段u,并且得到将支路1和支路3相邻排列的支路排列方案,根据该支路排列方案进行TDR,例如,将各支路按照支路1、支路3、支路2和支路4的顺序进行TDR,得到的结果如图10所示,此时,需要执行代码段u的线程被集中在前两个线程束中,执行代码段u的耗时仅为2ms。其中,TDR可以看作一种排序,相同支路的线程自然归在一个线程束,图9所示实施例的方法通过支路排序信息进一步决定不同支路间的相邻关系。
可见,使用图9所示实施例的代码处理方法,在TDR时,将跨层取同的支路安排在邻近位置,可以让两组线程横跨最少的线程束,从而节省更多的运算时间。
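上述收益可用下面的C语言模拟草图示意(假设线程束大小为4、共16个线程,支路编号与排列均为此处虚构的示例,与图10的具体取值无关):统计含有支路1或支路3的线程的线程束数量,即跨层取同后代码段u被执行的次数。

```c
#define WARP_SIZE 4

/* 统计含有支路1或支路3线程的线程束数量:
   跨层取同后,这些线程束各需执行一次取同出来的代码段u */
int count_u_executions(const int *branch_of_thread, int nthreads) {
    int count = 0;
    for (int w = 0; w < nthreads; w += WARP_SIZE) {
        int has_u = 0;
        for (int t = w; t < w + WARP_SIZE && t < nthreads; t++)
            if (branch_of_thread[t] == 1 || branch_of_thread[t] == 3)
                has_u = 1;
        count += has_u;
    }
    return count;
}
```

交错排列时,每个线程束都含有支路1或支路3的线程;按支路排列方案将支路1和支路3相邻排列后,这些线程集中在少数线程束中,代码段u的执行次数随之减少。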
图11为本发明实施例提供的一种代码处理设备的硬件结构示意图。该代码处理设备可用于执行上述图6、图8和图9所示的各实施例的代码处理方法。在有的实施例中,该代码处理设备也叫计算设备。
如图11所示,该代码处理设备包括中央处理器(Central Processing Unit,CPU)、GPU、存储器和总线。该CPU、GPU和存储器通过总线进行通信连接。
该代码处理设备可以是智能电话(smart phone)、个人数字助理(personal digital assistant,PDA)电脑、平板型电脑、膝上型电脑(laptop computer)、计算机、服务器等类型。本发明实施例以代码处理设备为计算机为例进行说明。
CPU是一集成电路,是一台计算机的运算核心(Core)和控制核心(Control Unit),其功能主要是解释计算机指令以及处理计算机软件中的数据。CPU是计算机的控制中心,利用各种接口和线路连接整个计算机的各个部分,通过运行或执行存储在存储器内的软件程序和/或模块,以及调用存储在存储器内的数据,执行计算机的各种功能和处理数据,从而对计算机进行整体监控。CPU可以实现或执行结合本申请公开内容所描述的各种示例性的逻辑方框、模块和电路。CPU也可以是实现计算功能的组合,例如包含一个或多个微处理器的组合、DSP和微处理器的组合等等。
GPU的内容可参考上述技术术语介绍部分对GPU的详细描述。
存储器可用于存储软件程序以及模块,CPU通过运行存储在存储器的软件程序以及模块,从而执行计算机的各种功能应用以及数据处理。存储器可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序等;存储数据区可存储根据计算机的使用所创建的数据等。此外,存储器可以包括易失性存储器,例如随机存取存储器(random access memory,RAM)、非挥发性动态随机存取内存(Nonvolatile Random Access Memory,NVRAM)、相变化随机存取内存(Phase Change RAM,PRAM)、磁阻式随机存取内存(Magnetoresistive RAM,MRAM)等,还可以包括非易失性存储器,例如至少一个磁盘存储器件、只读存储器(read-only memory,ROM)、电子可擦除可编程只读存储器(Electrically Erasable Programmable Read-Only Memory,EEPROM)、闪存器件,例如反或闪存(NOR flash memory)或是反与闪存(NAND flash memory)、半导体器件,例如固态硬盘(Solid State Disk,SSD)等。
总线(Bus)是计算机各种功能部件之间传送信息的公共通信干线,它是由导线组成的传输线束,按照计算机所传输的信息种类,计算机的总线可以划分为数据总线、地址总线和控制总线,分别用来传输数据、数据地址和控制信号。计算机的各个部件通过总线相连接,外部设备通过相应的接口电路再与总线相连接,从而形成了计算机硬件系统。在计算机系统中,各个部件之间传送信息的公共通路叫总线,微型计算机通过总线结构来连接各个功能部件。
需要说明的是,尽管未示出,计算机还可以包括输入/输出(input/output,I/O)接口、显示器等输出单元、键盘等输入单元、以及其它单元,在此不予赘述。
在本发明实施例中,CPU可以设置用于:获取待执行代码,其中,待执行代码包括第一分支代码和第二分支代码,第一分支代码包括第三分支代码和第四分支代码,第二分支代码和第三分支代码包括相同的第一执行逻辑,第四分支代码不包括第一执行逻辑;获取目标条件判断式;提取第二分支代码和第三分支代码中的第一执行逻辑;使用目标条件判断式和第一执行逻辑生成第五分支代码,目标条件判断式用于在第五分支代码中控制第一执行逻辑的执行。
可选地,CPU还可以设置用于:提取第二分支代码和第三分支代码中的第一执行逻辑之前,判断开销时间是否小于第一节约时间,开销时间表示目标条件判断式产生的执行时间,第一节约时间表示执行第一执行逻辑所需的时间;若开销时间小于第一节约时间,则执行提取第二分支代码和第三分支代码中的第一执行逻辑的步骤。
可选地,CPU还可以设置用于:计算目标条件判断式的代码长度和第一执行逻辑的代码长度;判断目标条件判断式的代码长度是否小于第一执行逻辑的代码长度,其中,目标条件判断式的代码长度小于第一执行逻辑的代码长度表示开销时间小于第一节约时间。
可选地,CPU还可以设置用于:计算目标条件判断式使用的寄存器数量;判断目标条件判断式使用的寄存器数量是否小于预设寄存器数量阈值,其中,目标条件判断式使用的寄存器数量小于预设寄存器数量阈值表示开销时间小于第一节约时间。
可选地,CPU还可以设置用于:在第二分支代码和第三分支代码中的第一执行逻辑所在的位置处,分别设置标识符,标识符用于当标识符所在的分支代码中的控制第一执行逻辑执行的条件判断式满足时设定特定值;使用第二分支代码的标识符和第三分支代码的标识符生成目标条件判断式;在第一分支代码和第二分支代码之后,使用目标条件判断式和第一执行逻辑生成第五分支代码,目标条件判断式用于当目标条件判断式的标识符满足特定值时,控制执行第一执行逻辑。
可选地,CPU还可以设置用于:合并第一条件判断式和第二条件判断式,得到目标条件判断式,其中,第一条件判断式用于在第二分支代码中控制第一执行逻辑的执行,第二条件判断式用于在第三分支代码中控制第一执行逻辑的执行;在第一分支代码和第二分支代码之前,使用目标条件判断式和第一执行逻辑生成第五分支代码。
可选地,CPU还可以设置用于:将执行第二分支代码的线程和执行第三分支代码的线程设置在同一线程束中。
可选地,第二分支代码和第四分支代码包括相同的第二执行逻辑,CPU还可以设置用于:提取第二分支代码和第三分支代码中的第一执行逻辑之前,计算第一节约时间和第二节约时间,第一节约时间表示执行第一执行逻辑所需的时间,第二节约时间表示执行第二执行逻辑所需的时间;当第一节约时间大于第二节约时间时,提取第二分支代码和第三分支代码中的第一执行逻辑。
综上所述,CPU获取待执行代码,其中,待执行代码包括第一分支代码和第二分支代码,第一分支代码包括第三分支代码和第四分支代码,第二分支代码和第三分支代码包括相同的第一执行逻辑,第四分支代码不包括第一执行逻辑。因第二分支代码和第三分支代码包括相同的第一执行逻辑,从而该第一执行逻辑在待执行代码中包括两份,为冗余代码。为此,CPU获取目标条件判断式,和CPU提取第二分支代码和第三分支代码中的第一执行逻辑,然后,CPU使用目标条件判断式和第一执行逻辑生成第五分支代码,该目标条件判断式用于在第五分支代码中控制第一执行逻辑的执行。因第一执行逻辑的提取,在第二分支代码和第三分支代码中不包括第一执行逻辑,而在生成的第五分支代码中包括该第一执行逻辑,使得在该待执行代码中保留了第一执行逻辑,且在该待执行代码中第一执行逻辑的数量由两份减少为一份,减少了待执行代码中的冗余代码。
图12为本发明实施例提供的一种代码处理设备的结构示意图。图12所示的代码处理设备可用于执行上述图6、图8和图9所示的各实施例的代码处理方法,该代码处理设备可集成在图11所示的代码处理设备上。
参阅图12,本发明实施例的代码处理设备,包括:
第一获取单元1201,用于获取待执行代码,待执行代码包括第一分支代码和第二分支代码,第一分支代码包括第三分支代码和第四分支代码,第二分支代码和第三分支代码包括相同的第一执行逻辑,第四分支代码不包括第一执行逻辑;
第二获取单元1202,用于获取目标条件判断式;
提取单元1203,用于提取第二分支代码和第三分支代码中的第一执行逻辑;
生成单元1204,用于使用目标条件判断式和第一执行逻辑生成第五分支代码,目标条件判断式用于在第五分支代码中控制第一执行逻辑的执行。
可选地,本发明实施例的代码处理设备还包括判断单元1205;
判断单元1205,用于判断开销时间是否小于第一节约时间,开销时间表示目标条件判断式产生的执行时间,第一节约时间表示执行第一执行逻辑所需的时间;
提取单元1203,还用于若开销时间小于第一节约时间,则执行提取第二分支代码和第三分支代码中的第一执行逻辑的步骤。
可选地,判断单元1205包括计算模块1206和判断模块1207;
计算模块1206,用于计算目标条件判断式的代码长度和第一执行逻辑的代码长度;
判断模块1207,用于判断目标条件判断式的代码长度是否小于第一执行逻辑的代码长度,其中,目标条件判断式的代码长度小于第一执行逻辑的代码长度表示开销时间小于第一节约时间。
可选地,判断单元1205包括计算模块1206和判断模块1207;
计算模块1206,用于计算目标条件判断式使用的寄存器数量;
判断模块1207,用于判断目标条件判断式使用的寄存器数量是否小于预设寄存器数量阈值,其中,目标条件判断式使用的寄存器数量小于预设寄存器数量阈值表示开销时间小于第一节约时间。
可选地,第二获取单元1202包括设置模块1208和生成模块1209;
设置模块1208,用于在第二分支代码和第三分支代码中的第一执行逻辑所在的位置处,分别设置标识符,标识符用于当标识符所在的分支代码中的控制第一执行逻辑执行的条件判断式满足时设定特定值;
生成模块1209,用于使用第二分支代码的标识符和第三分支代码的标识符生成目标条件判断式;
生成单元1204,还用于在第一分支代码和第二分支代码之后,使用目标条件判断式和第一执行逻辑生成第五分支代码,目标条件判断式用于当目标条件判断式的标识符满足特定值时,控制执行第一执行逻辑。
可选地,第二获取单元1202包括合并模块1210;
合并模块1210,用于合并第一条件判断式和第二条件判断式,得到目标条件判断式,其中,第一条件判断式用于在第二分支代码中控制第一执行逻辑的执行,第二条件判断式用于在第三分支代码中控制第一执行逻辑的执行;
生成单元1204,还用于在第一分支代码和第二分支代码之前,使用目标条件判断式和第一执行逻辑生成第五分支代码。
可选地,本发明实施例的代码处理设备还包括设置单元1211;
设置单元1211,用于将执行第二分支代码的线程和执行第三分支代码的线程设置在同一线程束中。
可选地,第二分支代码和第四分支代码包括相同的第二执行逻辑;本发明实施例的代码处理设备还包括计算单元1212;
计算单元1212,用于计算第一节约时间和第二节约时间,第一节约时间表示执行第一执行逻辑所需的时间,第二节约时间表示执行第二执行逻辑所需的时间;
提取单元1203,还用于当第一节约时间大于第二节约时间时,提取第二分支代码和第三分支代码中的第一执行逻辑。
可选地,待执行代码为图形处理器GPU程序的内核代码,第一分支代码、第二分支代码、第三分支代码、第四分支代码、和第五分支代码为if语句的分支代码。
综上所述,第一获取单元1201获取待执行代码,其中,待执行代码包括第一分支代码和第二分支代码,第一分支代码包括第三分支代码和第四分支代码,第二分支代码和第三分支代码包括相同的第一执行逻辑,第四分支代码不包括第一执行逻辑。因第二分支代码和第三分支代码包括相同的第一执行逻辑,从而该第一执行逻辑在待执行代码中包括两份,为冗余代码。为此,第二获取单元1202获取目标条件判断式,和提取单元1203提取第二分支代码和第三分支代码中的第一执行逻辑,然后,生成单元1204使用目标条件判断式和第一执行逻辑生成第五分支代码,该目标条件判断式用于在第五分支代码中控制第一执行逻辑的执行。因第一执行逻辑的提取,在第二分支代码和第三分支代码中不包括第一执行逻辑,而在生成的第五分支代码中包括该第一执行逻辑,使得在该待执行代码中保留了第一执行逻辑,且在该待执行代码中第一执行逻辑的数量由两份减少为一份,减少了待执行代码中的冗余代码。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))等。

Claims (21)

  1. 一种代码处理方法,其特征在于,包括:
    获取待执行代码,所述待执行代码包括第一分支代码和第二分支代码,所述第一分支代码包括第三分支代码和第四分支代码,所述第二分支代码和所述第三分支代码包括相同的第一执行逻辑,所述第四分支代码不包括所述第一执行逻辑;
    获取目标条件判断式;
    提取所述第二分支代码和所述第三分支代码中的所述第一执行逻辑;
    使用所述目标条件判断式和所述第一执行逻辑生成第五分支代码,所述目标条件判断式用于在所述第五分支代码中控制所述第一执行逻辑的执行。
  2. 根据权利要求1所述的方法,其特征在于,
    所述提取所述第二分支代码和所述第三分支代码中的所述第一执行逻辑之前,所述方法还包括:
    判断开销时间是否小于第一节约时间,所述开销时间表示所述目标条件判断式产生的执行时间,所述第一节约时间表示执行所述第一执行逻辑所需的时间;
    若开销时间小于第一节约时间,则执行所述提取所述第二分支代码和所述第三分支代码中的所述第一执行逻辑的步骤。
  3. 根据权利要求2所述的方法,其特征在于,
    所述判断开销时间是否小于第一节约时间,包括:
    计算所述目标条件判断式的代码长度和所述第一执行逻辑的代码长度;
    判断所述目标条件判断式的代码长度是否小于所述第一执行逻辑的代码长度,其中,所述目标条件判断式的代码长度小于所述第一执行逻辑的代码长度表示所述开销时间小于所述第一节约时间。
  4. 根据权利要求2所述的方法,其特征在于,
    所述判断开销时间是否小于第一节约时间,包括:
    计算所述目标条件判断式使用的寄存器数量;
    判断所述目标条件判断式使用的寄存器数量是否小于预设寄存器数量阈值,其中,所述目标条件判断式使用的寄存器数量小于所述预设寄存器数量阈值表示所述开销时间小于所述第一节约时间。
  5. 根据权利要求1-4任一项所述的方法,其特征在于,
    所述获取目标条件判断式,包括:
    在所述第二分支代码和所述第三分支代码中的所述第一执行逻辑所在的位置处,分别设置标识符,所述标识符用于当标识符所在的分支代码中的控制所述第一执行逻辑执行的条件判断式满足时设定特定值;
    使用所述第二分支代码的标识符和所述第三分支代码的标识符生成目标条件判断式;
    所述使用所述目标条件判断式和所述第一执行逻辑生成第五分支代码,包括:
    在所述第一分支代码和所述第二分支代码之后,使用所述目标条件判断式和所述第一执行逻辑生成第五分支代码,所述目标条件判断式用于当所述目标条件判断式的标识符满足所述特定值时,控制执行所述第一执行逻辑。
  6. 根据权利要求1-4任一项所述的方法,其特征在于,
    所述获取目标条件判断式,包括:
    合并第一条件判断式和第二条件判断式,得到目标条件判断式,其中,所述第一条件判断式用于在所述第二分支代码中控制所述第一执行逻辑的执行,所述第二条件判断式用于在所述第三分支代码中控制所述第一执行逻辑的执行;
    所述使用所述目标条件判断式和所述第一执行逻辑生成第五分支代码,包括:
    在所述第一分支代码和所述第二分支代码之前,使用所述目标条件判断式和所述第一执行逻辑生成第五分支代码。
  7. 根据权利要求1-6任一项所述的方法,其特征在于,
    所述方法还包括:
    将执行所述第二分支代码的线程和执行所述第三分支代码的线程设置在同一线程束中。
  8. 根据权利要求1-7任一项所述的方法,其特征在于,
    所述第二分支代码和所述第四分支代码包括相同的第二执行逻辑;
    所述提取所述第二分支代码和所述第三分支代码中的所述第一执行逻辑之前,所述方法还包括:
    计算第一节约时间和第二节约时间,所述第一节约时间表示执行所述第一执行逻辑所需的时间,所述第二节约时间表示执行所述第二执行逻辑所需的时间;
    所述提取所述第二分支代码和所述第三分支代码中的所述第一执行逻辑,包括:
    当所述第一节约时间大于所述第二节约时间时,提取所述第二分支代码和所述第三分支代码中的所述第一执行逻辑。
  9. 根据权利要求1-8任一项所述的方法,其特征在于,
    所述待执行代码为图形处理器GPU程序的内核代码,所述第一分支代码、所述第二分支代码、所述第三分支代码、所述第四分支代码、和所述第五分支代码为if语句的分支代码。
  10. 一种代码处理设备,其特征在于,包括:
    第一获取单元,用于获取待执行代码,所述待执行代码包括第一分支代码和第二分支代码,所述第一分支代码包括第三分支代码和第四分支代码,所述第二分支代码和所述第三分支代码包括相同的第一执行逻辑,所述第四分支代码不包括所述第一执行逻辑;
    第二获取单元,用于获取目标条件判断式;
    提取单元,用于提取所述第二分支代码和所述第三分支代码中的所述第一执行逻辑;
    生成单元,用于使用所述目标条件判断式和所述第一执行逻辑生成第五分支代码,所述目标条件判断式用于在所述第五分支代码中控制所述第一执行逻辑的执行。
  11. 根据权利要求10所述的设备,其特征在于,
    所述设备还包括判断单元;
    所述判断单元,用于判断开销时间是否小于第一节约时间,所述开销时间表示所述目标条件判断式产生的执行时间,所述第一节约时间表示执行所述第一执行逻辑所需的时间;
    所述提取单元,还用于若开销时间小于第一节约时间,则执行所述提取所述第二分支代码和所述第三分支代码中的所述第一执行逻辑的步骤。
  12. 根据权利要求11所述的设备,其特征在于,
    所述判断单元包括计算模块和判断模块;
    所述计算模块,用于计算所述目标条件判断式的代码长度和所述第一执行逻辑的代码长度;
    所述判断模块,用于判断所述目标条件判断式的代码长度是否小于所述第一执行逻辑的代码长度,其中,所述目标条件判断式的代码长度小于所述第一执行逻辑的代码长度表示所述开销时间小于所述第一节约时间。
  13. 根据权利要求11所述的设备,其特征在于,
    所述判断单元包括计算模块和判断模块;
    所述计算模块,用于计算所述目标条件判断式使用的寄存器数量;
    所述判断模块,用于判断所述目标条件判断式使用的寄存器数量是否小于预设寄存器数量阈值,其中,所述目标条件判断式使用的寄存器数量小于所述预设寄存器数量阈值表示所述开销时间小于所述第一节约时间。
  14. 根据权利要求10-13任一项所述的设备,其特征在于,
    所述第二获取单元包括设置模块和生成模块;
    所述设置模块,用于在所述第二分支代码和所述第三分支代码中的所述第一执行逻辑所在的位置处,分别设置标识符,所述标识符用于当标识符所在的分支代码中的控制所述第一执行逻辑执行的条件判断式满足时设定特定值;
    所述生成模块,用于使用所述第二分支代码的标识符和所述第三分支代码的标识符生成目标条件判断式;
    所述生成单元,还用于在所述第一分支代码和所述第二分支代码之后,使用所述目标条件判断式和所述第一执行逻辑生成第五分支代码,所述目标条件判断式用于当所述目标条件判断式的标识符满足所述特定值时,控制执行所述第一执行逻辑。
  15. 根据权利要求10-13任一项所述的设备,其特征在于,
    所述第二获取单元包括合并模块;
    所述合并模块,用于合并第一条件判断式和第二条件判断式,得到目标条件判断式,其中,所述第一条件判断式用于在所述第二分支代码中控制所述第一执行逻辑的执行,所述第二条件判断式用于在所述第三分支代码中控制所述第一执行逻辑的执行;
    所述生成单元,还用于在所述第一分支代码和所述第二分支代码之前,使用所述目标条件判断式和所述第一执行逻辑生成第五分支代码。
  16. 根据权利要求10-15任一项所述的设备,其特征在于,
    所述设备还包括设置单元;
    所述设置单元,用于将执行所述第二分支代码的线程和执行所述第三分支代码的线程设置在同一线程束中。
  17. 根据权利要求10-16任一项所述的设备,其特征在于,
    所述第二分支代码和所述第四分支代码包括相同的第二执行逻辑;
    所述设备还包括计算单元;
    所述计算单元,用于计算第一节约时间和第二节约时间,所述第一节约时间表示执行所述第一执行逻辑所需的时间,所述第二节约时间表示执行所述第二执行逻辑所需的时间;
    所述提取单元,还用于当所述第一节约时间大于所述第二节约时间时,提取所述第二分支代码和所述第三分支代码中的所述第一执行逻辑。
  18. 根据权利要求10-17任一项所述的设备,其特征在于,
    所述待执行代码为图形处理器GPU程序的内核代码,所述第一分支代码、所述第二分支代码、所述第三分支代码、所述第四分支代码、和所述第五分支代码为if语句的分支代码。
  19. 一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1-9任意一项所述的方法。
  20. 一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行如权利要求1-9任意一项所述的方法。
  21. 一种计算设备,其特征在于,包括:处理器、存储器和总线;
    所述存储器用于存储执行指令,所述处理器与所述存储器通过所述总线连接,当所述计算设备运行时,所述处理器执行所述存储器存储的所述执行指令,以使所述计算设备执行权利要求1-9任一项所述的方法。
PCT/CN2017/108003 2017-10-27 2017-10-27 代码处理方法和设备 WO2019080091A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201780094772.8A CN111095197B (zh) 2017-10-27 2017-10-27 代码处理方法和设备
PCT/CN2017/108003 WO2019080091A1 (zh) 2017-10-27 2017-10-27 代码处理方法和设备


Publications (1)

Publication Number Publication Date
WO2019080091A1 true WO2019080091A1 (zh) 2019-05-02

Family

ID=66246183


Country Status (2)

Country Link
CN (1) CN111095197B (zh)
WO (1) WO2019080091A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377269A (zh) * 2019-06-17 2019-10-25 平安科技(深圳)有限公司 业务审批系统配置化方法、装置及存储介质
CN111767038A (zh) * 2020-06-28 2020-10-13 烟台东方威思顿电气有限公司 一种脚本化的智能电表事件判断解析方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302533A (zh) * 2014-07-25 2016-02-03 腾讯科技(深圳)有限公司 代码同步方法和装置
CN106598690A (zh) * 2016-12-19 2017-04-26 广州视源电子科技股份有限公司 一种用于代码的管理方法及装置
CN107038058A (zh) * 2017-02-08 2017-08-11 阿里巴巴集团控股有限公司 一种代码处理方法及装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3651579B2 (ja) * 2000-04-06 2005-05-25 インターナショナル・ビジネス・マシーンズ・コーポレーション コンピュータシステム、プログラム変換方法および記憶媒体
CN103714511B (zh) * 2013-12-17 2017-01-18 华为技术有限公司 一种基于gpu的分支处理方法及装置
CN105760143B (zh) * 2014-12-15 2019-03-05 龙芯中科技术有限公司 图像处理执行代码的重构方法及装置


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377269A (zh) * 2019-06-17 2019-10-25 平安科技(深圳)有限公司 业务审批系统配置化方法、装置及存储介质
CN110377269B (zh) * 2019-06-17 2024-04-19 平安科技(深圳)有限公司 业务审批系统配置化方法、装置及存储介质
CN111767038A (zh) * 2020-06-28 2020-10-13 烟台东方威思顿电气有限公司 一种脚本化的智能电表事件判断解析方法
CN111767038B (zh) * 2020-06-28 2023-05-26 烟台东方威思顿电气有限公司 一种脚本化的智能电表事件判断解析方法

Also Published As

Publication number Publication date
CN111095197A (zh) 2020-05-01
CN111095197B (zh) 2021-10-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17929488

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17929488

Country of ref document: EP

Kind code of ref document: A1